
OpenAI vs self-hosted.

Where the math works, where it does not, and the four signals that tell you it is time to migrate.

The four signals

You should consider self-hosting when at least three of these are true:

  1. Your monthly hosted bill is ≥ $10K. Below that, the operational overhead of running your own GPUs is not worth it.
  2. Your traffic is consistent. If your weekday DAU is 10× your weekend DAU, hosted is more efficient: you are not paying for GPUs that sit idle the rest of the week.
  3. Open-weights models can hit your eval bar. Llama-3.3-70B, Mistral-Small-24B, Qwen-32B can match GPT-4.1-mini-class quality for many product categories. They cannot replace GPT-4.1 or Claude Opus 4.6 on hard reasoning yet.
  4. Latency or sovereignty matters more than convenience. Sub-50ms p95? Air-gapped deployment? Audit-ready data flow? Hosted APIs cannot deliver.

Where hosted still wins

  • Spiky or low-volume traffic. If you serve 100K tokens/day, hosted runs ~$5/month. Self-hosting starts at one mostly idle H100, ~$1,500/month. The math does not work.
  • Frontier capability. GPT-4.1 and Claude Opus 4.6 still beat the best open-weights model on the hardest reasoning, multi-step coding, and long-context tasks. If your product depends on that ceiling, you stay hosted (or do a hybrid).
  • Multi-modal at the frontier. Vision, audio, and tool-use chains are still smoother on hosted APIs than self-hosted.
  • Compliance edge cases where the hosted vendor is already certified for your regulator. Sometimes the box you need is already ticked by Bedrock or Vertex.

The pricing reality, May 2026

Public list prices, $ per 1M tokens:

Provider / model               Input $/M   Output $/M   Class
OpenAI GPT-4.1                 $5.00       $15.00       frontier
OpenAI GPT-4.1 mini            $0.40       $1.60        mid-tier
Anthropic Claude Opus 4.6      $5.00       $25.00       frontier
Anthropic Claude Sonnet 4.6    $3.00       $15.00       mid-tier
Together · Llama-3.3-70B       $0.88       $0.88        open-hosted
Bedrock · Llama-3.3-70B        $0.72       $0.72        open-hosted
Self-hosted Llama-3.3-70B      ~$0.40      ~$0.45       self-hosted (modeled)

Self-hosted line based on 2× H100 SXM at ~$2.39/GPU-hr (RunPod secure), 4,000 sustained output tok/s, 60% utilization, 25% ops overhead. Run your own numbers in the calculator.
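The arithmetic behind that modeled line is simple enough to sanity-check in a few lines of Python. The sketch below is our back-of-envelope version, not the calculator itself; in particular, it assumes the 4,000 tok/s figure is already utilization-adjusted, and the function name is ours.

```python
# Back-of-envelope $ per 1M output tokens for a self-hosted deployment.
# Assumption: the 4,000 sustained tok/s figure is the effective,
# already utilization-adjusted throughput of the 2x H100 pair.

def self_hosted_cost_per_million(
    gpu_hourly_usd: float = 2.39,                 # per-GPU rate, H100 SXM (RunPod secure)
    num_gpus: int = 2,
    sustained_output_tok_per_s: float = 4_000.0,
    ops_overhead: float = 0.25,                   # monitoring, on-call, engineer time
) -> float:
    """Modeled cost in $ per 1M output tokens."""
    hourly_gpu_cost = gpu_hourly_usd * num_gpus
    tokens_per_hour = sustained_output_tok_per_s * 3_600
    base_cost = hourly_gpu_cost / (tokens_per_hour / 1e6)
    return base_cost * (1 + ops_overhead)

print(f"~${self_hosted_cost_per_million():.2f} per 1M output tokens")  # ~$0.41
```

If you instead treat 4,000 tok/s as a peak rate and divide by the 60% utilization, the figure lands closer to ~$0.69/M, which is exactly why the note above says to run your own numbers.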

The hidden costs of self-hosting (and why they're smaller than you'd think)

The honest list of things self-hosting introduces that you do not have to think about with a hosted API:

  • Capacity planning and autoscaling
  • Eval drift detection (your model does not magically improve over time)
  • Patching, dependency upgrades, runtime version management
  • On-call for the inference path itself, not just your application
  • GPU-specific debugging when things behave unexpectedly

In practice, for a team running ≥$30K/mo of inference, this is one engineer at ~10–20% time. That is real cost, but it is dwarfed by the savings at that scale, and the operational maturity carries over to whatever else you run on those GPUs.

Hybrid is usually the right answer

Most teams that succeed at this do not migrate 100% of their traffic. They self-host the predictable, high-volume base load and route the long tail (hard prompts, escalations, frontier-only capability) to a hosted API. Concretely:

  • 80% of traffic → Llama-3.3-70B / Mistral-Small / Qwen-32B on owned GPUs
  • 20% of traffic → GPT-4.1 / Claude Opus / Gemini Ultra via hosted, gated by a router

The router decides per request based on (a) prompt features, (b) confidence on the open model, and (c) historical success rate. We have shipped routers as simple as 30 lines of Python and as complex as a learned classifier; both work. The hybrid setup keeps the cost win without sacrificing the frontier ceiling.
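For concreteness, here is what the simple end of that spectrum can look like. This is an illustrative sketch, not our production router: the regex, the thresholds, and the two confidence inputs are placeholder assumptions you would replace with your own prompt features and rolling eval data.

```python
# A toy per-request router: self-hosted open model by default,
# hosted frontier model for prompts that look hard or where the
# open model has a weak track record. All thresholds are illustrative.

import re

HARD_PATTERNS = re.compile(r"prove|derive|refactor|multi-step|contract", re.IGNORECASE)

def route(prompt: str,
          open_model_confidence: float,
          open_model_success_rate: float) -> str:
    """Return 'self_hosted' or 'frontier' for a single request.

    open_model_confidence: e.g. mean top-token probability from a cheap draft pass.
    open_model_success_rate: rolling eval pass rate for this prompt category.
    """
    # (a) prompt features: very long or pattern-matched prompts escalate
    looks_hard = len(prompt) > 8_000 or bool(HARD_PATTERNS.search(prompt))
    # (b) confidence on the open model
    low_confidence = open_model_confidence < 0.55
    # (c) historical success rate for this slice of traffic
    weak_history = open_model_success_rate < 0.90

    if looks_hard or (low_confidence and weak_history):
        return "frontier"      # GPT-4.1 / Claude Opus via hosted API
    return "self_hosted"       # Llama-3.3-70B on owned GPUs
```

A learned classifier replaces the hand-written rules with a small model trained on routing outcomes, but the interface stays the same: one function, one decision per request.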

What you give up by self-hosting

Honest inventory:

  • Free upgrades. When OpenAI ships GPT-4.2, you get it on day one. With self-hosted, you wait for the next open-weights release that hits your bar, and you do the migration work yourself.
  • The compliance halo. "We use OpenAI" is shorthand for "a SOC-2-certified vendor handles this." Self-hosted means your team owns those answers.
  • Throwaway prototyping. "Let me try this with a different model" is one line of code on hosted; on self-hosted, it is a config change and a redeploy.
  • The community knowledge base. When something weird happens with GPT-4.1, you can find a thousand Stack Overflow posts. With your own vLLM setup at scale, the answer might only exist in a Slack you have to join.

Most of these are real but not blocking. They are reasons to do the migration carefully, not reasons not to do it.

The honest break-even chart

Approximate break-even point where self-hosted Llama-3.3-70B beats hosted Llama-3.3-70B (Together / Bedrock):

  • ~50M tokens/month: break-even on a single 2× H100 stack at 60% utilization.
  • ~150M tokens/month: clear ~40% savings even after ops overhead.
  • ~500M tokens/month: savings approach 70% with prefix caching and quantization.
  • ~1B+ tokens/month: bare-metal becomes economical; savings approach 80%.

For frontier models (GPT-4.1, Claude Opus), open-weights cannot match the quality, so the comparison is moot. For mid-tier models, open-weights wins on cost across the board past ~50M tokens/month.
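If you want to see how those savings percentages fall out of the pricing table, the arithmetic is just a ratio of $/M rates. In the sketch below, the hosted and modeled self-hosted rates come from the table above; the improved at-scale rate is our assumption standing in for prefix-caching and quantization gains, not a quoted price.

```python
# Percent savings of self-hosted vs hosted open-weights, as a ratio of
# $ per 1M token rates. Table rates are from this page; the at-scale
# self-hosted rate is an illustrative assumption.

def pct_savings(hosted_per_m: float, self_hosted_per_m: float) -> float:
    return 1.0 - self_hosted_per_m / hosted_per_m

together_rate = 0.88   # hosted Llama-3.3-70B (Together)
modeled_rate  = 0.45   # self-hosted, modeled (see the pricing table)
at_scale_rate = 0.25   # assumed: prefix caching + quantization at 500M+ tok/mo

print(f"{pct_savings(together_rate, modeled_rate):.0%}")   # ~49%, before extra overhead
print(f"{pct_savings(together_rate, at_scale_rate):.0%}")  # ~72%
```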

What we'd ask before you migrate

  1. What is your current monthly bill, broken down by model?
  2. What does your traffic distribution look like, peak vs trough vs median?
  3. What evals are you currently running? Can we extend them to cover the new candidate?
  4. Do you have an existing GPU footprint or are we sizing from scratch?
  5. What is your latency budget at p95? At p99?
  6. Are there compliance constraints that affect deployment topology?

These six questions are the audit. We answer them in writing, with numbers, before any code moves.

Run your own numbers first.

Our calculator uses public pricing data. Your real workload may save more, or less. The cost audit produces the real number.

Open the calculator →
Talk to an engineer