What it costs to
own your inference.
Public pricing from OpenAI, Anthropic, Together, Bedrock, and Replicate, vs. real GPU rental rates from RunPod, Lambda, and CoreWeave. Move the sliders to model your workload, the numbers update live.
Default workhorse. Roughly Sonnet-class on most evals. Qwen 2.5 72B is interchangeable.
FP8 is the production default on H100/H200/B200 (near-zero accuracy loss). INT4 is smaller still (AWQ/GPTQ) at a small accuracy cost. BF16 is full precision.
Fits: 70GB weights + KV/activations ≤ 160GB VRAM (58GB headroom).
Effective rate: $2.69/GPU-hr × 2 GPUs = $5.38/hr stack
How busy you keep the GPUs averaged over the month. Workload shape sets a starting value.
Shared system prompts / RAG context / agent loops. RAG and chatbot workloads typically 30-60%.
Monitoring, on-call, eval drift detection, hardware failure response. 25% for a team that already runs k8s. 35% for greenfield.
Your 500,000,000 tokens / month, ranked across providers and your self-hosted stack.
Cheapest first. Filter by class for fair like-for-like comparison. Your API and self-hosted rows stay pinned across filters.
methodology & sources
How the math works
Two costs, the higher one wins.
per-token cost ($/M out) =
gpu_count × $/gpu-hr × tier_mult × 1,000,000
÷ ( peak_tps × utilisation × 3600 )
× ( 1 + ops_overhead )
× ( 1 − cache_hit_rate × 0.30 )
stack rental floor ($/mo) =
gpu_count × $/gpu-hr × tier_mult × 730 hr
× ( 1 + ops_overhead )
monthly bill = max( stack rental floor, tokens_M × per-token cost )The rental floor is the load-bearing assumption. You rent the GPU stack for the full month whether you put 10M tokens or 10B through it; the per-token cost only beats the floor once volume is high enough to fill the rented capacity. Break-even thresholds for the default stack are tabulated below.
Pricing & benchmark sources
Every (model × GPU stack × quantisation) cell shown by this calculator is generated from scripts/validate-calc-data.mjs, which cross-references published benchmarks across multiple sources (NVIDIA TRT-LLM perf, vLLM, SGLang, MLPerf Inference v4.1, Artificial Analysis, AMD ROCm, vendor datasheets) and rejects any cell with fewer than 3 independent sources or >25% disagreement. Hosted API prices are audited against each vendor's public pricing page (Anthropic, OpenAI, Google, xAI, Cohere, Mistral, DeepSeek). Full source list: scripts/sources/*.md in the site repo. Last refreshed 2026-05-10.
Throughput grid (median aggregate tok/s, all-confidence: high)
Selected high-confidence cells from the validated grid. The calculator uses the full grid live; this is a small representative slice.
| Model | Stack | Quant | Aggregate tok/s | Sources |
|---|---|---|---|---|
| Llama 3.1 405B Instruct | 8× H100 SXM5 | FP8 | ~12,760 | 3 |
| Llama 3.1 8B Instruct | 1× B200 SXM6 | BF16 | ~17,000 | 3 |
| Llama 3.1 8B Instruct | 1× B200 SXM6 | FP8 | ~22,000 | 3 |
| Llama 3.1 8B Instruct | 2× B200 SXM6 | BF16 | ~31,450 | 3 |
| Llama 3.1 8B Instruct | 2× B200 SXM6 | FP8 | ~40,700 | 3 |
| Llama 3.1 8B Instruct | 1× H100 SXM5 | BF16 | ~8,800 | 3 |
| Llama 3.1 8B Instruct | 1× H100 SXM5 | FP8 | ~12,500 | 3 |
| Llama 3.1 8B Instruct | 2× H100 SXM5 | BF16 | ~16,280 | 3 |
| Llama 3.1 8B Instruct | 2× H100 SXM5 | FP8 | ~23,125 | 3 |
| Llama 3.1 8B Instruct | 4× H100 SXM5 | BF16 | ~29,920 | 3 |
| Llama 3.1 8B Instruct | 4× H100 SXM5 | FP8 | ~42,500 | 3 |
| Llama 3.1 8B Instruct | 8× H100 SXM5 | BF16 | ~51,040 | 3 |
| Llama 3.1 8B Instruct | 8× H100 SXM5 | FP8 | ~72,500 | 3 |
| Llama 3.1 8B Instruct | 2× H200 SXM5 | BF16 | ~19,425 | 3 |
| Llama 3.1 8B Instruct | 2× H200 SXM5 | FP8 | ~26,825 | 3 |
| Llama 3.1 8B Instruct | 2× AMD Instinct MI300X | BF16 | ~14,800 | 3 |
| Llama 3.1 8B Instruct | 2× AMD Instinct MI300X | FP8 | ~18,500 | 3 |
| Llama 3.3 70B Instruct | 1× B200 SXM6 | FP8 | ~6,500 | 3 |
| Llama 3.3 70B Instruct | 2× B200 SXM6 | BF16 | ~10,175 | 3 |
| Llama 3.3 70B Instruct | 2× B200 SXM6 | FP8 | ~12,025 | 3 |
| Llama 3.3 70B Instruct | 2× H100 SXM5 | FP8 | ~3,996 | 3 |
| Llama 3.3 70B Instruct | 4× H100 SXM5 | BF16 | ~2,992 | 3 |
| Llama 3.3 70B Instruct | 4× H100 SXM5 | FP8 | ~7,344 | 3 |
| Llama 3.3 70B Instruct | 8× H100 SXM5 | BF16 | ~5,104 | 3 |
| Llama 3.3 70B Instruct | 8× H100 SXM5 | FP8 | ~12,528 | 3 |
| Llama 3.3 70B Instruct | 2× H200 SXM5 | BF16 | ~1,573 | 3 |
| Llama 3.3 70B Instruct | 2× H200 SXM5 | FP8 | ~7,030 | 3 |
| Llama 3.3 70B Instruct | 2× AMD Instinct MI300X | BF16 | ~4,070 | 3 |
| Llama 3.3 70B Instruct | 2× AMD Instinct MI300X | FP8 | ~6,475 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 1× B200 SXM6 | FP8 | ~13,000 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 2× B200 SXM6 | FP8 | ~24,050 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 1× H100 SXM5 | BF16 | ~4,500 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 1× H100 SXM5 | FP8 | ~6,500 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 2× H100 SXM5 | BF16 | ~8,325 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 2× H100 SXM5 | FP8 | ~12,025 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 4× H100 SXM5 | BF16 | ~15,300 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 4× H100 SXM5 | FP8 | ~22,100 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 8× H100 SXM5 | BF16 | ~26,100 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 8× H100 SXM5 | FP8 | ~37,700 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 2× H200 SXM5 | BF16 | ~10,175 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 2× H200 SXM5 | FP8 | ~13,875 | 3 |
| Mistral Small 24B / Qwen 2.5 32B | 2× AMD Instinct MI300X | FP8 | ~10,730 | 3 |
What the calculator does NOT model
- Speculative decoding speedups (1.0-2× depending on workload).
- Reserved-instance discounts for AWS/GCP/Azure (up to 50% off on-demand).
- Bandwidth and storage costs, generally a rounding error vs. compute.
- Engineering time to migrate (one-time cost, see /services for ranges).
- Multi-region / failover redundancy (multiplier on stack count).
- Reasoning-token output overhead for o3 / R1 (the hosted price already reflects it).
Default stack break-evens
On the default stack (2× H100 SXM5, RunPod-secure $2.69/GPU-hr, 50% utilisation, 25% ops overhead), the rental floor is ~$4,910/mo. Volume × hosted-rate must exceed that floor before self-hosting saves money:
- vs GPT-5.5 ($30/M out): break-even ~164M tok/mo
- vs Claude Opus 4.7 ($25/M out): break-even ~196M tok/mo
- vs Sonnet 4.6 / GPT-5.4 ($15/M out): break-even ~327M tok/mo
- vs Gemini 2.5 Pro ($10/M out): break-even ~491M tok/mo
- vs Haiku 4.5 ($5/M out): break-even ~982M tok/mo
- vs GPT-5.4 mini ($4.50/M out): break-even ~1.09B tok/mo
- vs Grok 4.3 ($2.50/M out): break-even ~1.96B tok/mo
- vs Bedrock Llama-70B ($0.72/M out): break-even ~6.8B tok/mo (essentially never; open-weights hosted is already very cheap, the case for self-hosting is data sovereignty, not cost)
Below ~30M tok/mo, the operational overhead (eval harness, on-call rotation, drift dashboards) costs more than the inference savings even when the math works. The calculator surfaces this separately.