FastPriors
cost calculator

What it costs to own your inference.

Public pricing from OpenAI, Anthropic, Together, Bedrock, and Replicate, vs. real GPU rental rates from RunPod, Lambda, and CoreWeave. Move the sliders to model your workload; the numbers update live.

Default workhorse. Roughly Sonnet-class on most evals. Qwen 2.5 72B is interchangeable.

FP8 is the production default on H100/H200/B200 (near-zero accuracy loss). INT4 (AWQ/GPTQ) is smaller still, at a small accuracy cost. BF16 is the unquantised 16-bit baseline.

Fits: 70GB weights + KV/activations ≤ 160GB VRAM (58GB headroom).
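
The fit check above can be sketched in a few lines of Python. This is an illustration only, not the calculator's code: `vram_fit` and `BYTES_PER_PARAM` are hypothetical names, the bytes-per-parameter figures ignore quantisation metadata, and the 32GB KV/activation reserve is an assumed value backed out from the 58GB headroom figure.

```python
# Approximate storage per parameter for each precision (assumption: ignores
# scale/zero-point overhead that real INT4 and FP8 checkpoints carry).
BYTES_PER_PARAM = {"BF16": 2.0, "FP8": 1.0, "INT4": 0.5}


def vram_fit(params_b, quant, gpu_count, gpu_vram_gb, kv_reserve_gb):
    """Return (weights_gb, total_vram_gb, headroom_gb) for a model on a stack.

    params_b      -- parameter count in billions
    kv_reserve_gb -- memory set aside for KV cache and activations (assumed)
    """
    weights_gb = params_b * BYTES_PER_PARAM[quant]
    total_vram_gb = gpu_count * gpu_vram_gb
    headroom_gb = total_vram_gb - weights_gb - kv_reserve_gb
    return weights_gb, total_vram_gb, headroom_gb


# 70B model at FP8 on 2x 80GB H100, with an assumed 32GB reserve.
weights, total, headroom = vram_fit(70, "FP8", 2, 80, 32)
fits = headroom >= 0
print(f"{weights:.0f}GB weights, {total}GB VRAM, {headroom:.0f}GB headroom, fits={fits}")
```

A negative headroom would mean the configuration does not fit and needs more GPUs or a smaller quantisation.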

Effective rate: $2.69/GPU-hr × 2 GPUs = $5.38/hr stack

How busy you keep the GPUs, averaged over the month. Workload shape sets a starting value.

Shared system prompts / RAG context / agent loops. RAG and chatbot workloads typically see 30-60% cache hit rates.

Monitoring, on-call, eval drift detection, hardware failure response. 25% for a team that already runs k8s. 35% for greenfield.

your monthly bill today
$13K
Anthropic · Claude Opus 4.7 · $25.00/M out
self-hosted estimate (confidence: high)
$4.9K
$0.879/M tokens · 2× H100 SXM5 (FP8) on RunPod secure
61% savings
monthly: $7.6K
annual: $91K
compare

Your 500,000,000 tokens / month, ranked across providers and your self-hosted stack.

Cheapest first. Filter by class for fair like-for-like comparison. Your API and self-hosted rows stay pinned across filters.

   1. xAI · Grok 4.3 · Frontier · $1.3K
   2. Self-hosted · Llama 3.3 70B Instruct on 2× H100 SXM5 (FP8) · Owned · your stack · $4.9K
   3. Google · Gemini 2.5 Pro (≤200K ctx) · Frontier · $5.0K
   4. Google · Gemini 3.1 Pro Preview (≤200K ctx) · Frontier · $6.0K
   5. OpenAI · GPT-5.4 · Frontier · $7.5K
   6. Anthropic · Claude Sonnet 4.6 · Frontier · $7.5K
   7. Anthropic · Claude Sonnet 4.5 · Frontier · $7.5K
   8. Anthropic · Claude Sonnet 4 · Frontier · $7.5K
   9. Anthropic · Claude Opus 4.7 · Frontier · your API · $13K
  10. Anthropic · Claude Opus 4.6 · Frontier · $13K
  11. Anthropic · Claude Opus 4.5 · Frontier · $13K
  12. OpenAI · GPT-5.5 · Frontier · $15K
  13. Anthropic · Claude Opus 4.1 (legacy $15/$75) · Frontier · $38K
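
The hosted figures in this ranking are consistent with monthly tokens multiplied by each provider's output rate. A minimal Python sketch of that ordering follows, under that assumption. `rank_options` is a hypothetical helper, only a handful of rows are included, the rates are the $/M out figures quoted on this page, and the self-hosted figure is the ~$4,910 rental floor for the default stack.

```python
def rank_options(tokens_m, hosted_rates, self_hosted_name, self_hosted_monthly):
    """Return (name, monthly_usd) pairs sorted cheapest first.

    tokens_m     -- monthly volume in millions of tokens
    hosted_rates -- mapping of provider/model name to $ per million output tokens
    """
    rows = [(name, tokens_m * rate) for name, rate in hosted_rates.items()]
    rows.append((self_hosted_name, self_hosted_monthly))
    return sorted(rows, key=lambda row: row[1])


# Output rates in $/M tokens, taken from figures quoted on this page.
hosted_rates = {
    "OpenAI · GPT-5.5": 30.00,
    "Anthropic · Claude Opus 4.7": 25.00,
    "Anthropic · Claude Sonnet 4.6": 15.00,
    "Google · Gemini 2.5 Pro": 10.00,
    "xAI · Grok 4.3": 2.50,
}

ranked = rank_options(500, hosted_rates, "Self-hosted · Llama 3.3 70B (2× H100, FP8)", 4910.0)
for position, (name, monthly) in enumerate(ranked, start=1):
    print(f"{position}. {name}: ${monthly:,.0f}/mo")
```

The sketch prices output tokens only; a workload with a large input share would need input rates as well.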

methodology & sources

How the math works

Two costs, the higher one wins.

per-token cost ($/M out) =
    gpu_count × $/gpu-hr × tier_mult × 1,000,000
  ÷ ( peak_tps × utilisation × 3600 )
  × ( 1 + ops_overhead )
  × ( 1 − cache_hit_rate × 0.30 )

stack rental floor ($/mo) =
    gpu_count × $/gpu-hr × tier_mult × 730 hr
  × ( 1 + ops_overhead )

monthly bill = max( stack rental floor, tokens_M × per-token cost )

The rental floor is the load-bearing assumption. You rent the GPU stack for the full month whether you put 10M tokens or 10B through it; the per-token term only exceeds the floor once volume is high enough to fill the rented capacity. Break-even thresholds for the default stack are tabulated below.
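
As a rough illustration, the formulas above can be written as a few Python functions. This is a sketch, not the calculator's actual code: the `Stack` class and its field names are made up for the example, `tier_mult` is assumed to be 1.0, and the 20% cache hit rate is an assumed value chosen because it reproduces the $0.879/M headline figure.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730    # from the rental-floor formula
CACHE_DISCOUNT = 0.30    # the 0.30 factor in the per-token formula


@dataclass
class Stack:
    gpu_count: int
    usd_per_gpu_hr: float
    peak_tps: float              # aggregate tokens/s at full load
    tier_mult: float = 1.0       # assumed: no pricing-tier adjustment
    utilisation: float = 0.50
    ops_overhead: float = 0.25
    cache_hit_rate: float = 0.0


def per_token_cost(s: Stack) -> float:
    """Cost in $ per million output tokens."""
    hourly_usd = s.gpu_count * s.usd_per_gpu_hr * s.tier_mult
    tokens_per_hr = s.peak_tps * s.utilisation * 3600
    return (hourly_usd * 1_000_000 / tokens_per_hr
            * (1 + s.ops_overhead)
            * (1 - s.cache_hit_rate * CACHE_DISCOUNT))


def rental_floor(s: Stack) -> float:
    """Monthly cost in $ of renting the stack, regardless of volume."""
    return (s.gpu_count * s.usd_per_gpu_hr * s.tier_mult
            * HOURS_PER_MONTH * (1 + s.ops_overhead))


def monthly_bill(s: Stack, tokens_m: float) -> float:
    """The higher of the rental floor and the volume-based cost."""
    return max(rental_floor(s), tokens_m * per_token_cost(s))


# Default stack: 2x H100 SXM5 at $2.69/GPU-hr, Llama 3.3 70B FP8 (~3,996 tok/s).
default = Stack(gpu_count=2, usd_per_gpu_hr=2.69, peak_tps=3996, cache_hit_rate=0.20)
print(f"per-token: ${per_token_cost(default):.3f}/M")
print(f"floor:     ${rental_floor(default):,.0f}/mo")
print(f"bill at 500M tokens: ${monthly_bill(default, 500):,.0f}/mo")
```

With these inputs the per-token term comes to roughly $0.88/M and the floor to roughly $4,909/mo, so at 500M tokens the floor sets the bill.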

Pricing & benchmark sources

Every (model × GPU stack × quantisation) cell shown by this calculator is generated from scripts/validate-calc-data.mjs, which cross-references published benchmarks across multiple sources (NVIDIA TRT-LLM perf, vLLM, SGLang, MLPerf Inference v4.1, Artificial Analysis, AMD ROCm, vendor datasheets) and rejects any cell with fewer than 3 independent sources or >25% disagreement. Hosted API prices are audited against each vendor's public pricing page (Anthropic, OpenAI, Google, xAI, Cohere, Mistral, DeepSeek). Full source list: scripts/sources/*.md in the site repo. Last refreshed 2026-05-10.
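
The rejection rule can be sketched as below. This is a guess at the shape of the check, not the contents of scripts/validate-calc-data.mjs: the function name, the spread metric ((max - min) / median), and the sample readings are all assumptions made for illustration.

```python
from statistics import median

MIN_SOURCES = 3          # fewer independent sources than this: reject
MAX_DISAGREEMENT = 0.25  # relative spread above this: reject


def validate_cell(samples):
    """Return the median tok/s for a cell, or None if the cell is rejected.

    samples -- one throughput reading (tok/s) per independent source.
    Disagreement is measured here as (max - min) / median, which is an
    assumed definition; the real script may use a different metric.
    """
    if len(samples) < MIN_SOURCES:
        return None
    mid = median(samples)
    spread = (max(samples) - min(samples)) / mid
    if spread > MAX_DISAGREEMENT:
        return None
    return mid


# Hypothetical readings, not real benchmark data.
accepted = validate_cell([3800, 3996, 4150])       # three sources, ~9% spread
too_few = validate_cell([3996, 4100])              # only two sources
too_scattered = validate_cell([3000, 3996, 5200])  # ~55% spread
print(accepted, too_few, too_scattered)
```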

Throughput grid (median aggregate tok/s, all cells high-confidence)

Selected high-confidence cells from the validated grid. The calculator uses the full grid live; this is a small representative slice.

Model                             Stack                   Quant  Aggregate tok/s  Sources
Llama 3.1 405B Instruct           8× H100 SXM5            FP8    ~12,760          3
Llama 3.1 8B Instruct             1× B200 SXM6            BF16   ~17,000          3
Llama 3.1 8B Instruct             1× B200 SXM6            FP8    ~22,000          3
Llama 3.1 8B Instruct             2× B200 SXM6            BF16   ~31,450          3
Llama 3.1 8B Instruct             2× B200 SXM6            FP8    ~40,700          3
Llama 3.1 8B Instruct             1× H100 SXM5            BF16   ~8,800           3
Llama 3.1 8B Instruct             1× H100 SXM5            FP8    ~12,500          3
Llama 3.1 8B Instruct             2× H100 SXM5            BF16   ~16,280          3
Llama 3.1 8B Instruct             2× H100 SXM5            FP8    ~23,125          3
Llama 3.1 8B Instruct             4× H100 SXM5            BF16   ~29,920          3
Llama 3.1 8B Instruct             4× H100 SXM5            FP8    ~42,500          3
Llama 3.1 8B Instruct             8× H100 SXM5            BF16   ~51,040          3
Llama 3.1 8B Instruct             8× H100 SXM5            FP8    ~72,500          3
Llama 3.1 8B Instruct             2× H200 SXM5            BF16   ~19,425          3
Llama 3.1 8B Instruct             2× H200 SXM5            FP8    ~26,825          3
Llama 3.1 8B Instruct             2× AMD Instinct MI300X  BF16   ~14,800          3
Llama 3.1 8B Instruct             2× AMD Instinct MI300X  FP8    ~18,500          3
Llama 3.3 70B Instruct            1× B200 SXM6            FP8    ~6,500           3
Llama 3.3 70B Instruct            2× B200 SXM6            BF16   ~10,175          3
Llama 3.3 70B Instruct            2× B200 SXM6            FP8    ~12,025          3
Llama 3.3 70B Instruct            2× H100 SXM5            FP8    ~3,996           3
Llama 3.3 70B Instruct            4× H100 SXM5            BF16   ~2,992           3
Llama 3.3 70B Instruct            4× H100 SXM5            FP8    ~7,344           3
Llama 3.3 70B Instruct            8× H100 SXM5            BF16   ~5,104           3
Llama 3.3 70B Instruct            8× H100 SXM5            FP8    ~12,528          3
Llama 3.3 70B Instruct            2× H200 SXM5            BF16   ~1,573           3
Llama 3.3 70B Instruct            2× H200 SXM5            FP8    ~7,030           3
Llama 3.3 70B Instruct            2× AMD Instinct MI300X  BF16   ~4,070           3
Llama 3.3 70B Instruct            2× AMD Instinct MI300X  FP8    ~6,475           3
Mistral Small 24B / Qwen 2.5 32B  1× B200 SXM6            FP8    ~13,000          3
Mistral Small 24B / Qwen 2.5 32B  2× B200 SXM6            FP8    ~24,050          3
Mistral Small 24B / Qwen 2.5 32B  1× H100 SXM5            BF16   ~4,500           3
Mistral Small 24B / Qwen 2.5 32B  1× H100 SXM5            FP8    ~6,500           3
Mistral Small 24B / Qwen 2.5 32B  2× H100 SXM5            BF16   ~8,325           3
Mistral Small 24B / Qwen 2.5 32B  2× H100 SXM5            FP8    ~12,025          3
Mistral Small 24B / Qwen 2.5 32B  4× H100 SXM5            BF16   ~15,300          3
Mistral Small 24B / Qwen 2.5 32B  4× H100 SXM5            FP8    ~22,100          3
Mistral Small 24B / Qwen 2.5 32B  8× H100 SXM5            BF16   ~26,100          3
Mistral Small 24B / Qwen 2.5 32B  8× H100 SXM5            FP8    ~37,700          3
Mistral Small 24B / Qwen 2.5 32B  2× H200 SXM5            BF16   ~10,175          3
Mistral Small 24B / Qwen 2.5 32B  2× H200 SXM5            FP8    ~13,875          3
Mistral Small 24B / Qwen 2.5 32B  2× AMD Instinct MI300X  FP8    ~10,730          3

What the calculator does NOT model

  • Speculative decoding speedups (1-2× depending on workload).
  • Reserved-instance discounts for AWS/GCP/Azure (up to 50% off on-demand).
  • Bandwidth and storage costs, generally a rounding error vs. compute.
  • Engineering time to migrate (one-time cost, see /services for ranges).
  • Multi-region / failover redundancy (multiplier on stack count).
  • Reasoning-token output overhead for o3 / R1 (the hosted price already reflects it).

Default stack break-evens

On the default stack (2× H100 SXM5, RunPod-secure $2.69/GPU-hr, 50% utilisation, 25% ops overhead), the rental floor is ~$4,910/mo. Volume × hosted-rate must exceed that floor before self-hosting saves money:

  • vs GPT-5.5 ($30/M out): break-even ~164M tok/mo
  • vs Claude Opus 4.7 ($25/M out): break-even ~196M tok/mo
  • vs Sonnet 4.6 / GPT-5.4 ($15/M out): break-even ~327M tok/mo
  • vs Gemini 2.5 Pro ($10/M out): break-even ~491M tok/mo
  • vs Haiku 4.5 ($5/M out): break-even ~982M tok/mo
  • vs GPT-5.4 mini ($4.50/M out): break-even ~1.09B tok/mo
  • vs Grok 4.3 ($2.50/M out): break-even ~1.96B tok/mo
  • vs Bedrock Llama-70B ($0.72/M out): break-even ~6.8B tok/mo (essentially never; open-weights hosting is already very cheap, so the case for self-hosting is data sovereignty, not cost)
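
The thresholds in this list follow from dividing the rental floor by each hosted rate. A short Python sketch of that arithmetic, assuming the default-stack inputs stated above (`break_even_m_tokens` is a hypothetical helper name):

```python
def break_even_m_tokens(floor_usd, hosted_usd_per_m):
    """Monthly volume (millions of tokens) at which hosted spend equals the floor."""
    return floor_usd / hosted_usd_per_m


# Default stack: 2 GPUs x $2.69/GPU-hr x 730 hr x (1 + 25% ops overhead).
floor_usd = 2 * 2.69 * 730 * 1.25

# Hosted output rates in $/M tokens, as listed above.
hosted_rates = {
    "GPT-5.5": 30.00,
    "Claude Opus 4.7": 25.00,
    "Sonnet 4.6 / GPT-5.4": 15.00,
    "Gemini 2.5 Pro": 10.00,
    "Haiku 4.5": 5.00,
    "GPT-5.4 mini": 4.50,
    "Grok 4.3": 2.50,
    "Bedrock Llama-70B": 0.72,
}

break_evens = {name: break_even_m_tokens(floor_usd, rate)
               for name, rate in hosted_rates.items()}
for name, volume_m in break_evens.items():
    print(f"vs {name}: ~{volume_m:,.0f}M tok/mo")
```

This only compares against the floor. Once volume exceeds what the stack can serve at the chosen utilisation, the per-token term takes over and the comparison changes.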

Below ~30M tok/mo, the operational overhead (eval harness, on-call rotation, drift dashboards) costs more than the inference savings even when the math works. The calculator surfaces this separately.