What it costs to own your inference.
Public pricing from OpenAI, Anthropic, Together, Bedrock, and Replicate, vs. real GPU rental rates from RunPod, Lambda, and CoreWeave. Move the sliders to model your workload; the numbers update live.
The smaller, the cheaper. 8B and 24B match GPT-4.1-mini-class quality on many tasks. 70B matches mid-tier closed models. 405B is the only open option that approaches frontier quality.
Effective rate: $2.39/GPU-hr
How busy you keep the GPUs, averaged over the month. Workload shape sets a starting value.
Shared system prompts / RAG context / agent loops. RAG and chatbot workloads typically 30–60%.
Monitoring, on-call, eval drift detection, hardware failure response. 25% for a team that already runs k8s. 35% for greenfield.
⚠ Below ~30M tokens/mo, the operational overhead of self-hosting outweighs the savings. Stay on the hosted API and revisit when volume grows.
Your 500,000,000 tokens / month, costed across 17 providers and your self-hosted setup
Sorted cheapest-first. Open-weights numbers assume parity with frontier models is product-acceptable for your workload; for some products it is, for others (hard reasoning, complex multi-step tasks) it is not yet.
methodology & sources
How the math works
Self-hosted cost-per-million-tokens is computed per-token, not per-stack. The formula:
$/M_out = (gpu_count × per_gpu_hr × tier_mult) ÷ (peak_tps × util × 3600 / 1e6) × (1 + ops_overhead) × (1 - cache_hit_pct × 0.30)
monthly = tokens_M × $/M_out

This assumes elastic compute (you only pay for the GPU-hours you actually use, which is the real model on RunPod, Modal, Lambda's serverless tier, and Vast.ai). Below ~30M tokens/month, self-hosting is operationally not worth it regardless of the math, and the calculator says so.
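In code, a minimal sketch of the same formula. The input values in the example call are hypothetical illustrations, not the assumptions behind the scenario table below:

```python
def self_hosted_cost(
    gpu_count: int,        # GPUs in the serving stack
    per_gpu_hr: float,     # rental rate, $/GPU-hr
    tier_mult: float,      # 1.0 on-demand, lower for spot
    peak_tps: float,       # output tok/s at full load (FP8, mid-concurrency)
    util: float,           # average utilization over the month, 0-1
    ops_overhead: float,   # 0.25 with an existing k8s team, 0.35 greenfield
    cache_hit_pct: float,  # fraction of tokens hitting the prefix cache, 0-1
    tokens_m: float,       # monthly volume, millions of tokens
) -> tuple[float, float]:
    """Return ($ per million output tokens, $ per month)."""
    if tokens_m < 30:
        # The calculator's "too small" flag: below ~30M tokens/mo,
        # ops overhead dominates and the math no longer matters.
        raise ValueError("below ~30M tokens/mo: stay on the hosted API")
    dollars_per_hr = gpu_count * per_gpu_hr * tier_mult
    m_tokens_per_hr = peak_tps * util * 3600 / 1e6
    per_m = dollars_per_hr / m_tokens_per_hr
    per_m *= 1 + ops_overhead           # people cost on top of compute
    per_m *= 1 - cache_hit_pct * 0.30   # cache hits save ~30% of their cost
    return per_m, per_m * tokens_m

# Hypothetical inputs, chosen for illustration only:
per_m, monthly = self_hosted_cost(
    gpu_count=2, per_gpu_hr=2.39, tier_mult=1.0, peak_tps=4500,
    util=0.22, ops_overhead=0.25, cache_hit_pct=0.40, tokens_m=500,
)
# per_m ~= $1.48/M out, monthly ~= $738
```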
Pricing data
- Hosted API rates: each vendor's public pricing page (May 2026), cross-checked against artificialanalysis.ai.
- GPU rental rates: RunPod / Lambda / CoreWeave / AWS / Vast.ai pricing pages, May 2026.
- Throughput baselines: public vLLM 0.6.x and TensorRT-LLM benchmark posts; mid-concurrency, FP8 weights where supported.
Throughput numbers (output tok/s, FP8, mid-concurrency)
- Llama-3.1 8B / Mistral 7B (small): ~12,000 tok/s reference (1× H100 SXM equivalent), scales by GPU config below.
- Mistral-Small 24B / Qwen 32B (mid): ~8,000 tok/s reference (1× H100 SXM equivalent), scales by GPU config below.
- Llama-3.3 70B (default): ~4,500 tok/s reference (1× H100 SXM equivalent), scales by GPU config below.
- Llama-3.1 405B (frontier-class): ~2,200 tok/s reference (1× H100 SXM equivalent), scales by GPU config below.
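The same baselines as a lookup table, with the scaling step made explicit. Linear per-GPU scaling is a simplifying assumption for this sketch, not how the calculator necessarily does it; real multi-GPU scaling is sub-linear and depends on tensor-parallel overhead:

```python
REFERENCE_TPS = {  # output tok/s, 1x H100 SXM equivalent, FP8, mid-concurrency
    "llama-3.1-8b": 12_000,
    "mistral-small-24b": 8_000,
    "llama-3.3-70b": 4_500,
    "llama-3.1-405b": 2_200,
}

def peak_tps(model: str, gpu_count: int, scaling: float = 1.0) -> float:
    # `scaling` < 1.0 dampens the naive linear assumption to account
    # for interconnect and tensor-parallel overhead.
    return REFERENCE_TPS[model] * gpu_count * scaling
```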
What the calculator does NOT model
- Speculative decoding speedups (1.0–4× depending on workload; see our blog post and the sketch after this list).
- Reserved-instance discounts for AWS/GCP/Azure (up to 50% off on-demand).
- Bandwidth and storage costs (generally a rounding error vs. compute).
- Engineering time to migrate (one-time cost, see /services for ranges).
- Multi-region / failover redundancy (multiplier on stack count).
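If you want to fold the first two exclusions back in yourself, a hedged sketch (the parameter ranges mirror the bullets above; the function name is ours):

```python
def with_exclusions(per_m: float,
                    spec_decode_speedup: float = 1.0,  # 1.0-4.0x, workload-dependent
                    reserved_discount: float = 0.0,    # up to 0.50 on AWS/GCP/Azure
                    ) -> float:
    # Speculative decoding raises effective throughput; a reserved-instance
    # discount lowers the hourly rate. Both divide straight into $/M.
    return per_m * (1 - reserved_discount) / spec_decode_speedup

with_exclusions(1.48, spec_decode_speedup=2.0, reserved_discount=0.30)  # ~= $0.52/M
```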
Stress-tested scenarios
| Scenario | Hosted $/mo | Self-hosted $/mo | Savings |
|---|---|---|---|
| 500M tokens/mo, GPT-4.1, Llama-70B on 2× H100 | $7,500 | ~$735 | ~90% |
| 500M tokens/mo, Together Llama-70B, same self-hosted | $440 | ~$735 | hosted wins |
| 500M tokens/mo, Bedrock Llama-70B, Llama-70B on 2× H100 spot | $360 | ~$455 | hosted wins (close) |
| 2B tokens/mo, Together Llama-70B, Llama-70B on 4× H100 | $1,760 | ~$1,290 | ~27% |
| 10B tokens/mo, GPT-4.1-mini, Llama-8B on 1× H100 | $16,000 | ~$340 | ~98% |
| 50M tokens/mo, anything, anything | — | flagged: too small | stay hosted |
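The savings column is plain arithmetic on the two monthly totals, so any row can be re-checked:

```python
def savings_pct(hosted_monthly: float, self_hosted_monthly: float) -> float:
    """Savings vs. hosted, in percent; negative means hosted wins."""
    return (1 - self_hosted_monthly / hosted_monthly) * 100

savings_pct(7500, 735)  # ~90.2  -> "~90%" (GPT-4.1 row)
savings_pct(440, 735)   # ~-67.0 -> hosted wins (Together row)
```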
The model says X%. The actual engagement says Y%.
The calculator above models a representative workload. Real engagements involve real prompt distributions, real cache hit rates, and real model choices. The cost audit produces the real number: fixed-fee, refundable.
Start with an audit →