FastPriors
cost calculator

What it costs to own your inference.

Public pricing from OpenAI, Anthropic, Together, Bedrock, and Replicate, versus real GPU rental rates from RunPod, Lambda, and CoreWeave. Move the sliders to model your workload; the numbers update live.


The smaller the model, the cheaper. 8B and 24B models match GPT-4-mini-class quality on many tasks; 70B matches mid-tier closed models; 405B is the only open option that approaches the frontier.

Effective rate: $2.39/GPU-hr

How busy you keep the GPUs, averaged over the month. The workload shape sets a starting value.

Cache hits from shared system prompts, RAG context, and agent loops. RAG and chatbot workloads typically sit at 30–60%.

Monitoring, on-call, eval drift detection, and hardware failure response. 25% for a team that already runs k8s; 35% for greenfield.

your monthly bill today
$440.00
Together AI · Llama-3.3-70B · $0.880/M out
self-hosted estimate
$148.16
$0.296/M tokens · 2× H100 SXM on RunPod secure
66% savings
monthly $291.84
annual $3.5K

⚠ Below ~30M tokens/mo, the operational overhead of self-hosting outweighs the savings. Stay on the hosted API and revisit when volume grows.

compare

Your 500,000,000 tokens / month, costed across 17 providers and your self-hosted setup

Sorted cheapest-first. Open-weights numbers assume parity with frontier models is product-acceptable for your workload; for some products it is, for others (hard reasoning, complex multi-step) it is not yet.

| # | Provider · Model | Tier | Monthly |
|---|------------------|------|---------|
| 1 | Self-hosted · Llama-3.3 70B on 2× H100 SXM | Owned | $148.16 |
| 2 | Google · Gemini 2.5 Flash | Mid-tier | $200.00 |
| 3 | AWS Bedrock · Llama-3.3-70B | Open-weights hosted | $360.00 |
| 4 | Groq · Llama-3.3-70B (fast) | Open-weights hosted | $395.00 |
| 5 | Together AI · Llama-3.3-70B | Open-weights hosted (your selection) | $440.00 |
| 6 | Fireworks · Llama-3.3-70B | Open-weights hosted | $450.00 |
| 7 | HF Inference Endpoints · Llama-3.3-70B | Open-weights hosted | $500.00 |
| 8 | DeepSeek · V3 | Mid-tier | $550.00 |
| 9 | Replicate · Llama-3.3-70B | Open-weights hosted | $590.00 |
| 10 | OpenAI · GPT-4.1 mini | Mid-tier | $800.00 |
| 11 | Anthropic · Claude Haiku 4.5 | Mid-tier | $2.5K |
| 12 | Mistral · Large 2 | Mid-tier | $3.0K |
| 13 | Google · Gemini 2.5 Pro | Frontier | $5.0K |
| 14 | Cohere · Command R+ | Mid-tier | $5.0K |
| 15 | OpenAI · GPT-4.1 | Frontier | $7.5K |
| 16 | Anthropic · Claude Sonnet 4.6 | Frontier | $7.5K |
| 17 | xAI · Grok 4 | Frontier | $7.5K |
| 18 | Anthropic · Claude Opus 4.6 | Frontier | $13K |

Want this validated against your real workload?

Drop your work email and we'll send a 1-page report with these assumptions replaced by your actual model, traffic distribution, and prompt patterns. Free.

methodology & sources

How the math works

Self-hosted cost-per-million-tokens is computed per-token, not per-stack. The formula:

$/M_out = (gpu_count × per_gpu_hr × tier_mult)
        ÷ (peak_tps × util × 3600 / 1e6)
        × (1 + ops_overhead)
        × (1 − cache_hit_pct × 0.30)

monthly = tokens_M × $/M_out

This assumes elastic compute (you only pay for the GPU-hours you actually use, which is the real billing model on RunPod, Modal, Lambda's serverless tier, and Vast.ai). Below ~30M tokens/month, self-hosting is operationally not worth it regardless of the math, and the calculator says so.
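The formula above can be sketched in a few lines. The function mirrors the formula's variable names; the example inputs below it are illustrative assumptions (a 2× H100 setup at roughly the page's effective rate), not the calculator's exact defaults.

```python
def self_hosted_cost(gpu_count, per_gpu_hr, tier_mult,
                     peak_tps, util, ops_overhead, cache_hit_pct,
                     tokens_m):
    """Self-hosted $/M output tokens and monthly bill, per the formula above."""
    # Million output tokens produced per hour at this utilization
    mtok_per_hr = peak_tps * util * 3600 / 1e6
    rate = (gpu_count * per_gpu_hr * tier_mult) / mtok_per_hr
    rate *= 1 + ops_overhead            # ops burden on top of raw compute
    rate *= 1 - cache_hit_pct * 0.30    # cached prefixes cost ~70% less
    return rate, tokens_m * rate

# Illustrative inputs (assumed, not the page's exact slider defaults):
rate, monthly = self_hosted_cost(
    gpu_count=2, per_gpu_hr=2.39, tier_mult=1.0,
    peak_tps=9_000,        # 2× the 4,500 tok/s 70B single-GPU reference
    util=0.55, ops_overhead=0.25, cache_hit_pct=0.40,
    tokens_m=500)
# → rate ≈ $0.295/M, monthly ≈ $147.5
```

With these assumed inputs the result lands close to the $0.296/M and $148.16 shown in the calculator readout above.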

Pricing data

  • Hosted API rates: each vendor's public pricing page (May 2026), cross-checked against artificialanalysis.ai.
  • GPU rental rates: RunPod / Lambda / CoreWeave / AWS / Vast.ai pricing pages, May 2026.
  • Throughput baselines: public vLLM 0.6.x and TensorRT-LLM benchmark posts; mid-concurrency, FP8 weights where supported.

Throughput numbers (output tok/s, FP8, mid-concurrency)

  • Llama-3.1 8B / Mistral 7B (small): ~12,000 tok/s
  • Mistral-Small 24B / Qwen 32B (mid): ~8,000 tok/s
  • Llama-3.3 70B (default): ~4,500 tok/s
  • Llama-3.1 405B (frontier-class): ~2,200 tok/s

All figures are 1× H100 SXM reference numbers and scale with the GPU configuration selected above.
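The reference table above reduces to a lookup plus a scaling factor. This is a simplified sketch: the linear-by-GPU-count scaling (with an optional efficiency knob) is an assumption, since real tensor-parallel scaling is sublinear.

```python
# Reference output throughput (tok/s, FP8, mid-concurrency, 1× H100 SXM),
# from the list above. Keys are illustrative identifiers, not API model names.
REFERENCE_TPS = {
    "llama-3.1-8b": 12_000,
    "mistral-small-24b": 8_000,
    "llama-3.3-70b": 4_500,
    "llama-3.1-405b": 2_200,
}

def peak_tps(model: str, gpu_count: int, parallel_efficiency: float = 1.0) -> float:
    """Assumed scaling: reference throughput × GPU count × efficiency.
    Real tensor-parallel scaling is sublinear; pass efficiency < 1.0 to model that."""
    return REFERENCE_TPS[model] * gpu_count * parallel_efficiency

# e.g. peak_tps("llama-3.3-70b", 2) → 9000.0 tok/s for the 2× H100 default
```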

What the calculator does NOT model

  • Speculative decoding speedups (1.0–4× depending on workload; see our blog post).
  • Reserved-instance discounts for AWS/GCP/Azure (up to 50% off on-demand).
  • Bandwidth and storage costs, generally a rounding error vs. compute.
  • Engineering time to migrate (one-time cost, see /services for ranges).
  • Multi-region / failover redundancy (multiplier on stack count).

Stress-tested scenarios

| Scenario | Hosted | Self-hosted | Savings |
|----------|--------|-------------|---------|
| 500M tokens/mo, GPT-4.1, Llama-70B on 2× H100 | $7,500 | ~$735 | ~90% |
| 500M tokens/mo, Together Llama-70B, same self-hosted | $440 | ~$735 | hosted wins |
| 500M tokens/mo, Bedrock Llama-70B, Llama-70B on 2× H100 spot | $360 | ~$455 | hosted wins (close) |
| 2B tokens/mo, Together Llama-70B, Llama-70B on 4× H100 | $1,760 | ~$1,290 | ~27% |
| 10B tokens/mo, GPT-4.1-mini, Llama-8B on 1× H100 | $16,000 | ~$340 | ~98% |
| 50M tokens/mo, anything, anything | — | flagged: too small | stay hosted |

The model says X%. The actual engagement says Y%.

The calculator above models a representative workload. Real engagements involve real prompt distributions, real cache hit rates, and real model choices. The cost audit produces the real number: fixed fee, refundable.

Start with an audit →