What it costs to own your inference.
Public pricing from OpenAI, Anthropic, Together, Bedrock, and Replicate, vs. real GPU rental rates from RunPod, Lambda, and CoreWeave. Move the sliders to model your workload; the numbers update live.
The smaller, the cheaper. 8B and 24B match GPT-4.1-mini-class quality on many tasks. 70B matches mid-tier closed models. 405B is the only open option that approaches frontier quality.
Effective rate: $2.39/GPU-hr
How busy you keep the GPUs, averaged over the month. Workload shape sets a starting value.
Shared system prompts / RAG context / agent loops. RAG and chatbot workloads typically 30–60%.
Monitoring, on-call, eval drift detection, hardware failure response. 25% for a team that already runs k8s. 35% for greenfield.
⚠ Below ~30M tokens/mo, the operational overhead of self-hosting outweighs the savings. Stay on the hosted API and revisit when volume grows.
Your 500,000,000 tokens / month, costed across 17 providers and your self-hosted setup
Sorted cheapest-first. Open-weights numbers assume parity with frontier models is product-acceptable for your workload; for some products it is, for others (hard reasoning, complex multi-step tasks) it is not yet.
methodology & sources
How the math works
Self-hosted cost-per-million-tokens is computed per-token, not per-stack. The formula:
$/M_out = (gpu_count × per_gpu_hr × tier_mult) ÷ (peak_tps × util × 3600 / 1e6) × (1 + ops_overhead) × (1 - cache_hit_pct × 0.30)
monthly = tokens_M × $/M_out

This assumes elastic compute (you only pay for the GPU-hours you actually use, which is the real model on RunPod, Modal, Lambda's serverless tier, and Vast.ai). Below ~30M tokens/month, self-hosting is operationally not worth it regardless of the math, and the calculator says so.
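In code, a minimal sketch of the same formula. The input values in the example call are hypothetical illustrations, not the assumptions behind the scenario table below:

```python
def self_hosted_cost(
    gpu_count: int,        # GPUs in the serving stack
    per_gpu_hr: float,     # rental rate, $/GPU-hr
    tier_mult: float,      # 1.0 on-demand, lower for spot
    peak_tps: float,       # output tok/s at full load (FP8, mid-concurrency)
    util: float,           # average utilization over the month, 0-1
    ops_overhead: float,   # 0.25 with an existing k8s team, 0.35 greenfield
    cache_hit_pct: float,  # fraction of tokens hitting the prefix cache, 0-1
    tokens_m: float,       # monthly volume, millions of tokens
) -> tuple[float, float]:
    """Return ($ per million output tokens, $ per month)."""
    if tokens_m < 30:
        # The calculator's "too small" flag: below ~30M tokens/mo,
        # ops overhead dominates and the math no longer matters.
        raise ValueError("below ~30M tokens/mo: stay on the hosted API")
    dollars_per_hr = gpu_count * per_gpu_hr * tier_mult
    m_tokens_per_hr = peak_tps * util * 3600 / 1e6
    per_m = dollars_per_hr / m_tokens_per_hr
    per_m *= 1 + ops_overhead           # people cost on top of compute
    per_m *= 1 - cache_hit_pct * 0.30   # cache hits save ~30% of their cost
    return per_m, per_m * tokens_m

# Hypothetical inputs, chosen for illustration only:
per_m, monthly = self_hosted_cost(
    gpu_count=2, per_gpu_hr=2.39, tier_mult=1.0, peak_tps=4500,
    util=0.22, ops_overhead=0.25, cache_hit_pct=0.40, tokens_m=500,
)
# per_m ~= $1.48/M out, monthly ~= $738
```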
Pricing data
- Hosted API rates: each vendor's public pricing page (May 2026), cross-checked against artificialanalysis.ai.
- GPU rental rates: RunPod / Lambda / CoreWeave / AWS / Vast.ai pricing pages, May 2026.
- Throughput baselines: public vLLM 0.6.x and TensorRT-LLM benchmark posts; mid-concurrency, FP8 weights where supported.
Throughput numbers (output tok/s, FP8, mid-concurrency)
- Llama-3.1 8B / Mistral 7B (small): ~12,000 tok/s reference (1× H100 SXM equivalent), scales by GPU config below.
- Mistral-Small 24B / Qwen 32B (mid): ~8,000 tok/s reference (1× H100 SXM equivalent), scales by GPU config below.
- Llama-3.3 70B (default): ~4,500 tok/s reference (1× H100 SXM equivalent), scales by GPU config below.
- Llama-3.1 405B (frontier-class): ~2,200 tok/s reference (1× H100 SXM equivalent), scales by GPU config below.
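The same baselines as a lookup table, with the scaling step made explicit. Linear per-GPU scaling is a simplifying assumption for this sketch, not how the calculator necessarily does it; real multi-GPU scaling is sub-linear and depends on tensor-parallel overhead:

```python
REFERENCE_TPS = {  # output tok/s, 1x H100 SXM equivalent, FP8, mid-concurrency
    "llama-3.1-8b": 12_000,
    "mistral-small-24b": 8_000,
    "llama-3.3-70b": 4_500,
    "llama-3.1-405b": 2_200,
}

def peak_tps(model: str, gpu_count: int, scaling: float = 1.0) -> float:
    # `scaling` < 1.0 dampens the naive linear assumption to account
    # for interconnect and tensor-parallel overhead.
    return REFERENCE_TPS[model] * gpu_count * scaling
```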
What the calculator does NOT model
- Speculative decoding speedups (1.0–4× depending on workload; see our blog post and the sketch after this list).
- Reserved-instance discounts for AWS/GCP/Azure (up to 50% off on-demand).
- Bandwidth and storage costs (generally a rounding error vs. compute).
- Engineering time to migrate (one-time cost, see /services for ranges).
- Multi-region / failover redundancy (multiplier on stack count).
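If you want to fold the first two exclusions back in yourself, a hedged sketch (the parameter ranges mirror the bullets above; the function name is ours):

```python
def with_exclusions(per_m: float,
                    spec_decode_speedup: float = 1.0,  # 1.0-4.0x, workload-dependent
                    reserved_discount: float = 0.0,    # up to 0.50 on AWS/GCP/Azure
                    ) -> float:
    # Speculative decoding raises effective throughput; a reserved-instance
    # discount lowers the hourly rate. Both divide straight into $/M.
    return per_m * (1 - reserved_discount) / spec_decode_speedup

with_exclusions(1.48, spec_decode_speedup=2.0, reserved_discount=0.30)  # ~= $0.52/M
```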
Stress-tested scenarios
| Scenario | Hosted $/mo | Self-hosted $/mo | Savings |
|---|---|---|---|
| 500M tokens/mo, GPT-4.1, Llama-70B on 2× H100 | $7,500 | ~$735 | ~90% |
| 500M tokens/mo, Together Llama-70B, same self-hosted | $440 | ~$735 | hosted wins |
| 500M tokens/mo, Bedrock Llama-70B, Llama-70B on 2× H100 spot | $360 | ~$455 | hosted wins (close) |
| 2B tokens/mo, Together Llama-70B, Llama-70B on 4× H100 | $1,760 | ~$1,290 | ~27% |
| 10B tokens/mo, GPT-4.1-mini, Llama-8B on 1× H100 | $16,000 | ~$340 | ~98% |
| 50M tokens/mo, anything, anything | — | flagged: too small | stay hosted |
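The savings column is plain arithmetic on the two monthly totals, so any row can be re-checked:

```python
def savings_pct(hosted_monthly: float, self_hosted_monthly: float) -> float:
    """Savings vs. hosted, in percent; negative means hosted wins."""
    return (1 - self_hosted_monthly / hosted_monthly) * 100

savings_pct(7500, 735)  # ~90.2  -> "~90%" (GPT-4.1 row)
savings_pct(440, 735)   # ~-67.0 -> hosted wins (Together row)
```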
The model says X%. The actual engagement says Y%.
The calculator above models a representative workload. Real engagements involve real prompt distributions, real cache hit rates, and real model choices. The cost audit produces the real number: fixed-fee, refundable.
Start with an audit →