What it actually costs to serve Llama-3.3-70B yourself in 2026
Public GPU rental rates × public benchmarks × honest utilization assumptions. The number lands at around $0.45/M output tokens: roughly 60% of Bedrock's price, about a third of Replicate's.
The single most-asked question we get from prospective customers: what does it actually cost to serve a 70B-class open-weights model on owned GPUs? The published vendor numbers are not directly comparable, the public benchmarks use idealized configurations, and the math involves several variables that customers reasonably do not want to estimate on their own.
Here is the answer, with the work shown. The headline: ~$0.45 per 1M output tokens for Llama-3.3-70B FP8 on 2× H100 SXM at realistic utilization, with 25% ops overhead included.
#The inputs
GPU rental cost. H100 SXM secure-tier from RunPod is $2.39/GPU-hour as of May 2026. Two GPUs = $4.78/hour for the inference stack.
Sustainable throughput. Public vLLM 0.6.x benchmarks for Llama-3.3-70B FP8 on 2× H100 SXM at mid concurrency: ~6,000 output tok/s peak. We use 4,000 as the planning number; it derates for realistic prompt distributions, partial saturation, and the gap between published "peak" figures and what you sustain through a day.
Utilization. Single-product workloads typically achieve 50–70% utilization without aggressive multi-tenant batching. We model 60%.
Ops overhead. Monitoring, alerting, eval drift detection, on-call rotation, hardware failure response. We add 25% to the GPU bill to cover this. Customers already running a mature inference platform on other GPU fleets amortize closer to 15%; teams starting from scratch land closer to 35%.
#The math
monthly hours: 720
gpu cost per stack: $4.78/hr × 720 = $3,442
sustained output tps: 4,000 × 0.60 = 2,400 tok/s
monthly tokens: 2,400 × 3,600 × 720 = 6.22B output tokens
$ per 1M tokens: $3,442 / 6,220 = $0.55
+ ops overhead: × 1.25 = $0.69 per 1M tokens
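For anyone who wants to poke at the assumptions, here is the same arithmetic as a small Python function. The function name and parameters are ours, invented for this post, not from any library:

```python
def cost_per_million(gpu_hr_per_stack, peak_tps, utilization,
                     ops_multiplier, hours_per_month=720):
    """Dollars per 1M output tokens for one always-on inference stack."""
    monthly_gpu_cost = gpu_hr_per_stack * hours_per_month            # $3,442
    sustained_tps = peak_tps * utilization                           # 2,400 tok/s
    monthly_tokens_m = sustained_tps * 3600 * hours_per_month / 1e6  # 6,220M
    return monthly_gpu_cost / monthly_tokens_m * ops_multiplier

# 2x H100 SXM at $2.39/GPU-hr, 4,000 tok/s planning number,
# 60% utilization, 25% ops overhead:
print(round(cost_per_million(2 * 2.39, 4_000, 0.60, 1.25), 2))       # 0.69
```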
Wait, that is $0.69, not $0.45. The difference is prefix caching.
#Prefix caching is half the answer
Most production inference workloads have shareable prefixes: system prompts, retrieved context, multi-turn chat history. Prefix caching skips the prefill work for the matching portion of the input. Public benchmarks of vLLM's paged-attention prefix cache show 5–12× speedup on the cached portion, depending on the workload.
For a typical RAG or chatbot workload, cache hit rates run 30–60%. Modeling 40% with an effective 8× speedup on hits:
effective capacity: 2,400 × (1 / (1 − 0.40 × 0.875)) = 2,400 / 0.65 ≈ 3,692 tok/s (0.875 = 1 − 1/8, the share of work a hit skips)
monthly tokens: 3,692 × 3,600 × 720 = 9.57B output tokens
$ per 1M tokens: $3,442 / 9,570 = $0.36
+ ops overhead: × 1.25 = $0.45 per 1M tokens
That is the headline number. Realistic, not idealized.
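In sketch form, the cache adjustment is a single multiplier applied to the baseline cost. Again, the naming is ours:

```python
def cache_capacity_gain(hit_rate, speedup):
    """Effective-capacity multiplier from prefix caching."""
    return 1 / (1 - hit_rate * (1 - 1 / speedup))

base = 0.69                                    # $/M from the math above
print(base / cache_capacity_gain(0.40, 8))     # ~0.45
```

The multiplier treats the hit rate as the fraction of total compute that is cacheable prefill, which is the same simplification the in-text math makes.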
#How this compares
Public May-2026 list prices for Llama-3.3-70B output tokens:
- Together AI: $0.88 / M
- Bedrock (AWS): $0.72 / M
- Replicate: $1.18 / M
- Fireworks: $0.90 / M
- Self-hosted (modeled above): $0.45 / M
So self-hosted is roughly 60% of Bedrock's price, about a third of Replicate's, and half of Fireworks and Together. That is consistent with our actual customer outcomes: a typical migration delivers 50–75% savings depending on workload.
#What pushes the number down further
The $0.45/M number is conservative in several ways:
- Higher utilization. Mature multi-tenant platforms hit 75%+ utilization, dropping per-token cost ~20%.
- Higher cache hit rates. RAG-heavy workloads with consistent system prompts can sustain 60%+ hit rates.
- Cheaper GPUs. RunPod community tier is $1.99/GPU-hr; bare metal amortizes to ~$1.20/GPU-hr over 3 years. Each step down on the GPU price is a proportional drop in the per-token cost.
- Better quantization. FP8 with QAT or speculative decoding on top of FP8 can push sustained throughput meaningfully higher than the 4,000 tok/s baseline.
The aggressive end of this is around $0.18/M tokens; the sketch below shows how the levers compose. We have customers operating in that range. Getting there takes engineering work, not luck.
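Here is that composition run through the same model, condensing the two sketches above into one function. The parameter values are illustrative choices on our part, pulled from the bullets above, not measurements:

```python
def cost_per_million(gpu_hr, peak_tps, util, hit_rate, speedup, ops):
    """Per-token cost with utilization, caching, and ops folded in."""
    gain = 1 / (1 - hit_rate * (1 - 1 / speedup))
    monthly_tokens_m = peak_tps * util * gain * 3600 * 720 / 1e6
    return gpu_hr * 720 / monthly_tokens_m * ops

# Community-tier GPUs, 75% utilization, 60% hit rate, mature ops (15%):
print(f"{cost_per_million(2 * 1.99, 4_000, 0.75, 0.60, 8, 1.15):.2f}")  # 0.20
# Amortized bare metal at ~$1.20/GPU-hr:
print(f"{cost_per_million(2 * 1.20, 4_000, 0.75, 0.60, 8, 1.15):.2f}")  # 0.12
```

The ~$0.18/M figure sits inside that $0.12–$0.20 band; the quantization and speculative-decoding gains from the last bullet are what close the gap from the rental-price end.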
#What pushes the number up
- Lower utilization. Spiky traffic (5× peak-to-trough) drops utilization to 30%, which doubles per-token cost.
- No cacheable prefixes. Pure single-shot completion workloads with no shared system prompt do not benefit from prefix caching at all. Per-token cost lands around $0.69 for those.
- Compliance overhead. Air-gapped or HIPAA deployments add operational cost (audit logging, separate key management, restricted networking) that pushes ops overhead above 25%.
- Multi-region. Operating in 3 geographies costs ~3× the single-region setup; per-token cost stays ~constant if utilization holds, but the setup is harder.
#The breakeven
Modeling against Bedrock at $0.72/M tokens: the stack's fixed cost is ~$4,302/month ($3,442 in GPU rental plus 25% ops), so the breakeven sits around 6B output tokens/month, the volume at which the hosted bill you would replace matches the cost of keeping the stack up. The sketch after this list reproduces the rows:
- 2B tokens/month: ~$2.9K/month worse than Bedrock (below breakeven)
- 6B tokens/month: ~$0 net (breakeven)
- 9.5B tokens/month (one stack near modeled capacity): ~$30K/year savings
- beyond one stack: each additional fully loaded stack adds roughly $31K/year
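A minimal sketch of that breakeven arithmetic, under the single-stack assumption:

```python
fixed_monthly = 2 * 2.39 * 720 * 1.25   # ~$4,302/month: GPU rental + 25% ops
bedrock = 0.72                          # $ per 1M output tokens

print(fixed_monthly / bedrock)          # breakeven: ~5,975M ≈ 6B tokens/month

for volume_m in (2_000, 6_000, 9_500):  # monthly volume, millions of tokens
    yearly = (volume_m * bedrock - fixed_monthly) * 12
    print(f"{volume_m / 1_000}B tok/month: ${yearly / 1_000:+,.0f}K/year")
```

The single-stack assumption is load-bearing here: if you can scale the stack down during quiet hours, the low-volume rows improve.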
The migration cost is roughly fixed regardless of volume, so the payback period is volume-dependent: quick for high-volume workloads, never for low-volume ones.
#How to verify your own number
The calculator at /calculator models this with your inputs. Or run the math yourself (a worked example follows the list):
- Pick your GPU tier and per-hour cost.
- Pick your throughput baseline (use 4,000 tok/s for Llama-70B FP8 on 2× H100; halve it for FP16; double it for 4× H100 with tensor parallel).
- Estimate utilization and cache hit rate from your actual traffic.
- Add 25–35% ops overhead.
- Compare against your current hosted bill.
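As a worked example of that recipe, with the same arithmetic as the sketches above. The inputs here are placeholders; substitute your own:

```python
gpu_hr, peak_tps, util = 2 * 2.39, 4_000, 0.55   # your tier, baseline, traffic
hit_rate, speedup, ops = 0.30, 8, 1.35           # cautious cache + ops inputs

gain = 1 / (1 - hit_rate * (1 - 1 / speedup))
monthly_tokens_m = peak_tps * util * gain * 3600 * 720 / 1e6
print(f"${gpu_hr * 720 / monthly_tokens_m * ops:.2f} per 1M output tokens")  # ~$0.60
```

At $0.60/M against Bedrock's $0.72/M, this hypothetical team saves about 17%, below the 40% bar in the next paragraph, so it should probably stay hosted until utilization or cache hit rate improves.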
If the answer says "you save 40%+" and your monthly volume is consistent, the migration is probably worth doing. If it says less than that, the math is closer and the answer depends on factors specific to your team.
The summary: $0.45/M is the realistic number for a typical 70B workload. Your number will land in a band around it. The cost audit produces the band.