FastPriors
writeup · 12 min

What it actually costs to serve Llama-3.3-70B yourself in 2026

Public GPU rental rates × public benchmarks × honest utilization assumptions. The number lands at around $0.45/M output tokens: roughly 60% of Bedrock, half of Together and Fireworks, under 40% of Replicate.

April 19, 2026 · by Abhimanyu Singh

The question prospective customers ask us most often: what does it actually cost to serve a 70B-class open-weights model on owned GPUs? The published vendor numbers are not directly comparable, the public benchmarks use idealized configurations, and the math involves several variables that customers reasonably do not want to estimate on their own.

Here is the answer, with the work shown. The headline: ~$0.45 per 1M output tokens for Llama-3.3-70B FP8 on 2× H100 SXM at realistic utilization, with 25% ops overhead included.

#The inputs

GPU rental cost. H100 SXM secure-tier from RunPod is $2.39/GPU-hour as of April 2026. Two GPUs = $4.78/hour for the inference stack.

Sustainable throughput. Public vLLM 0.6.x benchmarks for Llama-3.3-70B FP8 on 2× H100 SXM at mid concurrency show ~6,000 output tok/s peak. We use 4,000 as the planning number, derating for realistic prompt distributions, partial saturation, and the gap between published peak figures and throughput sustained through the day.

Utilization. Single-product workloads typically achieve 50–70% utilization without aggressive multi-tenant batching. We model 60%.

Ops overhead. Monitoring, alerting, eval drift detection, on-call rotation, hardware failure response. We add 25% to the GPU bill to cover this. Customers running a mature inference platform on top of other GPUs amortize closer to 15%; teams starting from scratch land closer to 35%.

#The math

monthly hours:        720
gpu cost per stack:   $4.78/hr × 720       = $3,442
sustained output tps: 4,000 × 0.60         = 2,400 tok/s
monthly tokens:       2,400 × 3,600 × 720  = 6.22B output tokens
$ per 1M tokens:      $3,442 / 6,220       = $0.55
+ ops overhead:       × 1.25                = $0.69 per 1M tokens
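The arithmetic above fits in a few lines of Python. This is a sketch of the post's own model with its planning numbers, not a measurement:

```python
# Baseline cost model: 2x H100 SXM serving Llama-3.3-70B FP8.
# All figures are the planning numbers from the text above.
HOURS_PER_MONTH = 720
GPU_STACK_RATE = 4.78        # $/hr for 2x H100 SXM secure tier
PLANNING_TPS = 4_000         # derated sustained output tok/s
UTILIZATION = 0.60           # fraction of capacity actually serving
OPS_OVERHEAD = 1.25          # +25% for monitoring, on-call, failures

gpu_cost = GPU_STACK_RATE * HOURS_PER_MONTH                  # ~$3,442
sustained_tps = PLANNING_TPS * UTILIZATION                   # 2,400 tok/s
monthly_tokens_m = sustained_tps * 3_600 * 720 / 1e6         # ~6,221M tokens
cost_per_m = gpu_cost / monthly_tokens_m                     # ~$0.55
cost_per_m_with_ops = cost_per_m * OPS_OVERHEAD              # ~$0.69
```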

Wait, that is $0.69, not $0.45. The difference is prefix caching.

#Prefix caching is half the answer

Most production inference workloads have shareable prefixes: system prompts, retrieved context, multi-turn chat history. Prefix caching skips the prefill work for the matching portion of the input. Public benchmarks of vLLM's paged-attention prefix cache show 5–12× speedups on the cached portion, varying with cache hit rate.

For a typical RAG or chatbot workload, cache hit rates run 30–60%. Modeling 40% with an effective 8× speedup on hits (an 8× speedup removes 7/8 = 0.875 of the cached work):

effective capacity:   2,400 × (1 / (1 - 0.40 × 0.875)) = 3,692 tok/s
monthly tokens:       3,692 × 3,600 × 720  = 9.57B output tokens
$ per 1M tokens:      $3,442 / 9,570       = $0.36
+ ops overhead:       × 1.25                = $0.45 per 1M tokens
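The same adjustment in Python. This is a first-order model (the speedup is applied as a uniform capacity uplift; real gains depend on prompt mix):

```python
# First-order prefix-caching uplift on the baseline numbers above.
# An 8x speedup on cache hits removes 7/8 = 0.875 of the cached work.
SUSTAINED_TPS = 2_400        # 4,000 tok/s planning number x 60% utilization
HIT_RATE = 0.40              # assumed fraction of work covered by the cache
SPEEDUP = 8.0                # assumed effective speedup on cached prefill
MONTHLY_GPU_COST = 4.78 * 720            # 2x H100 stack, ~$3,442/month

work_saved = HIT_RATE * (1 - 1 / SPEEDUP)              # 0.35
effective_tps = SUSTAINED_TPS / (1 - work_saved)       # ~3,692 tok/s
monthly_tokens_m = effective_tps * 3_600 * 720 / 1e6   # ~9,570M tokens
cost_per_m = MONTHLY_GPU_COST / monthly_tokens_m * 1.25  # ~$0.45 with ops
```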

That is the headline number. Realistic, not idealized.

#How this compares

Public April-2026 list prices for Llama-3.3-70B output tokens:

  • Together AI: $0.88 / M
  • Bedrock (AWS): $0.72 / M
  • Replicate: $1.18 / M
  • Fireworks: $0.90 / M
  • Self-hosted (modeled above): $0.45 / M

So self-hosted is roughly half of Together and Fireworks, about 60% of Bedrock, and under 40% of Replicate. That is consistent with our actual customer outcomes: typical migrations deliver 50–75% savings depending on workload.

#What pushes the number down further

The $0.45/M number is conservative in several ways:

  • Higher utilization. Mature multi-tenant platforms hit 75%+ utilization, dropping per-token cost ~20%.
  • Higher cache hit rates. RAG-heavy workloads with consistent system prompts can sustain 60%+ hit rates.
  • Cheaper GPUs. RunPod community tier is $1.99/GPU-hr; bare metal amortizes to ~$1.20/GPU-hr over 3 years. Each step down on the GPU price is a proportional drop in the per-token cost.
  • Better quantization. FP8 with QAT or speculative decoding on top of FP8 can push sustained throughput meaningfully higher than the 4,000 tok/s baseline.

The aggressive end of this is around $0.18/M tokens. We have customers operating in that range. Getting there takes engineering work, not luck.
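Plugging the aggressive-end levers into the same model shows why that range is reachable. Every input here is an assumption, and with idealized inputs the model lands at or below the quoted ~$0.18; real deployments derate:

```python
# Aggressive-end scenario, all assumed inputs: bare metal amortized to
# ~$1.20/GPU-hr, 75% utilization, 60% cache hit rate, 15% ops overhead.
monthly_cost = 2 * 1.20 * 720 * 1.15                 # two GPUs + light ops
effective_tps = 4_000 * 0.75 / (1 - 0.60 * 0.875)    # ~6,316 tok/s
monthly_tokens_m = effective_tps * 3_600 * 720 / 1e6
cost_per_m = monthly_cost / monthly_tokens_m         # ~$0.12 idealized
```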

#What pushes the number up

  • Lower utilization. Spiky traffic (5× peak-to-trough) drops utilization to 30%, which doubles per-token cost.
  • No cacheable prefixes. Pure single-shot completion workloads with no shared system prompt do not benefit from prefix caching at all. Per-token cost lands around $0.69 for those.
  • Compliance overhead. Air-gapped or HIPAA deployments add operational cost (audit logging, separate key management, restricted networking) that pushes ops overhead above 25%.
  • Multi-region. Operating in 3 geographies costs ~3× the single-region setup; per-token cost stays ~constant if utilization holds, but the setup is harder.

#The breakeven

Modeling against Bedrock at $0.72/M tokens: the stack is a fixed cost of ~$4,300/month ($3,442 of GPU plus 25% ops), so the breakeven is around 6B output tokens/month. Below that, the fixed stack cost exceeds what the same tokens would cost on Bedrock; above it, savings scale with volume at roughly $0.27 per 1M tokens once capacity is right-sized.

  • 6B tokens/month: ~$0 net (breakeven)
  • 9.6B tokens/month (one stack at modeled capacity): ~$31K/year savings
  • 40B tokens/month (four stacks near capacity): ~$130K/year savings

The migration cost is roughly fixed regardless of volume, so the payback period is volume-dependent: quick for high-volume workloads, effectively never for low-volume ones.
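The breakeven itself is one line of arithmetic, derived from the fixed stack cost in the sections above:

```python
# Breakeven vs a hosted API: the monthly volume at which the fixed
# stack cost equals what the same tokens would cost on Bedrock.
FIXED_MONTHLY = 4.78 * 720 * 1.25   # 2x H100 + 25% ops, ~$4,302/month
BEDROCK_PER_M = 0.72                # hosted list price, $ per 1M tokens

breakeven_m_tokens = FIXED_MONTHLY / BEDROCK_PER_M   # ~5,975M, i.e. ~6B/month
```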

#How to verify your own number

The calculator at /calculator models this with your inputs. Or run the math yourself:

  1. Pick your GPU tier and per-hour cost.
  2. Pick your throughput baseline (use 4,000 tok/s for Llama-70B FP8 on 2× H100; halve it for FP16; double it for 4× H100 with tensor parallel).
  3. Estimate utilization and cache hit rate from your actual traffic.
  4. Add 25–35% ops overhead.
  5. Compare against your current hosted bill.
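Those five steps fold into one function. A sketch of the post's model; every default is an assumption you should override with your own measurements:

```python
def per_million_token_cost(gpu_rate_hr: float,
                           baseline_tps: float,
                           utilization: float,
                           cache_hit_rate: float,
                           ops_overhead: float = 0.25,
                           cache_speedup: float = 8.0) -> float:
    """Modeled $ per 1M output tokens for a self-hosted stack.

    Mirrors the post's model: derated throughput, utilization,
    a first-order prefix-cache uplift, and an ops multiplier.
    """
    sustained = baseline_tps * utilization
    work_saved = cache_hit_rate * (1 - 1 / cache_speedup)
    effective_tps = sustained / (1 - work_saved)
    monthly_tokens_m = effective_tps * 3_600 * 720 / 1e6
    monthly_cost = gpu_rate_hr * 720 * (1 + ops_overhead)
    return monthly_cost / monthly_tokens_m

# The post's headline scenario:
# per_million_token_cost(4.78, 4_000, 0.60, 0.40)  -> ~0.45
```

Set `cache_hit_rate=0.0` to reproduce the no-caching number (~$0.69).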

If the answer says "you save 40%+" and your monthly volume is consistent, the migration is probably worth doing. If it says less than that, the math is closer and the answer depends on factors specific to your team.

The summary: $0.45/M is the realistic number for a typical 70B workload. Your number will land in a band around it. The cost audit produces the band.

Want this on your stack?

The cost audit lands at a number, not a recommendation. Refundable.

Talk to an engineer →
Try the calculator