Llama-3.3-70B inference: hosted vs self-hosted on 2× H100
A detailed cost-per-token comparison across Together, Bedrock, Replicate, Fireworks, and self-hosted on RunPod / Lambda / CoreWeave / bare-metal H100 SXM, with the assumptions and the calculator behind the math.
The numbers, headline-first
For a Llama-3.3-70B output workload at 500M tokens/month with mid-saturation traffic and 40% cacheable prefixes:
- Together AI: $440 / mo (modeled at $0.88/M)
- Bedrock: $360 / mo (modeled at $0.72/M)
- Fireworks: $450 / mo (modeled at $0.90/M)
- Replicate: $590 / mo (modeled at $1.18/M)
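- Self-hosted, 2× H100 RunPod secure: $185 / mo amortized (about $0.37/M, rounded to ~$0.40/M in the analysis below)
- Self-hosted, 2× H100 bare metal (3-yr): $115 / mo amortized (about $0.23/M, rounded to ~$0.25/M in the analysis below)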
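The managed-vendor figures above are simply the per-million-token rate multiplied by monthly volume. A minimal sketch of that arithmetic, using the rates quoted in this article (the function and variable names are illustrative, not taken from the calculator):

```python
# Per-million-token rates as modeled in this article (USD).
RATES_USD_PER_M = {
    "Together AI": 0.88,
    "Bedrock": 0.72,
    "Fireworks": 0.90,
    "Replicate": 1.18,
}

TOKENS_PER_MONTH = 500_000_000  # the 500M tokens/month workload


def monthly_bill(rate_usd_per_m: float, tokens_per_month: int) -> float:
    """Monthly spend in USD for a flat per-million-token rate."""
    return rate_usd_per_m * tokens_per_month / 1_000_000


for vendor, rate in RATES_USD_PER_M.items():
    print(f"{vendor}: ${monthly_bill(rate, TOKENS_PER_MONTH):,.0f} / mo")
```

The self-hosted lines do not follow this shape: they are an amortized share of a fixed stack cost, not a metered rate.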
The hardware setup we're modeling
One stack of 2× H100 SXM serving FP8 weights via vLLM 0.6.x, with paged attention and prefix caching enabled. We assume a sustained 4,000 output tok/s at mid concurrency, a conservative midpoint of public vLLM benchmarks. At 60% utilization and 40% cache hits, effective monthly capacity is ~9.6B output tokens per stack.
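To see how the capacity figure is put together, here is a rough sketch of the throughput arithmetic. The 730-hour month is an assumption (the article does not state one), and the cache-hit benefit is treated as an implied multiplier rather than derived, since how much prefix caching raises output throughput depends on the prompt/output mix.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def monthly_output_tokens(tok_per_s: float, utilization: float,
                          hours_per_month: float = HOURS_PER_MONTH) -> float:
    """Output tokens per month at a sustained rate and average utilization."""
    return tok_per_s * 3600 * hours_per_month * utilization


raw = monthly_output_tokens(4_000, 1.0)        # roughly 10.5B at 100% duty
utilized = monthly_output_tokens(4_000, 0.60)  # roughly 6.3B at 60% utilization

# Multiplier needed to get from the utilized figure to the quoted ~9.6B.
QUOTED_CAPACITY = 9.6e9
implied_cache_uplift = QUOTED_CAPACITY / utilized  # roughly 1.5x

print(f"raw: {raw / 1e9:.1f}B, utilized: {utilized / 1e9:.1f}B, "
      f"implied cache uplift: {implied_cache_uplift:.2f}x")
```

Under these assumptions, the ~9.6B figure corresponds to prefix caching lifting effective output capacity by about half. That multiplier is worth re-checking against your own traffic.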
Per-vendor analysis
Together AI ($0.88/M): the most popular "managed open-weights" option. Solid quality, transparent pricing, no quota games. The premium over self-hosted reflects their ops cost and margin.
AWS Bedrock ($0.72/M): the cheapest of the four managed options, and the usual pick for compliance-driven customers. AWS's scale gives it a per-token cost advantage. Caveat: regional availability varies, and provisioned-throughput pricing is significantly higher.
Replicate ($1.18/M): the highest-priced of the four managed options. Pricing reflects their developer-friendly API, async-friendly deployment model, and a smaller-scale serving footprint per customer.
Fireworks ($0.90/M): in the middle. Quality is strong; their pitch is on speed and reliability rather than price.
Self-hosted on RunPod (~$0.40/M effective): the most popular GPU-specialty cloud. Secure tier at $2.39/GPU-hr is the realistic planning number; community tier ($1.99) works for non-prod.
Bare metal (~$0.25/M effective): only economical if you sustain high utilization across years. Operational overhead is real but quantifiable. See our defense of bare metal.
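The RunPod effective rate can be sanity-checked from the hourly price and the ~9.6B tokens/month stack capacity. This is a sketch under stated assumptions: a 730-hour month (not given in the article), and the stack billed around the clock. All names are illustrative.

```python
HOURS_PER_MONTH = 730  # assumption: average month length in hours


def stack_cost_per_month(gpus: int, usd_per_gpu_hr: float,
                         hours_per_month: float = HOURS_PER_MONTH) -> float:
    """Full monthly bill for an always-on GPU stack."""
    return gpus * usd_per_gpu_hr * hours_per_month


def usd_per_million(stack_cost_usd: float, capacity_tokens: float) -> float:
    """Effective cost per million tokens when the stack runs at capacity."""
    return stack_cost_usd / (capacity_tokens / 1_000_000)


stack = stack_cost_per_month(gpus=2, usd_per_gpu_hr=2.39)   # secure tier
rate = usd_per_million(stack, capacity_tokens=9.6e9)         # quoted capacity

workload_tokens = 500_000_000
amortized_share = rate * workload_tokens / 1_000_000
capacity_used = workload_tokens / 9.6e9

print(f"stack bill: ${stack:,.0f}/mo, effective: ${rate:.2f}/M")
print(f"500M-token share: ${amortized_share:,.0f}/mo "
      f"({capacity_used:.1%} of stack capacity)")
```

With these inputs the effective rate lands a little under the ~$0.40/M planning number. The amortized share only equals what you actually pay if the rest of the stack's capacity is consumed by other workloads; a lone 500M-token workload still carries the full stack bill.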
Where the savings come from
The $0.48/M gap between Together ($0.88/M) and self-hosted ($0.40/M) breaks down roughly as: 35% vendor margin, 25% vendor ops/SRE cost amortization, 15% vendor R&D allocation, and 25% real infrastructure delta. The first three are what you opt out of by self-hosting; the last is the real cost difference.
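Expressed in dollars per million tokens, that split can be sketched as follows. The percentages are the rough estimates given above, and the names are illustrative.

```python
# Rough shares of the hosted-vs-self-hosted gap, as estimated in this article.
GAP_SHARES = {
    "vendor margin": 0.35,
    "vendor ops/SRE amortization": 0.25,
    "vendor R&D allocation": 0.15,
    "real infrastructure delta": 0.25,
}


def split_gap(hosted_usd_per_m: float, self_hosted_usd_per_m: float,
              shares: dict[str, float]) -> dict[str, float]:
    """Allocate the per-million-token price gap across the given shares."""
    gap = hosted_usd_per_m - self_hosted_usd_per_m
    return {name: gap * share for name, share in shares.items()}


breakdown = split_gap(0.88, 0.40, GAP_SHARES)
for name, usd in breakdown.items():
    print(f"{name}: ${usd:.3f}/M")

# Everything except the infrastructure delta is what self-hosting opts out of.
opted_out = sum(usd for name, usd in breakdown.items()
                if name != "real infrastructure delta")
print(f"opted out by self-hosting: ${opted_out:.2f}/M")
```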
Where the calculator lands
The interactive version of this analysis is at /calculator. Plug in your monthly volume, current provider, and assumptions to get the number for your workload.
Caveats
Throughput numbers are conservative midpoints. Real workloads land in a band of ±25%. The cost audit produces the workload-specific number.
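To get a feel for what a ±25% throughput band does to self-hosted cost, here is a small sketch. It assumes the band applies to sustained throughput and that cost per token scales inversely with throughput at a fixed hourly price. Both are simplifications, not something the calculator is confirmed to do.

```python
def cost_band(base_usd_per_m: float, band: float = 0.25) -> tuple[float, float]:
    """Return (best, worst) cost per million tokens for a +/- throughput band.

    Higher throughput spreads the same hourly bill over more tokens, so
    cost per token moves inversely with throughput.
    """
    best = base_usd_per_m / (1 + band)   # throughput 25% above the midpoint
    worst = base_usd_per_m / (1 - band)  # throughput 25% below the midpoint
    return best, worst


best, worst = cost_band(0.40)  # RunPod planning number from this article
print(f"self-hosted range: ${best:.2f}/M to ${worst:.2f}/M")
```

Note the asymmetry: a 25% throughput shortfall raises cost per token by about a third, while a 25% gain lowers it by only a fifth.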