Llama-3.3-70B inference: hosted vs self-hosted on 2× H100, FastPriors

The numbers, headline-first

For a Llama-3.3-70B workload of 500M output tokens/month, with mid-saturation traffic and 40% cacheable prefixes:

Together AI: $440 / mo (modeled at $0.88/M)
Bedrock: $360 / mo (modeled at $0.72/M)
Fireworks: $450 / mo (modeled at $0.90/M)
Replicate: $590 / mo (modeled at $1.18/M)
Self-hosted, 2× H100 RunPod secure: $185 / mo amortized
Self-hosted, 2× H100 bare metal (3-yr): $115 / mo amortized
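
The hosted figures are simply monthly volume times the per-million rate. A minimal Python sketch of that arithmetic, using the rates quoted in this article (the dictionary and function names are illustrative, not part of any vendor API):

```python
# Per-million-token output rates quoted in this article (USD).
HOSTED_RATES_PER_M = {
    "Together AI": 0.88,
    "Bedrock": 0.72,
    "Fireworks": 0.90,
    "Replicate": 1.18,
}


def monthly_cost(tokens_per_month: float, rate_per_million: float) -> float:
    """Monthly spend in USD for a token volume billed at a per-million rate."""
    return round(tokens_per_month / 1_000_000 * rate_per_million, 2)


if __name__ == "__main__":
    volume = 500_000_000  # 500M output tokens/month, the workload modeled here
    for vendor, rate in HOSTED_RATES_PER_M.items():
        print(f"{vendor}: ${monthly_cost(volume, rate):,.0f} / mo")
```

Swapping in your own volume scales every hosted line linearly; the self-hosted lines do not scale this way because they depend on how much of a stack's capacity the workload occupies.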

The hardware setup we're modeling

One stack of 2× H100 SXM serving FP8 weights on vLLM 0.6.x, using PagedAttention with prefix caching enabled. We model a sustained 4,000 output tok/s at mid concurrency, a conservative midpoint of public vLLM benchmarks. At 60% utilization and 40% cache hits, effective monthly capacity is ~9.6B output tokens per stack.
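
For orientation, a stack like this might be configured through vLLM's Python API roughly as follows. This is a sketch under stated assumptions, not the exact configuration behind the numbers: the model ID, context cap, and memory fraction are assumed values, and the two function names are illustrative. PagedAttention needs no flag, since it is how vLLM manages the KV cache; prefix caching is opted into explicitly.

```python
def engine_kwargs() -> dict:
    """Keyword arguments for vllm.LLM. Values marked 'assumed' are not from the article."""
    return {
        "model": "meta-llama/Llama-3.3-70B-Instruct",  # assumed Hugging Face model ID
        "tensor_parallel_size": 2,        # shard the model across the 2x H100 stack
        "quantization": "fp8",            # FP8 weights
        "enable_prefix_caching": True,    # reuse KV cache for shared prompt prefixes
        "max_model_len": 8192,            # assumed context cap; tune per workload
        "gpu_memory_utilization": 0.90,   # assumed fraction of GPU memory given to vLLM
    }


def build_engine():
    """Construct the engine; requires vLLM installed and two visible GPUs."""
    # Imported lazily so the config can be inspected on a machine without vLLM.
    from vllm import LLM

    return LLM(**engine_kwargs())
```

Keeping the settings in a plain dictionary makes it easy to record exactly which configuration a throughput measurement was taken under.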

Per-vendor analysis

Together AI ($0.88/M): the most popular "managed open-weights" option. Solid quality, transparent pricing, no quota games. The premium over self-hosted reflects their ops cost and margin.

AWS Bedrock ($0.72/M): the cheapest managed option for compliance-driven customers. AWS's scale gives them a per-token cost advantage. Caveat: regional availability varies and provisioned-throughput pricing is significantly higher.

Replicate ($1.18/M): the highest of the four. Pricing reflects their developer-friendly API, async-friendly deployment model, and a smaller-scale serving footprint per customer.

Fireworks ($0.90/M): in the middle of the pack, just above Together AI. Quality is strong; their pitch is speed and reliability rather than price.

Self-hosted on RunPod (~$0.40/M effective): RunPod is the most popular GPU-specialty cloud. The secure tier at $2.39/GPU-hr is the realistic planning number; the community tier ($1.99/GPU-hr) works for non-production workloads.
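
One way to see where an effective rate like this comes from is to spread the stack's monthly rental over its modeled capacity. The sketch below does that with the article's $2.39/GPU-hr and ~9.6B tokens/month; the 730-hour month and the function names are assumptions of this example. Under those inputs it gives about $0.36/M, or roughly $182 for a 500M-token share, in the neighborhood of the ~$0.40/M and $185 figures above. Note that this reading charges the workload only its proportional share of the stack, which presumes the remaining capacity is put to use.

```python
HOURS_PER_MONTH = 730  # assumed average month length; not stated in the article


def stack_cost_per_month(gpu_hourly_usd: float, gpus: int,
                         hours: float = HOURS_PER_MONTH) -> float:
    """Rental cost in USD of running every GPU in the stack for the whole month."""
    return gpu_hourly_usd * gpus * hours


def effective_rate_per_million(monthly_cost_usd: float,
                               capacity_tokens: float) -> float:
    """USD per million output tokens when the cost is spread over full capacity."""
    return monthly_cost_usd / (capacity_tokens / 1_000_000)


if __name__ == "__main__":
    cost = stack_cost_per_month(2.39, gpus=2)        # RunPod secure tier, 2x H100
    rate = effective_rate_per_million(cost, 9.6e9)   # ~9.6B tokens/month capacity
    share = rate * 500                               # a 500M-token workload's share
    print(f"stack: ${cost:,.0f}/mo, rate: ${rate:.2f}/M, 500M share: ${share:,.0f}/mo")
```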

Bare metal (~$0.25/M effective): only economical if you sustain high utilization across years. Operational overhead is real but quantifiable. See our defense of bare metal.

Where the savings come from

The $0.48/M gap between Together ($0.88/M) and self-hosted on RunPod ($0.40/M) breaks down roughly as: 35% vendor margin, 25% vendor ops/SRE cost amortization, 15% vendor R&D allocation, and 25% real infrastructure delta. The first three are what you opt out of by self-hosting; the last is the real cost difference.
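
In dollar terms, the split can be sketched as follows. The percentages and the two rates are the ones quoted in this section; the names are illustrative.

```python
# Rough shares of the hosted-vs-self-hosted gap, as estimated in this article.
GAP_SHARES = {
    "vendor margin": 0.35,
    "vendor ops/SRE amortization": 0.25,
    "vendor R&D allocation": 0.15,
    "real infrastructure delta": 0.25,
}


def split_gap(hosted_rate: float, self_hosted_rate: float,
              shares: dict = GAP_SHARES) -> dict:
    """Allocate the per-million price gap (USD) across the given shares."""
    gap = hosted_rate - self_hosted_rate
    return {name: round(gap * share, 3) for name, share in shares.items()}


if __name__ == "__main__":
    for name, usd in split_gap(0.88, 0.40).items():
        print(f"{name}: ${usd:.3f}/M")
```

On these inputs, vendor margin accounts for about $0.17/M of the gap and the real infrastructure delta for about $0.12/M.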

Where the calculator lands

The interactive version of this analysis is at /calculator. Plug in your monthly volume, current provider, and assumptions to get the number for your workload.

Caveats

Throughput numbers are conservative midpoints; real workloads land within ±25% of them. The cost audit produces the workload-specific number.
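
To see what a ±25% throughput band does to a self-hosted rate, note that with a fixed hourly cost the price per token moves inversely with throughput, so the band is asymmetric. A minimal sketch, assuming a fixed hourly cost and taking the ~$0.40/M RunPod figure as the base (the function name is illustrative):

```python
def rate_band(base_rate_per_million: float, band: float = 0.25) -> tuple:
    """Best and worst USD-per-million rates when throughput varies by +/- band.

    Assumes hourly cost is fixed, so cost per token is proportional to 1/throughput.
    """
    best = base_rate_per_million / (1 + band)   # throughput 25% above the midpoint
    worst = base_rate_per_million / (1 - band)  # throughput 25% below the midpoint
    return best, worst


if __name__ == "__main__":
    best, worst = rate_band(0.40)
    print(f"best: ${best:.2f}/M, worst: ${worst:.2f}/M")
```

On the $0.40/M base this gives roughly $0.32/M to $0.53/M, so a throughput shortfall costs more than an equal-sized overshoot saves.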
