FP8 inference on H100: a worked migration
Hands-on log of moving a Llama 3.1 70B workload from BF16 to FP8 on H100. What broke, what we measured, and the eval parity report.
This is a worked log from one of our recent migrations: moving a 70B-class production workload from BF16 to FP8 on 8× H100 SXM. The headline number was 1.6× throughput at flat eval scores. The interesting part is what happened in the middle.
#The setup
Starting state: Llama-3.1-70B in BF16, served via vLLM 0.6.x on 8× H100 SXM, ~$32K/month at the GPU layer, p95 latency 380ms, throughput ~1,200 tok/s aggregate. The product is a customer support copilot at a Series B SaaS. Real production traffic, mostly mid-length prompts (1–4K tokens) with mid-length generations (100–400 tokens).
Goal: reduce per-token cost without measurable quality regression on the customer's eval suite (a mix of internal golden-set tests, IFEval, and a custom rubric for tone matching).
#Why FP8 was the candidate
H100 has dedicated FP8 tensor cores at twice the throughput of BF16; NVIDIA publishes ~1.6–1.8× as the expected speedup. In practice you rarely see the full theoretical number: kernel launch overhead, memory bandwidth, and the conversion cost on either side of the fast path all eat into it. But the lower bound on a well-tuned migration is around 1.4×, which is enough to justify the eng work for a workload at this scale.
The other reason: FP8 weights take half the memory of BF16, which leaves room for a larger KV cache on the same GPUs and separately helps throughput. The cumulative effect is what makes FP8 the highest-leverage quantization step on H100 right now.
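Rough arithmetic on the memory side (a back-of-the-envelope sketch; the 70B parameter count and 8× 80 GB of HBM are the only inputs, and it ignores activations and runtime overhead):

```python
# Back-of-the-envelope HBM budget, BF16 vs FP8 weights.
# Assumptions (illustrative, not measured): 70e9 params, 2 bytes/param in BF16,
# 1 byte/param in FP8, 8x 80 GB H100, everything left over available to KV cache.
params = 70e9
total_hbm_gb = 8 * 80                            # 640 GB across the node

bf16_weights_gb = params * 2 / 1e9               # ~140 GB
fp8_weights_gb = params * 1 / 1e9                # ~70 GB

print(f"BF16: {bf16_weights_gb:.0f} GB weights, {total_hbm_gb - bf16_weights_gb:.0f} GB left for KV cache")
print(f"FP8:  {fp8_weights_gb:.0f} GB weights, {total_hbm_gb - fp8_weights_gb:.0f} GB left for KV cache")
```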
#Step 1, eval parity baseline
Before changing the runtime, we ran the customer's full eval suite against the BF16 baseline to establish the reference numbers. This took half a day. The results became the bar that the FP8 version had to clear.
We use a paired-bootstrap test for significance. Per-task delta < 1.5% with 95% CI is our default tolerance for a quantization migration; the customer signed off on that before we touched the model.
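For reference, the parity gate is roughly the following (a minimal sketch; `bf16_scores` and `fp8_scores` stand for per-example scores on the same eval items, and the -1.5% threshold is the tolerance agreed above):

```python
import numpy as np

def paired_bootstrap_delta(baseline, candidate, n_boot=10_000, seed=0):
    """95% bootstrap CI on the mean per-example score delta (candidate - baseline)."""
    rng = np.random.default_rng(seed)
    deltas = np.asarray(candidate, dtype=float) - np.asarray(baseline, dtype=float)
    idx = rng.integers(0, len(deltas), size=(n_boot, len(deltas)))
    boot_means = deltas[idx].mean(axis=1)
    lo, hi = np.percentile(boot_means, [2.5, 97.5])
    return deltas.mean(), (lo, hi)

# Usage per task (scores paired on the same eval items):
#   mean_delta, (lo, hi) = paired_bootstrap_delta(bf16_scores, fp8_scores)
#   passes = lo > -0.015   # per-task delta must stay above -1.5% at the 95% CI
```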
#Step 2, pick a quantization path
Three paths on H100 FP8:
- Per-tensor scaling, calibrated. Easiest. Run a calibration set, get scale factors, ship. Quality usually within tolerance for 70B-class models.
- Per-channel scaling, calibrated. Slightly more complex. Better quality on small/mid models. Marginal on 70B.
- FP8 with QAT (quantization-aware training). Best quality, but requires training compute the customer did not have. We did not consider it for this engagement.
We started with per-tensor calibrated, ~10K calibration samples drawn from production traffic.
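Mechanically, per-tensor calibration is just tracking an absolute max per tensor over the calibration set and dividing by the FP8 E4M3 range. A minimal PyTorch sketch (hooking every Linear and deriving an amax-based scale is one reasonable way to do it, not the exact pipeline we ran):

```python
import torch

FP8_E4M3_MAX = 448.0   # largest finite magnitude representable in float8_e4m3fn

@torch.no_grad()
def calibrate_per_tensor_scales(model, calib_batches):
    """Track the running abs-max of every Linear's input activations over the
    calibration set, then derive one FP8 scale factor per tensor."""
    amax = {}

    def make_hook(name):
        def hook(module, inputs, output):
            amax[name] = max(amax.get(name, 0.0), inputs[0].detach().abs().max().item())
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules()
               if isinstance(m, torch.nn.Linear)]
    for batch in calib_batches:        # ~10K samples drawn from production traffic
        model(batch)
    for h in handles:
        h.remove()

    # scale maps the observed dynamic range onto the FP8 representable range
    return {name: a / FP8_E4M3_MAX for name, a in amax.items()}
```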
#Step 3, what broke
The first FP8 build passed throughput targets and failed eval parity. Specifically:
- IFEval: -0.4% (within tolerance)
- Internal golden set: +0.1% (within tolerance, slightly better, noise)
- Tone-matching rubric: -3.7% (over tolerance)
The tone-matching regression was concentrated in one specific use case: generating empathetic responses to angry customer messages. The FP8 version was producing measurably colder responses. The model had not lost capability; it had just lost the steering signal that made tone alignment work.
Investigation showed the regression was driven by activation outliers in the early decoder layers. The per-tensor scaling was clipping these, which manifested as the tone signal getting attenuated.
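The diagnostic itself was unglamorous: log per-layer activation statistics on a traffic sample and look for layers where the absolute max dwarfs the bulk of the distribution, since those are the layers where a single per-tensor scale spends its resolution on a handful of outliers. A sketch of that kind of report (the 99.9th-percentile heuristic is an assumption, not the exact analysis we ran):

```python
import torch

@torch.no_grad()
def activation_outlier_report(model, batch, pctl=0.999):
    """Ratio of abs-max to the high-percentile bulk of each Linear's input
    activations; a large ratio means per-tensor scaling is dominated by outliers."""
    ratios = {}

    def make_hook(name):
        def hook(module, inputs, output):
            x = inputs[0].detach().float().abs().flatten()
            ratios[name] = (x.max() / torch.quantile(x, pctl).clamp(min=1e-6)).item()
        return hook

    handles = [m.register_forward_hook(make_hook(n))
               for n, m in model.named_modules()
               if isinstance(m, torch.nn.Linear)]
    model(batch)
    for h in handles:
        h.remove()

    # sort worst-first: in our case the early decoder layers topped the list
    return dict(sorted(ratios.items(), key=lambda kv: -kv[1]))
```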
#Step 4, fix
Three options to recover the tone signal:
- Switch to per-channel scaling. Higher implementation cost, marginal expected gain on 70B.
- Identify the outlier-prone layers and keep them in BF16. Mixed-precision approach. Costs ~5% of the throughput gain, recovers most of the quality.
- SmoothQuant-style activation smoothing before quantization. Free at inference time but requires a calibration-time transform.
We went with option 3, SmoothQuant on the early decoder layers, per-tensor FP8 elsewhere. Final eval parity numbers:
- IFEval: -0.2% (within tolerance)
- Internal golden set: +0.1% (within tolerance)
- Tone-matching rubric: -1.1% (within tolerance)
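The smoothing step is a calibration-time rebalancing: per-input-channel activation outliers are divided out of the activations and folded into the weights, so the quantizer sees a flatter distribution and the inference graph is unchanged. A minimal sketch of the SmoothQuant-style transform for one Linear (the `alpha=0.5` and the clamping are conventional defaults, not our tuned values):

```python
import torch

@torch.no_grad()
def smooth_linear(linear, act_amax_per_channel, alpha=0.5):
    """SmoothQuant-style smoothing for one nn.Linear.

    act_amax_per_channel: per-input-channel abs-max of the activations feeding
    this layer, collected on the calibration set (shape [in_features]).
    Returns the per-channel factor to fold into whatever produces the
    activations (e.g. the preceding norm), so inference cost is unchanged.
    """
    w_amax = linear.weight.abs().max(dim=0).values            # per input channel
    s = (act_amax_per_channel.clamp(min=1e-5) ** alpha) / \
        (w_amax.clamp(min=1e-5) ** (1.0 - alpha))

    linear.weight.mul_(s)       # W' = W * diag(s): weights absorb the outlier scale
    return 1.0 / s              # activations get X' = X * diag(1/s), folded upstream
```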
#Step 5, production rollout
We staged the rollout behind a feature flag. 5% of traffic for 48 hours, watching latency, error rates, and a live drift dashboard for the eval rubric. No regressions surfaced in production that the offline evals had not predicted. Ramped to 25%, then 50%, then 100% over a week.
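The flag itself was nothing special; the detail worth copying is making the split deterministic per conversation, so a thread never bounces between backends mid-session and ramping the percentage only ever moves traffic in one direction. A sketch (the function and bucket count are hypothetical, not our actual routing layer):

```python
import hashlib

def route_backend(conversation_id: str, fp8_fraction: float) -> str:
    """Deterministically assign a conversation to the FP8 or BF16 deployment.

    Hash-based bucketing pins a conversation to one backend at a given rollout
    fraction, and because the cutoff only increases (5% -> 25% -> 50% -> 100%),
    traffic only ever moves from BF16 to FP8, never back and forth.
    """
    bucket = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16) % 10_000
    return "fp8" if bucket < fp8_fraction * 10_000 else "bf16"
```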
BF16 was kept warm in the same cluster for 30 days as a fallback. We never had to flip back. After 30 days we tore down the BF16 deployment and reclaimed the GPUs.
#The numbers
- Throughput: 1.62× (1,200 → 1,944 tok/s aggregate)
- p95 latency: 380ms → 218ms (-43%)
- Per-token cost: -38% (the throughput gain reflected as fewer GPU-hours per million tokens; arithmetic sketch after this list)
- Eval parity: all metrics within signed-off tolerance
- Engineering time: 4 weeks end-to-end
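The per-token cost line is the throughput line restated: at fixed GPU spend, GPU-hours per million tokens scale with the inverse of throughput.

```python
speedup = 1944 / 1200                  # measured aggregate throughput gain, ~1.62x
gpu_hours_ratio = 1 / speedup          # GPU-hours per million tokens, FP8 vs BF16
print(f"{speedup:.2f}x throughput -> {1 - gpu_hours_ratio:.0%} lower per-token cost")
# 1.62x throughput -> 38% lower per-token cost
```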
#What we'd do differently
The right call would have been to start with mixed-precision rather than going pure FP8 first. We spent a week chasing a regression that mixed-precision would have avoided. The total engagement was still on schedule, but the customer's engineering team did not love the "we hit a wall, here's the fix" mid-engagement update.
For 70B-class models with tone-sensitive use cases, mixed-precision is now our default first attempt. For 8B-class or for use cases where tone is not a feature, pure FP8 is fine.
#What this generalizes to
FP8 on H100 is real performance for real money. The 1.6× number is repeatable. The catch is that quantization is never lossless, and the loss usually shows up somewhere your standard benchmarks do not look. Building a workload-specific eval suite before you quantize, and treating eval parity as a hard gate, is the difference between a clean migration and a quiet quality regression that erodes user trust over weeks.