
FP8 vs INT8 quantization across 14 production workloads

Throughput, latency, and eval-parity outcomes for FP8 and INT8 across customer-support, code-completion, RAG, and agent workloads on H100. Where each one wins.

Headline

FP8 wins on H100 for almost every modern open-weights model. INT8 still wins on A100, in customer-controlled inference environments where FP8 hardware is unavailable, and on some reasoning-heavy workloads where calibrated INT8 holds quality more tightly than FP8's symmetric scaling.

The data

Aggregated across 14 production migrations we have run since 2024:

  • FP8 on H100, throughput uplift over BF16: 1.55× median, 1.41–1.78× range
  • INT8 on H100, throughput uplift over BF16: 1.32× median, 1.18–1.46× range
  • FP8 eval parity within 1.5% tolerance: 11 of 14 workloads passed first try; 3 needed mixed-precision
  • INT8 eval parity within 1.5% tolerance: 9 of 14 workloads passed first try; 5 needed per-channel scaling or SmoothQuant
  • Memory savings: 2× over BF16 for both FP8 and INT8 (both are 8-bit formats, so the weight footprint is identical)

Why FP8 generally wins

H100's tensor cores have native FP8 paths at 2× the throughput of BF16. INT8 paths spend extra quantize/dequantize work on the activation side around each GEMM, which eats into that theoretical 2×. The net effect: FP8 lands about 15% above INT8 on H100 in our measurements.

FP8's exponent representation also tolerates outlier activations better than INT8's clipping behavior. This is why fewer workloads need mixed-precision rescue with FP8 than with INT8.
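
To make the outlier point concrete, here is a toy comparison, assuming PyTorch's float8_e4m3fn dtype and a synthetic activation tensor with a few injected outliers; it illustrates the clipping argument and is not a reproduction of our benchmark numbers.

```python
# Toy comparison: how per-tensor INT8 and FP8 (E4M3) handle activation outliers.
# Requires PyTorch >= 2.1 for torch.float8_e4m3fn; outputs are illustrative only.
import torch

torch.manual_seed(0)

# Mostly small activations plus a handful of large outliers, a common LLM pattern.
x = torch.randn(4096) * 0.1
x[::512] = 20.0  # inject outliers

# INT8, per-tensor symmetric: one scale has to cover the outliers, so the small
# values get crushed into a few integer levels.
scale = x.abs().max() / 127.0
x_int8 = torch.clamp((x / scale).round(), -127, 127) * scale

# FP8 E4M3: exponent bits preserve relative precision across magnitudes, so the
# small values survive even with outliers present.
x_fp8 = x.to(torch.float8_e4m3fn).to(torch.float32)

def rel_err(approx, ref):
    return ((approx - ref).norm() / ref.norm()).item()

print(f"INT8 per-tensor relative error: {rel_err(x_int8, x):.4f}")
print(f"FP8 E4M3 relative error:        {rel_err(x_fp8, x):.4f}")
```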

When INT8 is the right answer

  • Older hardware: A100, V100, and other GPUs with no FP8 tensor cores. INT8 is the only meaningful quantization win.
  • AMD ROCm: FP8 support is improving but not as mature. INT8 path is more battle-tested.
  • Workloads where you have lots of calibration data: per-channel INT8 with extensive calibration can outperform default FP8 on some reasoning tasks (a minimal per-channel sketch follows this list).
  • Mid-tier chip targets: embedded or edge inference where FP8 hardware doesn't exist yet.
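
Here is a minimal sketch of the per-channel point, using a synthetic weight matrix with one outlier output channel as a stand-in for real calibration data; a production pipeline would also calibrate activation scales.

```python
# Per-tensor vs per-channel symmetric INT8 weight quantization, illustrated on a
# synthetic weight matrix with one outlier output channel.
import torch

torch.manual_seed(0)
w = torch.randn(4096, 4096) * 0.02
w[0] *= 50.0  # one output channel with a much larger dynamic range

def quant_dequant(weight, scale):
    return torch.clamp((weight / scale).round(), -127, 127) * scale

# Per-tensor: a single scale shared by every output channel, so the outlier
# channel dictates precision for everyone else.
w_pt = quant_dequant(w, w.abs().max() / 127.0)

# Per-channel: one scale per output channel (dim 0), isolating the outlier.
w_pc = quant_dequant(w, w.abs().amax(dim=1, keepdim=True) / 127.0)

for name, approx in [("per-tensor", w_pt), ("per-channel", w_pc)]:
    err = ((approx - w).norm() / w.norm()).item()
    print(f"{name:12s} relative error: {err:.4f}")
```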

FP4 and the road ahead

NVIDIA Blackwell (B100/B200) ships FP4 tensor cores. Our preliminary tests show another 1.4–1.6× over FP8 on supported workloads. As of mid-2026 the FP4 calibration tooling is less mature; we recommend running FP8 first and migrating to FP4 once the runtime support stabilizes (likely vLLM 0.7.x, TensorRT-LLM 0.16+).
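
For the "run FP8 first" step, the lowest-friction path on Hopper-class hardware is on-the-fly FP8 weight quantization in the serving engine. A hedged sketch with vLLM follows; the model name, prompt, and sampling settings are placeholders, and flags and defaults shift between vLLM releases, so check the docs for the version you actually deploy.

```python
# Sketch: serve an existing BF16 checkpoint with vLLM's dynamic FP8 quantization.
# Model name and prompt are placeholders; verify the quantization flag against the
# vLLM version you run.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # any BF16 checkpoint you already serve
    quantization="fp8",                        # quantize weights to FP8 at load time
)

params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Summarize this support ticket: ..."], params)
print(outputs[0].outputs[0].text)
```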

Recommendation tree

  1. Are you on H100 / H200 / B200? → FP8 first.
  2. Are you on A100 / older NVIDIA? → INT8.
  3. Are you on AMD MI300X? → INT8 today, FP8 in 2027.
  4. Did your FP8 attempt fail eval parity? → Try mixed-precision (most-sensitive layers in BF16) before retreating to INT8; a toy sketch of the layer selection follows this list.
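
A toy sketch of the mixed-precision rescue in step 4: round-trip most Linear weights through FP8 to simulate quantized storage while skipping a hand-picked "sensitive" list. The module names are hypothetical, and a production stack would express this through its quantizer's ignore/exclude configuration rather than a manual loop.

```python
# Mixed-precision sketch: simulate FP8 weight storage for most Linear layers, but
# leave an assumed list of sensitive layers at their original (BF16) precision.
import torch
import torch.nn as nn

SENSITIVE_SUFFIXES = ("lm_head", "mlp.down_proj")  # hypothetical outlier-prone layers

def simulate_mixed_precision_fp8(model: nn.Module) -> None:
    for name, module in model.named_modules():
        if not isinstance(module, nn.Linear):
            continue
        if name.endswith(SENSITIVE_SUFFIXES):
            continue  # keep sensitive layers untouched
        orig_dtype = module.weight.dtype
        # Round-trip through FP8 E4M3 to mimic quantized weights; a real deployment
        # keeps the FP8 tensor plus scales and runs the FP8 GEMM path instead.
        module.weight.data = module.weight.data.to(torch.float8_e4m3fn).to(orig_dtype)

# Usage: simulate_mixed_precision_fp8(model), then re-run the eval parity harness.
```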

Before any quantization, build the eval parity harness. Without it, you have no way to know whether your savings came at a quality cost.
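
As a concrete anchor, here is roughly what that gate looks like, assuming you already collect per-task scores for the BF16 baseline and the quantized candidate; the task names and scores below are placeholders, and the 1.5% budget matches the tolerance used in the numbers above.

```python
# Minimal eval parity gate: fail the quantized candidate if any task regresses by
# more than the relative tolerance. Task names and scores are placeholders.
TOLERANCE = 0.015  # 1.5% relative-regression budget

baseline  = {"support_triage": 0.861, "code_completion": 0.742, "rag_answering": 0.804}
candidate = {"support_triage": 0.858, "code_completion": 0.729, "rag_answering": 0.801}

def parity_report(baseline: dict, candidate: dict, tol: float = TOLERANCE) -> bool:
    ok = True
    for task, base in baseline.items():
        drop = (base - candidate[task]) / base
        status = "PASS" if drop <= tol else "FAIL"
        ok = ok and status == "PASS"
        print(f"{task:18s} baseline={base:.3f} quantized={candidate[task]:.3f} "
              f"drop={drop:+.2%} {status}")
    return ok

if not parity_report(baseline, candidate):
    raise SystemExit("Quantized model failed eval parity; do not ship.")
```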

Want this run on your workload?

The cost audit ends with a custom version of this benchmark, run on your traffic. Refundable.

Talk to an engineer →