What 14 migrations taught us about eval parity
Eval parity is the part everyone underestimates. Here's how we structure it now.
Eval parity, the proof that the new stack is as good as the old one, is the part of every migration that gets underestimated at proposal time and overruns its budget at delivery time. The reason is the same at both ends: the work is unglamorous, the failure modes are subtle, and the only way to do it well is to spend more time on the eval suite than on the optimization itself.
Across the migrations we have run, eval parity work has averaged 30–40% of the total engagement budget. Migrations where it was less than 20% later produced surprises. Migrations where it was over 50% were proposals where we underbid the eval phase and ate the cost.
# The categories of eval that matter
For a serious migration, you want at least four kinds of eval running before, during, and after the cutover:
- Standard benchmarks. MMLU-Pro, IFEval, BBH, GPQA. These catch obvious capability drops but rarely catch product-specific regressions.
- A workload-specific golden set. 200–2,000 hand-curated prompt/response pairs from real production traffic, scored against a rubric the customer's team agrees on. This is where 80% of the regressions actually surface.
- Production replay. A captured slice of recent production traffic, replayed offline against both the old and new stack. Compare side-by-side using whatever production-quality signals exist (LLM-as-judge, click-through, downstream outcome metrics).
- Live shadow traffic during the migration. The new stack runs in production receiving real requests, but its outputs are dropped and only the metrics are captured. Catches the things that only show up under real load and real distribution shift.
Most teams have at most one or two of these. The cheapest mistake to avoid on the engagement side is starting the migration without building the missing ones.
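To make the production-replay category concrete, here is a minimal sketch. It assumes the captured slice is one JSON request per line, that both stacks expose HTTP endpoints accepting the same payload, and that a `judge` callable wraps whatever comparison signal you have; the URLs, payload schema, and judge are illustrative placeholders, not any specific product's API.

```python
import json
import requests

def replay_slice(slice_path, old_url, new_url, judge):
    """Replay a captured traffic slice against the old and new stacks and score side by side."""
    rows = []
    with open(slice_path) as f:
        for line in f:
            request = json.loads(line)  # one captured production request per line
            old = requests.post(old_url, json=request, timeout=120).json()
            new = requests.post(new_url, json=request, timeout=120).json()
            rows.append({
                "request": request,
                "old": old,
                "new": new,
                # judge() can be an LLM-as-judge call or any downstream-outcome heuristic
                "verdict": judge(request, old, new),
            })
    return rows
```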
#What "parity" should mean
Parity does not mean "exactly the same numbers." A migration always introduces some variance: different floating-point ops, different sampling implementations, different KV cache behavior. Bit-exact parity is impossible, and chasing it is a waste of engineering effort.
A useful parity definition has three parts:
- Per-task delta within tolerance. Pick a tolerance per task. 1.5% is our default; tighter for safety-critical tasks, looser for tasks that are already noisy.
- No new failure modes. Categorize the failures of both systems and confirm the new system does not introduce a category that did not exist before.
- Direction-of-error preserved. If the old system errs on the side of caution, the new system should too. A net-flat eval that swaps cautious-errors for confident-errors is a regression in the user experience even if the score is unchanged.
The third one is the part most automated eval pipelines miss. We catch it by sampling 50–100 disagreements between old and new systems and human-reading them side by side.
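A minimal sketch of how the three checks can be wired into a single gate; the task names, tolerances, and the shapes of the score and failure-tag dictionaries are assumptions for illustration, not the fp-eval interface.

```python
import random

# Per-task tolerances on absolute score delta; 1.5% default, tighter for safety-critical tasks.
TOLERANCE = {"default": 0.015, "safety_triage": 0.005}

def parity_gate(old_scores, new_scores, old_failures, new_failures, n_reads=100, seed=0):
    """old_scores/new_scores: {task: score in [0, 1]}.
    old_failures/new_failures: {example_id: failure_category} for each system."""
    # 1. Per-task delta within tolerance.
    out_of_tolerance = []
    for task, old in old_scores.items():
        delta = new_scores[task] - old
        if delta < -TOLERANCE.get(task, TOLERANCE["default"]):
            out_of_tolerance.append((task, old, new_scores[task], delta))

    # 2. No new failure modes: categories the new system produces that the old one never did.
    new_modes = set(new_failures.values()) - set(old_failures.values())

    # 3. Direction-of-error: queue a sample of disagreements for side-by-side human reading.
    disagreements = set(old_failures) ^ set(new_failures)
    random.seed(seed)
    to_read = random.sample(sorted(disagreements), min(n_reads, len(disagreements)))

    return out_of_tolerance, new_modes, to_read
```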
# The recurring failure modes
Some patterns we see often enough to call out:
Tone drift. Quantization pushes the model toward shorter, more direct responses. On many products this is a feature; on customer support and consumer-facing products it is a regression. Your standard benchmarks will not catch this; you need a tone rubric or a calibrated LLM-as-judge.
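A minimal sketch of the judge side, assuming an OpenAI-compatible endpoint; the model name and rubric text are placeholders you would calibrate against a handful of human ratings before trusting the scores.

```python
from openai import OpenAI

RUBRIC = (
    "You are comparing two support replies to the same customer message. "
    "Score each reply 1-5 for tone (warmth, patience, completeness) and answer "
    'only with JSON: {"old": <score>, "new": <score>}.'
)

client = OpenAI()  # assumes an OpenAI-compatible judge endpoint is configured

def tone_judge(customer_message, old_reply, new_reply, model="gpt-4o-mini"):
    """Pairwise tone comparison of the old and new stacks' replies."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": (
                f"Customer message:\n{customer_message}\n\n"
                f"Old stack reply:\n{old_reply}\n\n"
                f"New stack reply:\n{new_reply}"
            )},
        ],
    )
    return resp.choices[0].message.content  # parse the JSON scores downstream
```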
Calibration drift. Quantization or distillation changes how confident the model is about uncertain answers. If your product downstream uses model confidence (for retrieval gating, for routing to a human, for "I don't know" responses), a calibration shift is a regression even if accuracy is unchanged.
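Expected calibration error (ECE) computed for both stacks over the same golden-set examples is one way to put a number on this; a minimal sketch, assuming each stack can report a per-example confidence such as the probability it assigned to its chosen answer.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bucket predictions by confidence; ECE is the weighted gap between each
    bucket's accuracy and its mean confidence."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap  # in_bin.mean() is the fraction of examples in this bucket
    return ece

# Compare the two stacks on the same examples; a large ECE shift is a calibration
# regression even when raw accuracy is flat.
# ece_old = expected_calibration_error(conf_old, correct_old)
# ece_new = expected_calibration_error(conf_new, correct_new)
```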
Long-context regression. Most evals are short-prompt. If your product uses long contexts, the standard suite will not flag a 32K-context regression. We always include a long-context sub-test for products that depend on it.
Tool-use regression. Models often degrade more on tool-use chains than on single-turn quality. Multi-step task evals catch this; standard multiple-choice benchmarks do not.
# How we structure parity for an engagement
The eval phase happens in four parts:
- Pre-engagement (audit). We inventory the customer's existing evals, identify gaps, and write a parity-tolerance contract that goes into the engagement scope.
- Pre-migration (week 1–2). Build or extend the workload-specific golden set. Set up production replay infrastructure. Establish the baseline numbers for the existing system.
- During migration (week 3+). Every quantization step, every kernel change, every runtime swap is gated on rerunning the eval suite. We do not ship a model that has not passed the gate.
- Post-migration (week N+1 onward). Drift detection in production via shadow traffic against a live judge model, alerting on per-task regressions that exceed half the agreed tolerance.
Step 4 is the part most engagements skip and where most regressions surface. A quantization that passed offline can drift a month later because of distribution shift in the input traffic. The customer thinks "the old version was fine," and they are right: what changed is the inputs, and the new model handles that change less gracefully than the old one would have.
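A minimal sketch of the step-4 alerting rule, assuming per-task judge scores aggregated from the shadow traffic; the dictionaries and the default tolerance are placeholders.

```python
def drift_alerts(baseline, shadow, tolerance, alert_fraction=0.5):
    """baseline/shadow: {task: score in [0, 1]} from the offline gate and the live shadow judge.
    Alert when a per-task regression exceeds half the agreed parity tolerance."""
    alerts = []
    for task, base in baseline.items():
        if task not in shadow:
            continue
        delta = shadow[task] - base
        if delta < -alert_fraction * tolerance.get(task, 0.015):
            alerts.append({"task": task, "baseline": base, "shadow": shadow[task], "delta": delta})
    return alerts
```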
# What we open-source
The eval-parity tooling we use is in the fp-eval repo (referenced on /resources). It is a thin layer over EleutherAI's lm-evaluation-harness with adapters for OpenAI / Anthropic / vLLM / TensorRT-LLM / SGLang, a paired-bootstrap significance test, and an HTML parity report that becomes the artifact we send the customer.
It is not magic. It is a workflow. The hard part, building the workload-specific golden set and the rubric, is engineering work no tool can substitute for.
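For reference, the paired-bootstrap test itself is only a few lines of NumPy; a minimal sketch, assuming both stacks were scored on the same prompts in the same order (the function name and return values are illustrative, not the fp-eval interface).

```python
import numpy as np

def paired_bootstrap(old_scores, new_scores, n_resamples=10_000, seed=0):
    """old_scores/new_scores: per-example scores on identical prompts, aligned by index.
    Returns the observed mean delta (new - old) and the fraction of bootstrap
    resamples in which the new stack is at least as good as the old one."""
    old = np.asarray(old_scores, dtype=float)
    new = np.asarray(new_scores, dtype=float)
    assert old.shape == new.shape, "scores must be paired example-for-example"
    deltas = new - old
    rng = np.random.default_rng(seed)
    # Resample example indices with replacement; pairing keeps prompt difficulty matched.
    idx = rng.integers(0, len(deltas), size=(n_resamples, len(deltas)))
    resampled_means = deltas[idx].mean(axis=1)
    return deltas.mean(), float((resampled_means >= 0.0).mean())
```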
# The summary
Eval parity is the part of a migration where the failure mode is "ship a regression silently." That failure mode is much worse for the customer than a delayed migration or a noisy benchmark. Most of our engagement risk-reduction is concentrated here. Most of the customer's objections during the cutover are answered here. If you are choosing a consultancy for a migration, the question to ask is not how fast they ship; it is how they will prove parity, in writing, before you flip traffic.