Speculative decoding doesn't help if your batch size is wrong
Three case studies where the same trick gave a 4.1×, a 1.05×, and a 0.94×. The interaction between speculative decoding and continuous batching is the part nobody explains.
Speculative decoding is the optimization that everybody asks for and that most engagements do not end up using. We have shipped it in production three times. In one of those it gave a 4.1× throughput uplift on a 70B model. In another it was a wash, and in the third it actually slowed things down, to 0.94× of baseline.
The variable that explains the difference is not the draft model. It is the batch size.
#Why the speedup is conditional
Speculative decoding works by having a small fast "draft" model propose multiple tokens, then a big "verifier" model check them in parallel. When acceptance is high, you generate N tokens for the cost of one big-model forward pass. When acceptance is low, you pay for the draft work and most of the verifier work, and you lose.
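To make the accept/reject loop concrete, here is a minimal sketch of one draft-then-verify step. It uses greedy agreement for acceptance and calls the verifier once per position purely for readability; real implementations (including the EAGLE family) accept via rejection sampling against the verifier's distribution and score all draft positions in a single batched forward pass. `draft` and `verifier` are stand-in callables, not any particular library's API.

```python
# Minimal sketch of one draft-then-verify step with greedy acceptance.
# `draft` and `verifier` are stand-in callables mapping a token-id context
# to the next token id; real systems verify all k positions in a single
# batched forward pass and accept via rejection sampling, not exact match.
def speculative_step(context, draft, verifier, k=4):
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(context)
    for _ in range(k):
        t = draft(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The big verifier checks the proposals in order; the first
    #    disagreement truncates the run. (Shown as k separate calls only
    #    for clarity.)
    accepted, ctx = [], list(context)
    for t in proposed:
        v = verifier(ctx)
        if v != t:
            accepted.append(v)       # keep the verifier's own token instead
            return accepted
        accepted.append(t)
        ctx.append(t)

    # 3. All k proposals accepted: the verifier still emits one bonus token,
    #    so a fully successful step yields k + 1 tokens per verifier pass.
    accepted.append(verifier(ctx))
    return accepted
```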
The catch: the verifier checks the N draft tokens in a single forward pass, and the marginal cost of those extra tokens is sub-linear. A lone decode step is memory-bandwidth bound, with plenty of idle compute to absorb the draft tokens almost for free. But if your batch is already large (you are serving many concurrent requests), that compute headroom is already being spent on the other requests: the verifier's work was already cheap per token, and the speculative speedup gets compressed against the existing batching speedup.
Concretely: if you are running batch size 1 (single user, low concurrency), speculative decoding gives you most of its theoretical 3–5× speedup. If you are running batch size 64+ at saturation, speculative decoding gives you maybe 1.0–1.2×, because continuous batching is already extracting most of the parallelism speculation would otherwise exploit.
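A back-of-envelope model makes the compression visible. Everything below is an assumption, not a measurement: the step-time function is a crude roofline (flat while memory-bound, linear once compute-bound) and the constants are invented, so the absolute numbers it prints are meaningless. The shape is the point: big gains at batch 1, fading toward 1× as the batch grows.

```python
# Back-of-envelope model: speculative speedup vs. decode batch size.
# All constants are invented for illustration; calibrate against profiling
# before trusting the actual values printed.
def step_time(tokens_in_flight, mem_bound=1.0, per_token_compute=0.01):
    """Crude roofline: a decode step costs a flat memory-bound time until
    enough tokens are in flight to saturate compute, then grows linearly."""
    return max(mem_bound, per_token_compute * tokens_in_flight)

def expected_tokens(acceptance, k):
    """Expected tokens emitted per verify step: 1 + a + a^2 + ... + a^k."""
    return sum(acceptance ** i for i in range(k + 1))

def speedup(batch, acceptance=0.8, k=4, draft_overhead=0.15):
    baseline = step_time(batch)                      # 1 token/request/step
    spec = step_time(batch * (k + 1)) + draft_overhead * baseline
    return expected_tokens(acceptance, k) * baseline / spec

for b in (1, 8, 32, 64):
    print(f"batch {b:3d}: ~{speedup(b):.2f}x")
# With these made-up constants: ~2.9x at batch 1, shrinking to ~1.0x by
# batch 64 as the verifier's spare compute disappears.
```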
#Case 1, voice agents, 4.1× win
Workload: real-time voice agent, sub-200ms p95 latency requirement, batch size effectively 1 because each request is one user's turn and they cannot wait. Adding EAGLE-3 speculative decoding (Llama-3.3-70B verifier, Llama-3.2-1B draft) gave a 4.1× throughput uplift and dropped median TTFT from 180ms to 55ms.
This is the textbook case. Latency-sensitive single-stream workload, low natural batching, lots of headroom for speculation to fill.
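If you want to reproduce the latency side of this measurement, the sketch below times TTFT and decode throughput against an OpenAI-compatible streaming endpoint, which vLLM, TensorRT-LLM, and SGLang can all expose. The URL, model id, and prompt are placeholders for whatever your deployment actually serves, and the token counting is approximate (one streamed chunk is treated as one token).

```python
# Time TTFT and decode throughput against an OpenAI-compatible streaming
# endpoint. URL, model id, and prompt are placeholders for your deployment.
import json
import time
import requests

URL = "http://localhost:8000/v1/completions"          # assumed local server
PAYLOAD = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",     # placeholder model id
    "prompt": "Summarize the caller's request as a single action item:",
    "max_tokens": 200,
    "stream": True,
}

start = time.perf_counter()
ttft, chunks = None, 0
with requests.post(URL, json=PAYLOAD, stream=True, timeout=120) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line.startswith(b"data: "):
            continue
        data = line[len(b"data: "):]
        if data == b"[DONE]":
            break
        json.loads(data)                              # sanity-check the chunk
        if ttft is None:
            ttft = time.perf_counter() - start        # first streamed token
        chunks += 1

total = time.perf_counter() - start
print(f"TTFT: {ttft * 1000:.0f} ms")
print(f"decode rate: {chunks / (total - ttft):.1f} chunks/s (~tokens/s)")
```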
#Case 2, code assistant, 1.05× (wash)
Workload: bulk code completion, 60+ concurrent requests at peak, batch sizes of 32–48 typical. Adding the same EAGLE-3 setup produced a 5% throughput improvement, within noise. Continuous batching was already extracting most of the available parallelism, and speculation just added draft-model overhead without a corresponding saving on the verifier side.
We left it off. The complexity was not worth the rounding-error gain.
#Case 3, RAG platform, 0.94× (regression)
Workload: RAG over a long-context document store, prompts averaging 8K tokens, generations averaging 200 tokens. Speculative decoding actively slowed things down by 6%, which surprised us until we measured the prefill-vs-decode split.
This workload was prefill-dominated. The decode steps, which are all speculation can accelerate, were already cheap relative to the prefill cost, so the marginal saving was small. Meanwhile, the draft model added wall-clock latency to every step, and the GPU memory it occupied cut into the verifier's KV-cache budget and therefore its batch capacity. Net loss.
The fix was not to tune speculative decoding harder. It was to drop it entirely and instead invest the engineering time in a better prefix cache, which gave a real 2.4× win on this workload because of the high prefix overlap across queries.
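Prefix overlap is easy to estimate before committing to the cache work. The sketch below is a rough heuristic over a plain-text prompt log: the file path, one-prompt-per-line format, and whitespace "tokenization" are all assumptions; for a real estimate, tokenize with the serving model's tokenizer.

```python
# Rough estimate of cross-request prefix overlap from a prompt log.
# File path, one-prompt-per-line format, and whitespace "tokenization" are
# placeholder assumptions.
from itertools import pairwise  # Python 3.10+

def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

with open("prompts.log") as f:
    prompts = [line.split() for line in f if line.strip()]

# Sorting puts prompts that share a prefix next to each other, so neighbour
# overlap is a usable lower bound on what a prefix cache could reuse.
prompts.sort()
shared = sum(common_prefix_len(a, b) for a, b in pairwise(prompts))
total = sum(len(p) for p in prompts)
print(f"~{100 * shared / total:.0f}% of prompt tokens overlap with a neighbour")
```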
#How we decide now
Before recommending speculative decoding, we look at three numbers from the production traffic:
- Median batch size at the point of decode. If it is < 4, speculation is likely to win. If it is > 16, the gain is probably small.
- Prefill / decode ratio. If prompts are long and generations are short (RAG, summarization), prefill dominates and speculation does not help much.
- Acceptance rate of the draft model. Below ~60% acceptance, the math stops working regardless of batch size.
If all three look favorable, we run a 2-week pilot. If they do not, we redirect that engineering investment somewhere with a better return.
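For the first two numbers, a log-scraping sketch along these lines is usually enough. The CSV paths and column names are assumptions about your logging, and the printed verdicts just encode the rules of thumb above; the acceptance rate, the third number, only falls out of the pilot itself.

```python
# Pre-flight check on production traffic. CSV paths and column names are
# assumptions about your logging; thresholds encode the rules of thumb above.
# The acceptance rate (the third number) has to come from the pilot itself.
import csv
import statistics

with open("requests.csv") as f:                 # assumed: per-request token counts
    rows = list(csv.DictReader(f))
prompt_toks = [int(r["prompt_tokens"]) for r in rows]
gen_toks = [int(r["generated_tokens"]) for r in rows]

with open("decode_batch_samples.csv") as f:     # assumed: sampled decode batch sizes
    batches = [int(r["batch_size"]) for r in csv.DictReader(f)]

median_batch = statistics.median(batches)
prefill_decode = sum(prompt_toks) / max(sum(gen_toks), 1)

print(f"median decode batch size  : {median_batch}")
print(f"prefill/decode token ratio: {prefill_decode:.1f}")
if median_batch < 4:
    print("decode batching is low -> speculation is worth a pilot")
elif median_batch > 16:
    print("decode is already well batched -> expect a small gain at best")
```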
#What this means for runtime selection
vLLM, TensorRT-LLM, and SGLang all support speculative decoding now. The runtime is not the bottleneck. The workload is. Picking a runtime because it has "the latest spec decode" is exactly the kind of mistake that produces Case 2: you implemented the feature, you are paying for the complexity, and you are not measurably faster.
The lesson, again: pick the workload, measure it, then pick the optimization. Doing it the other way around is how teams end up with elaborate infrastructure that does not move the metrics that matter.