
Prefix caching is the highest-leverage optimization you're not doing

A 3× cost-per-token reduction is sitting in your existing system prompts and you can ship it in a day. Here is how it works and where the gotchas are.

April 8, 2026 · by Kaushlendra Kumar Giri

Of all the inference optimizations available in 2026, prefix caching has the largest gap between "available out of the box in your runtime" and "actually configured well in production." vLLM, TensorRT-LLM, and SGLang all support it. Most teams either do not have it on, or have it on with default settings that capture half the available benefit.

This post covers what prefix caching does, why it works, and how to configure it for the workloads where it is genuinely transformative.

#What it does

When two requests share a prefix, say, a 2,000-token system prompt that every user gets, the model would normally redo the prefill work on those tokens for each request. Prefix caching skips that work for the second request: the KV cache for the shared prefix is already computed, you reuse it, and you only pay prefill for the divergent tail.

The savings are proportional to the prefix-to-total-input ratio. If your prompts average 2,500 tokens with a 2,000-token shared system prompt and 500-token user message, the cache saves 80% of prefill work on every cache hit.
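To put numbers on that, the expected prefill savings are roughly the prefix ratio scaled by the cache hit rate. A minimal back-of-the-envelope sketch in Python (the function is illustrative, not part of any runtime):

def expected_prefill_savings(prefix_tokens: int, total_tokens: int, hit_rate: float) -> float:
    # Fraction of prefill compute saved, assuming a miss pays full prefill
    # and a hit only pays for the tokens after the shared prefix.
    return (prefix_tokens / total_tokens) * hit_rate

# 2,000-token shared system prompt + 500-token user message
print(expected_prefill_savings(2000, 2500, hit_rate=1.0))  # 0.80 -> 80% saved on every hit
print(expected_prefill_savings(2000, 2500, hit_rate=0.7))  # 0.56 -> ~56% saved across all traffic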

#Where it lives in your runtime

vLLM has had paged-attention-based prefix caching since 0.5.x. SGLang has shipped RadixAttention since launch (it is the most sophisticated implementation in the open ecosystem). TensorRT-LLM 0.14+ has KV cache reuse, which is less mature than the others but improving.

All three need to be turned on explicitly. The default in most distributions is "disabled" or "enabled with conservative cache size."
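In vLLM, for example, turning it on is a single engine flag. A minimal sketch using the offline LLM entrypoint; the model name is a placeholder and argument names can shift between releases, so check your version's engine arguments:

from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,                # off by default in many distributions
)
outputs = llm.generate(
    ["<2KB system prompt> User asked: what is the return policy? Answer:"],
    SamplingParams(max_tokens=64),
)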

#The hit rate is everything

Without measurement, none of the configuration advice below makes sense. The first thing to measure on any production deployment is the prefix cache hit rate. The runtime already tracks this; you have to surface it to your monitoring stack.
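A minimal sketch of surfacing it from a local server's Prometheus endpoint; the URL and the presence of prefix-cache series are assumptions here, and the exact metric names differ by runtime and version:

import urllib.request

METRICS_URL = "http://localhost:8000/metrics"  # assumed local vLLM-style server

with urllib.request.urlopen(METRICS_URL) as resp:
    metrics = resp.read().decode("utf-8")

# Print whatever prefix-cache series the runtime emits, then ship them to monitoring.
for line in metrics.splitlines():
    if "prefix_cache" in line and not line.startswith("#"):
        print(line)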

Approximate hit rates by workload type:

  • Single-product chatbot with consistent system prompt: 60–90% hit rate
  • Multi-tenant SaaS with per-customer prompts: 30–60% hit rate
  • RAG with top-K retrieval and no consistent prefix: 5–25% hit rate
  • Agent loop with growing context: 70–95% hit rate (each step shares the prior steps)
  • Code completion with file context: 40–70% hit rate

If you are at the bottom of any of these ranges, there is configuration headroom. If you are at the top, the workload is doing what it should.

#The configuration knobs that matter

Cache size. The default is usually too small. The prefix cache's memory comes out of the same GPU pool as the model weights and the active batch's KV cache. For a workload with high cache-hit potential, push the cache to consume 30–50% of free GPU memory, not 5–10%.

Eviction policy. LRU is the default. For workloads with predictable shared prefixes (a system prompt that never changes), an LRU policy can evict your most valuable cached entries when temporarily popular per-user prefixes flood in. Pinning cache entries for known-shared prefixes solves this where the runtime supports it; check your runtime's prefix cache options for a pin or priority setting.

Cache granularity. Most runtimes cache at the token-block level (16 or 32 tokens). Larger blocks waste cache for short prefixes; smaller blocks have higher overhead. The default is usually fine; if you tune, do so based on measured hit rate impact.

Multi-GPU shard alignment. Tensor-parallel deployments need the cache sharded the same way as the model. This is automatic in vLLM but worth verifying: a misaligned cache effectively does not work.
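Putting the knobs together, a hedged vLLM-flavored sketch; the values are illustrative starting points rather than recommendations for your hardware, and parameter availability varies by version:

from vllm import EngineArgs, LLMEngine

engine = LLMEngine.from_engine_args(EngineArgs(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    enable_prefix_caching=True,       # the feature itself
    gpu_memory_utilization=0.90,      # total pool for weights + KV; a bigger pool leaves more room for reusable cache
    block_size=16,                    # cache granularity in tokens; retune only against measured hit rate
    tensor_parallel_size=1,           # cache sharding follows TP automatically, but verify on multi-GPU
))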

#The trick that doubles cache hits

The single highest-leverage configuration change for most teams: structure your prompts so the cacheable part comes first.

// Bad: user message at the start
"User asked: what is the company return policy?
[2KB system prompt with policy details]
Answer:"

// Good: system prompt first, user message at the end
"[2KB system prompt with policy details]
User asked: what is the company return policy?
Answer:"

The cacheable portion has to be a literal prefix of the input. If the divergent user content is at the start, no caching can happen. Most prompt engineering tutorials structure prompts in the "bad" format above, and many teams have not noticed they are leaving an order-of-magnitude optimization on the table.

Reorder the prompts, redeploy, measure the cache hit rate. We have seen teams go from a 12% to a 78% hit rate from this single change, which works out to more than 6× the prefill savings.
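In template code the fix is small: keep the static text as the literal front of the string. A minimal sketch (the template wording and function name are illustrative):

SYSTEM_PROMPT = "[2KB system prompt with policy details]"  # identical for every request

def build_prompt(user_message: str) -> str:
    # Shared, never-changing text first so it forms a literal prefix the cache can reuse;
    # only the per-user tail needs fresh prefill on a cache hit.
    return f"{SYSTEM_PROMPT}\nUser asked: {user_message}\nAnswer:"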

#Where prefix caching does not help

  • Decode-bound workloads. If most of your wall-clock time goes to generation (long outputs, short prompts), skipping prefill saves close to nothing.
  • Workloads with no shared structure. Single-shot text classification or summarization of one-off documents: there is nothing to cache.
  • Random-access patterns at scale. If your hit rate is below 10% and cache misses are evicting useful entries from other tenants, the cache might be marginally hurting throughput. Disable it for the affected workloads; a rough go/no-go check is sketched below.
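Folding these caveats together: the end-to-end gain is roughly the prefill share of total compute, times the prefix ratio, times the hit rate. A rough go/no-go sketch, with an arbitrary 5% cutoff chosen purely for illustration:

def worth_enabling(prefill_share: float, prefix_ratio: float, hit_rate: float,
                   min_gain: float = 0.05) -> bool:
    # prefill_share: fraction of compute spent in prefill (low for decode-bound work)
    # prefix_ratio:  shared-prefix tokens / total input tokens (zero with no shared structure)
    # hit_rate:      measured prefix cache hit rate
    return prefill_share * prefix_ratio * hit_rate >= min_gain

print(worth_enabling(prefill_share=0.6, prefix_ratio=0.8, hit_rate=0.7))    # True: ~34% of compute saved
print(worth_enabling(prefill_share=0.05, prefix_ratio=0.8, hit_rate=0.7))   # False: heavily decode-bound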

#What to ship today

  1. Verify prefix caching is enabled in your runtime config. If not, enable it.
  2. Expose the cache hit rate metric to your monitoring.
  3. Look at the hit rate; compare to the rough ranges above. If your workload should be high-hit and is not, your prompts are probably structured wrong.
  4. Reorder prompts so the cacheable prefix comes first.
  5. Re-measure. Most teams see immediate 1.5–3× cost-per-token improvements without changing anything else.

This is the cheapest production optimization in inference today. Run it first, before reaching for speculative decoding or quantization, because the work is one config change and a prompt-template refactor, not weeks of engineering.
