
vLLM vs TensorRT-LLM (vs SGLang)

Three open-source inference engines, three different bets. We have shipped all of them in production. Here is the decision tree we actually use.

The 30-second answer

Default to vLLM. Move to TensorRT-LLM when you have one model on one hardware target with enough scale to amortize the build complexity. Reach for SGLang when your workload is structured generation (JSON, agent loops, constrained decoding, long shared prefixes).

Pick by traffic shape, not Twitter consensus. Below is the actual reasoning.

What each one is good at

| Dimension | vLLM | TensorRT-LLM | SGLang |
| --- | --- | --- | --- |
| Hardware | NVIDIA, AMD, TPU, Intel | NVIDIA only | NVIDIA, AMD |
| Build | pip install | engine compile per model + config | pip install + JIT |
| Time to first token | minutes | hours (the compile step is the cost) | minutes |
| Peak throughput | baseline | +10–30% on tuned configs | ~vLLM, faster on structured gen |
| Continuous batching | first-class | in-flight batching, mature | first-class, with RadixAttention |
| Prefix caching | paged-attention based | in 2024.x+, less mature | RadixAttention, best-in-class |
| Quantization | AWQ, GPTQ, FP8, INT8 | FP8 native, AWQ via plugins | AWQ, GPTQ, FP8 |
| Speculative decoding | Medusa, EAGLE, draft | Medusa, ReDrafter | EAGLE-3, lookahead |
| Structured output | guided JSON via outlines | limited | first-class, its origin story |
| Disaggregated prefill/decode | experimental (P/D) | production via Dynamo | production via OpenRLHF integrations |
| Community | biggest, fastest moving | NVIDIA-led, slower cycle | small but technically dense |
| Best for | most workloads, most teams | one model, one cluster, at scale | agents, JSON, RAG with prefixes |

When to pick vLLM

You are starting a new self-hosted inference deployment. You don't yet know your traffic shape with certainty. You want to be able to swap models, change hardware, and ship in weeks, not quarters.

vLLM is the right default. Continuous batching is mature, paged attention works, prefix caching is decent, the community ships fast (a major release every ~6 weeks), and the OpenAI-compatible API means your client code does not need to change. If you migrate off OpenAI tomorrow, vLLM is the runtime that lets you do it without committing to a particular model, vendor, or hardware target.
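To make the drop-in claim concrete, here is a minimal client-side sketch, assuming a server started with `vllm serve meta-llama/Llama-3.1-8B-Instruct` on the default port; the model name, port, and dummy API key are placeholders for your own deployment.

```python
from openai import OpenAI

# Point the stock OpenAI client at the vLLM server instead of api.openai.com.
# Assumes a server started with e.g. `vllm serve meta-llama/Llama-3.1-8B-Instruct`;
# model name, port, and API key are placeholders for your own setup.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize paged attention in one sentence."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```

The only change from a hosted-OpenAI client is the base_url, which is exactly why vLLM works as a low-commitment first runtime.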

The honest catch: vLLM's peak throughput on a tuned NVIDIA cluster is ~10–30% behind TensorRT-LLM on the exact same hardware. For most workloads, that gap is dominated by other inefficiencies (your batching policy, your prompt distribution, your network) and is not worth the engineering cost to close.

When to pick TensorRT-LLM

You have one model running at significant scale on NVIDIA hardware, you have already tuned vLLM, and you have enough GPUs that the absolute throughput delta matters. Concretely, our threshold is around 16+ GPUs serving a single model: at that size a 20% throughput uplift saves enough GPU-hours to pay for the engineering effort within a quarter.
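To sanity-check that threshold against your own numbers, here is the back-of-envelope amortization we mean; the GPU-hour rate and engineering cost below are illustrative assumptions, not quotes.

```python
# Back-of-envelope payback estimate for a TensorRT-LLM migration.
# All inputs are illustrative assumptions; substitute your own numbers.

gpus = 16                   # GPUs serving the single model
gpu_hour_cost = 4.00        # $/GPU-hour (placeholder; use your actual rate)
throughput_uplift = 0.20    # 20% more tokens/s on the same hardware
engineering_cost = 20_000   # $ for the port + tuning (placeholder estimate)

# A 20% uplift means the same traffic fits on ~1/1.2 of the fleet,
# so it frees ~17% of the GPUs, not 20%.
gpus_saved = gpus * (1 - 1 / (1 + throughput_uplift))
monthly_savings = gpus_saved * gpu_hour_cost * 24 * 30
payback_days = engineering_cost / (monthly_savings / 30)

print(f"GPUs freed: {gpus_saved:.1f}")
print(f"Monthly savings: ${monthly_savings:,.0f}")
print(f"Payback: {payback_days:.0f} days")
```

With these placeholder inputs the port pays for itself in roughly 78 days; with your real GPU rate and engineering estimate, the answer can flip either way.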

The build cost is real. TensorRT-LLM compiles a model into a per-config engine: you pin the model, dtype, max batch size, max sequence length, and tensor-parallel degree, and you get an optimized binary in return. Change any of those and the engine has to be rebuilt. Compile times run from minutes to hours depending on model size. For a multi-tenant platform serving many models or rapidly iterating on configurations, the rebuild loop is operationally painful.

Pair it with NVIDIA's Dynamo for disaggregated prefill/decode; if your workload is heavily prefill-dominated (long prompts, short generations), the gains compound.

When to pick SGLang

Your workload is structured generation: agent loops, JSON-constrained outputs, RAG with shared system prompts, multi-turn chat with growing context. SGLang's RadixAttention is the best prefix-cache implementation in the open-source ecosystem, with published cache-hit rates of 80–90% on workloads with shared prefixes and corresponding 2–5× wall-clock reductions on prefill.
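A rough way to translate those figures to your own traffic is the back-of-envelope model below; it is not a benchmark, and the shared-prefix fraction and hit rate are assumptions you would measure from real requests.

```python
# Back-of-envelope prefill speedup from prefix caching.
# Inputs are illustrative; measure shared_prefix_fraction and hit_rate on your traffic.

def prefill_speedup(shared_prefix_fraction: float, hit_rate: float) -> float:
    """Naive model: only the shared prefix can be cached, and only on cache hits.
    Remaining prefill work is (1 - fraction_cached); speedup is its inverse.
    Ignores scheduling overhead, so treat the result as an upper bound."""
    fraction_cached = shared_prefix_fraction * hit_rate
    return 1.0 / (1.0 - fraction_cached)

# e.g. RAG with a long shared system prompt plus per-request retrieved context:
print(prefill_speedup(shared_prefix_fraction=0.6, hit_rate=0.85))  # ~2.0x
print(prefill_speedup(shared_prefix_fraction=0.8, hit_rate=0.90))  # ~3.6x
```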

SGLang also has the cleanest constrained-decoding path. If you are building anything that pushes structured outputs through a JSON schema or a regex grammar, SGLang's native support is significantly faster than bolt-on libraries layered over vLLM.
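To make "pushing structured outputs through a JSON schema" concrete, here is a minimal sketch against an SGLang OpenAI-compatible endpoint. The host, port, model name, and the exact field used to pass the schema are assumptions: engines and releases differ on whether the schema goes in response_format or an extra_body field, so check the docs for the version you deploy.

```python
import json
from openai import OpenAI

# Assumes an SGLang server exposing an OpenAI-compatible endpoint on localhost:30000;
# host, port, model name, and the schema-passing field are placeholders to verify
# against the engine version you actually run.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

# The JSON schema the decoder is constrained to.
invoice_schema = {
    "type": "object",
    "properties": {
        "vendor": {"type": "string"},
        "total_usd": {"type": "number"},
        "line_items": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["vendor", "total_usd"],
}

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Extract the invoice fields from: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "invoice", "schema": invoice_schema},
    },
    max_tokens=256,
)
print(json.loads(resp.choices[0].message.content))
```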

Trade-off: SGLang has a smaller community and a slower OSS cycle than vLLM. Production gaps occasionally show up (autoscaling integrations, observability, model coverage on edge cases). We use it where the workload demands it, not where vLLM would do.

The decision tree we actually use

  1. Are you serving one model at < 16 GPU scale and unsure of long-term traffic? → vLLM. Stop here.
  2. Is your workload structured-output heavy (JSON, agent loops, > 40% prefix-shareable)? → SGLang.
  3. Do you have one model at 16+ GPUs, NVIDIA-only, with stable config? → benchmark TensorRT-LLM against your vLLM baseline. Only switch if the throughput delta > 15% AND the rebuild cost is amortizable in < 90 days.
  4. Are you mixing prefill-heavy and decode-heavy traffic on the same GPUs? → consider disaggregated prefill/decode (Dynamo + TensorRT-LLM, or llm-d). The gain only justifies the complexity at scale.
  5. Multi-tenant SaaS serving many models with rapid iteration? → vLLM. The TRT compile loop will eat your week.
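For teams that want the tree as a checklist in code, here is the same logic as a small Python sketch; the thresholds mirror the list above, and every input is something to measure on real traffic rather than guess.

```python
# The decision tree above, as a checklist. Thresholds mirror the prose;
# measure the inputs on real traffic before trusting the answer.

def pick_runtime(
    gpus_for_model: int,
    traffic_shape_known: bool,
    structured_output_heavy: bool,        # JSON, agent loops, >40% prefix-shareable
    nvidia_only: bool,
    config_is_stable: bool,
    mixed_prefill_decode: bool,
    multi_tenant_many_models: bool,
    measured_trt_uplift: float = 0.0,     # vs your tuned vLLM baseline, e.g. 0.18
    payback_days: float = float("inf"),   # time to amortize the rebuild cost
) -> str:
    # 1. One model at small scale, traffic shape still unknown.
    if gpus_for_model < 16 and not traffic_shape_known:
        return "vLLM"
    # 2. Structured-output heavy workloads.
    if structured_output_heavy:
        return "SGLang"
    # 3. One model at 16+ GPUs, NVIDIA-only, stable config: benchmark, then maybe switch.
    if gpus_for_model >= 16 and nvidia_only and config_is_stable:
        if measured_trt_uplift > 0.15 and payback_days < 90:
            return "TensorRT-LLM"
        return "vLLM (TRT-LLM delta did not clear the bar)"
    # 4. Mixed prefill-heavy and decode-heavy traffic on the same GPUs.
    if mixed_prefill_decode:
        return "consider disaggregated prefill/decode (Dynamo + TRT-LLM, or llm-d)"
    # 5. Multi-tenant, many models, rapid iteration.
    if multi_tenant_many_models:
        return "vLLM"
    return "vLLM (default)"

print(pick_runtime(gpus_for_model=8, traffic_shape_known=False,
                   structured_output_heavy=False, nvidia_only=True,
                   config_is_stable=False, mixed_prefill_decode=False,
                   multi_tenant_many_models=False))  # -> vLLM
```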

What the benchmarks actually say

Public numbers are noisy because everyone benchmarks different model sizes, batch shapes, and prompt distributions. The honest summary as of mid-2026:

  • Llama-3.3-70B FP8 on 8× H100 SXM, server scenario: TensorRT-LLM ~12,000 tok/s peak; vLLM ~10,200 tok/s peak; SGLang ~10,500 tok/s peak.
  • Same hardware, mixed-prefix RAG workload: SGLang wins by ~2× on prefill due to RadixAttention; the others catch up only with hand-tuned cache configs.
  • Mistral-Small-24B on 1× H200, latency-sensitive single-stream: all three within 5%; vLLM's shorter cold start usually wins overall.
  • Build/iterate cycle for a config change on a developer laptop: vLLM ~2 minutes, SGLang ~3 minutes, TensorRT-LLM 20 min – 2 hours.

Numbers are conservative midpoints from public vLLM, NVIDIA, and SGLang benchmark blogs; your hardware and workload will land somewhere in a band. We rerun on customer traffic during the optimization engagement; that is what produces the number you actually care about.

What we recommend most often

For 80% of the migrations we run, the answer is: start on vLLM, run it for 8 weeks, then revisit. Once you know your real prompt distribution, real cache hit rate, real GPU saturation, and real latency budget, the "should we move to TRT-LLM" question answers itself with concrete numbers instead of vendor copy.

Do not pick a runtime first and then build a workload around it. Pick the workload, measure it, then pick the runtime.

Stuck on the runtime decision?

We have benchmarked all three on real production traffic. The cost audit ends with a written runtime recommendation grounded in your actual workload.

Talk to an engineer →