The 30-second answer
Default to vLLM. Move to TensorRT-LLM when you have one model on one hardware target with enough scale to amortize the build complexity. Reach for SGLang when your workload is structured generation (JSON, agent loops, constrained decoding, long shared prefixes).
Pick by traffic shape, not Twitter consensus. Below is the actual reasoning.
What each one is good at
| Dimension | vLLM | TensorRT-LLM | SGLang |
|---|---|---|---|
| Hardware | NVIDIA, AMD, TPU, Intel | NVIDIA only | NVIDIA, AMD |
| Build | pip install | engine compile per model+config | pip install + JIT |
| Time to first deployment | minutes | hours (the engine compile is the cost) | minutes |
| Peak throughput | baseline | +10-30% on tuned configs | ~vLLM + faster on structured gen |
| Continuous batching | first-class | in-flight batching, mature | first-class with RadixAttention |
| Prefix caching | paged-attention based | in 2024.x+, less mature | RadixAttention, best-in-class |
| Quantization | AWQ, GPTQ, FP8, INT8 | FP8 native, AWQ via plugins | AWQ, GPTQ, FP8 |
| Speculative decoding | Medusa, EAGLE, draft | Medusa, ReDrafter | EAGLE-3, lookahead |
| Structured output | guided JSON via outlines | limited | first-class, its origin story |
| Disaggregated prefill/decode | experimental (P/D) | production via Dynamo | supported natively (P/D) |
| Community | biggest, fastest moving | NVIDIA-led, slower cycle | small but technically dense |
| Best for | most workloads, most teams | one model, one cluster, at scale | agents, JSON, RAG with prefixes |
When to pick vLLM
You are starting a new self-hosted inference deployment. You don't yet know your traffic shape with certainty. You want to be able to swap models, change hardware, and ship in weeks not quarters.
vLLM is the right default. Continuous batching is mature, paged attention works, prefix caching is decent, the community ships fast (a major release every ~6 weeks), and the OpenAI-compatible API means your client code does not need to change. If you migrate off OpenAI tomorrow, vLLM is the runtime that lets you do it without committing to anything.
The honest catch: vLLM's peak throughput on a tuned NVIDIA cluster is ~10–30% behind TensorRT-LLM on the exact same hardware. For most workloads, that gap is dominated by other inefficiencies (your batching policy, your prompt distribution, your network) and is not worth the engineering cost to close.
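Because vLLM exposes an OpenAI-compatible endpoint, swapping runtimes is mostly a base-URL change. A minimal sketch, assuming vLLM's default local endpoint on port 8000; the model name is illustrative:

```python
import json

# The same OpenAI-style chat payload works against api.openai.com and a
# local vLLM server; only the base URL (and the model name) change.
OPENAI_BASE = "https://api.openai.com/v1"
VLLM_BASE = "http://localhost:8000/v1"  # assumption: vLLM's default port

payload = {
    "model": "meta-llama/Llama-3.3-70B-Instruct",  # illustrative model
    "messages": [{"role": "user", "content": "Summarize continuous batching."}],
    "max_tokens": 128,
}

# Migrating off OpenAI is a one-line change to the base URL:
url = f"{VLLM_BASE}/chat/completions"
body = json.dumps(payload)  # identical request body either way
```

The point is not the snippet itself but what it implies: your client code, retries, and observability hooks survive a runtime swap untouched.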
When to pick TensorRT-LLM
You have one model running at significant scale on NVIDIA hardware, you have already tuned vLLM, and you have enough GPUs that the absolute throughput delta matters. Concretely: the threshold is around 16+ GPUs serving a single model, where a 20% throughput uplift saves enough GPU-hours to pay for the engineering effort within a quarter.
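The amortization claim above is back-of-envelope arithmetic you can check yourself. The $2/GPU-hour rate and $10k of engineering cost below are illustrative assumptions, not figures from this post:

```python
# Payback estimate for moving a tuned vLLM fleet to TensorRT-LLM.
# All inputs are assumptions; replace them with your own numbers.

def payback_days(gpu_count, gpu_hour_cost, throughput_uplift, engineering_cost):
    # A throughput uplift of X means you need roughly X/(1+X) fewer
    # GPU-hours for the same token volume.
    hourly_saving = gpu_count * gpu_hour_cost * (
        throughput_uplift / (1 + throughput_uplift)
    )
    return engineering_cost / (hourly_saving * 24)

# 16 GPUs at $2/hr, a 20% uplift, $10k of engineering time:
days = payback_days(16, 2.0, 0.20, 10_000)  # ~78 days, inside a quarter
```

At 8 GPUs the same math lands near 156 days, which is why the threshold in this post sits around 16.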
The build cost is real. TensorRT-LLM compiles a model into a per-config engine: you pin the model, dtype, max batch size, max sequence length, and tensor-parallel degree, and you get an optimized binary. Change any of those and the engine has to be rebuilt. Compile times run from minutes to hours depending on model size. For a multi-tenant platform serving many models, or one rapidly iterating on configurations, the rebuild loop is operationally painful.
Pair it with NVIDIA's Dynamo for disaggregated prefill/decode; if your workload is heavily prefill-dominated (long prompts, short generations), the gain compounds.
When to pick SGLang
Your workload is structured generation: agent loops, JSON-constrained outputs, RAG with shared system prompts, multi-turn chat with growing context. SGLang's RadixAttention is the best prefix-cache implementation in the open-source ecosystem: published cache-hit rates of 80–90% on workloads with shared prefixes, with corresponding 2–5× wall-clock reductions on prefill.
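To build intuition for why shared prefixes matter, here is a toy trie-based sketch of the prefix-reuse idea behind RadixAttention. This illustrates the concept only; SGLang's real implementation manages paged KV-cache memory on the GPU, not Python dicts:

```python
# Toy sketch: store previously-seen token sequences in a trie and, for a
# new request, count how many leading tokens can be served from cache
# instead of recomputed during prefill.

class PrefixCache:
    def __init__(self):
        self.root = {}

    def insert(self, tokens):
        node = self.root
        for t in tokens:
            node = node.setdefault(t, {})

    def longest_cached_prefix(self, tokens):
        node, hits = self.root, 0
        for t in tokens:
            if t not in node:
                break
            node = node[t]
            hits += 1
        return hits

cache = PrefixCache()
system_prompt = list("You are a helpful assistant. ")
cache.insert(system_prompt + list("Question A"))

# A second request sharing the system prompt reuses its whole prefix,
# so only the divergent tail needs prefill compute:
hits = cache.longest_cached_prefix(system_prompt + list("Question B"))
```

With a long shared system prompt and short per-request tails, the cached fraction approaches the 80–90% hit rates quoted above.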
SGLang also has the cleanest constrained-decoding path. If you are building anything that pushes structured outputs through a JSON schema or a regex grammar, SGLang's native support is significantly faster than bolt-on libraries layered over vLLM.
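A toy sketch of what grammar-constrained decoding does under the hood: at every step, the grammar masks the model's choices down to characters that can still complete a valid output. The two-candidate "grammar" here is a stand-in; real engines compile a JSON schema or regex into a token-level mask:

```python
# Tiny "grammar": the set of strings the output is allowed to be.
CANDIDATES = ['{"ok": true}', '{"ok": false}']

def allowed_next_chars(prefix):
    # Characters that keep the output a prefix of some accepted string.
    return {c[len(prefix)] for c in CANDIDATES
            if c.startswith(prefix) and len(c) > len(prefix)}

def constrained_decode(model_preference):
    """model_preference ranks characters; the grammar mask overrides it."""
    out = ""
    while allowed_next_chars(out):
        allowed = allowed_next_chars(out)
        # Pick the model's favorite character among those the grammar allows.
        out += min(allowed, key=model_preference)
    return out

# A "model" that loves the letter 'f' still emits valid JSON:
result = constrained_decode(lambda c: (c != "f", c))
```

The output is guaranteed parseable by construction, which is why native support beats post-hoc validation and retry loops.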
Trade-off: SGLang has a smaller community and a slower OSS cycle than vLLM. Production gaps occasionally show up (autoscaling integrations, observability, model coverage on edge cases). We use it where the workload demands it, not everywhere vLLM would do.
The decision tree we actually use
- Are you serving one model at < 16 GPU scale and unsure of long-term traffic? → vLLM. Stop here.
- Is your workload structured-output heavy (JSON, agent loops, > 40% prefix-shareable)? → SGLang.
- Do you have one model at 16+ GPUs, NVIDIA-only, with stable config? → benchmark TensorRT-LLM against your vLLM baseline. Only switch if the throughput delta > 15% AND the rebuild cost is amortizable in < 90 days.
- Are you mixing prefill-heavy and decode-heavy traffic on the same GPUs? → consider disaggregated prefill/decode (Dynamo + TensorRT-LLM, or llm-d). The gain only justifies the complexity at scale.
- Multi-tenant SaaS serving many models with rapid iteration? → vLLM. The TRT compile loop will eat your week.
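The tree above can be encoded as a function. The thresholds (16 GPUs, 40% prefix share, 15% delta, 90 days) come straight from the list; everything else is a property of your workload you must measure. A sketch, not a policy engine:

```python
def pick_runtime(gpus, structured_share, nvidia_only, stable_config,
                 trt_delta=None, amortize_days=None):
    """Recommend a runtime for one model's serving workload.

    structured_share: fraction of traffic that is structured / prefix-shareable.
    trt_delta: measured TensorRT-LLM throughput uplift vs vLLM (e.g. 0.2).
    amortize_days: days for that uplift to pay back the engine-build work.
    """
    if structured_share > 0.40:
        return "SGLang"
    if gpus >= 16 and nvidia_only and stable_config:
        if trt_delta is None:
            return "benchmark TensorRT-LLM vs vLLM first"
        if trt_delta > 0.15 and (amortize_days or 10**9) < 90:
            return "TensorRT-LLM"
    return "vLLM"

# Small fleet, mostly free-form traffic, config still in flux:
recommendation = pick_runtime(gpus=8, structured_share=0.10,
                              nvidia_only=True, stable_config=False)  # "vLLM"
```

Note the ordering: the structured-output branch fires before the scale branch, mirroring the list above.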
What the benchmarks actually say
Public numbers are noisy because everyone benchmarks different model sizes, batch shapes, and prompt distributions. The honest summary as of mid-2026:
- Llama-3.3-70B FP8 on 8× H100 SXM, server scenario: TensorRT-LLM ~12,000 tok/s peak; vLLM ~10,200 tok/s peak; SGLang ~10,500 tok/s peak.
- Same hardware, mixed-prefix RAG workload: SGLang wins by ~2× on prefill due to RadixAttention; the others catch up only with hand-tuned cache configs.
- Mistral-Small-24B on 1× H200, latency-sensitive single-stream: all three within 5%; vLLM's shorter cold start usually wins overall.
- Build/iterate cycle on a developer laptop config change: vLLM ~2 minutes, SGLang ~3 minutes, TensorRT-LLM 20 min – 2 hours.
Numbers are conservative midpoints from public vLLM, NVIDIA, and SGLang benchmark blogs; your hardware and workload will land somewhere in a band. We rerun on customer traffic during the optimization engagement; that is what produces the number you actually care about.
What we recommend most often
For 80% of the migrations we run, the answer is: start on vLLM, run it for 8 weeks, then revisit. Once you know your real prompt distribution, real cache hit rate, real GPU saturation, and real latency budget, the "should we move to TRT-LLM" question answers itself with concrete numbers instead of vendor copy.
Do not pick a runtime first and then build a workload around it. Pick the workload, measure it, then pick the runtime.