When custom kernels are worth it (and when they're not)
A decision tree we use internally. Mostly: don't, until you've checked these six things.
Custom CUDA or Triton kernel work is the most senior engineering most inference shops never do. It is also where consultancies oversell. We are guilty of it ourselves: when a customer is paying for "deep" expertise, the incentive to write a custom kernel is structural, even when the gain is small.
The honest answer is that custom kernels are worth it for a small number of well-defined situations, and not worth it for most of what gets pitched as "we will write you a custom kernel."
# The six checks
Before recommending custom kernel work to a customer, we run through six questions:
- Is the operation hot? If the kernel in question is < 5% of total wall-clock time, no kernel work is going to move your numbers measurably. Profile first; a minimal profiler sketch follows this list.
- Is there an off-the-shelf kernel that fits? FlashAttention-4, vLLM's paged-attention kernels, NVIDIA's own fused MoE kernels: the open ecosystem covers more than people realize. Custom is for what is not covered.
- Is the operation shape stable? Custom kernels work well when the input shapes are predictable. If your sequence length distribution has a fat tail, a one-size-fits-all kernel ends up suboptimal at the tails.
- Will the kernel survive a model change? If you write a custom kernel for Llama-3.3 attention and then move to Llama-4, can you keep the kernel? If not, the engineering investment depreciates fast.
- Do you have the engineering depth to maintain it? A custom kernel is not a write-once artifact. It needs to be updated for new GPU architectures, debugged when fused ops produce numerical issues, and re-benchmarked when neighboring kernels change.
- Is the speedup big enough to justify the maintenance cost? A 1.2× kernel-level speedup that turns into a 1.05× end-to-end speedup is rarely worth the ongoing engineering attention. We look for 2× or larger end-to-end before recommending custom work; the arithmetic after this list shows why kernel-level numbers flatter the end-to-end ones.
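For the first check, a minimal profiling sketch using PyTorch's built-in profiler. `model` and `batch` are hypothetical stand-ins for your actual serving path:

```python
import torch
from torch.profiler import profile, ProfilerActivity

# `model` and `batch` are placeholders for your real serving path.
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for _ in range(20):       # run a few steps so per-kernel shares stabilize
            model(batch)
    torch.cuda.synchronize()

# Sort by GPU time. If your candidate op sits below ~5% here, stop.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=25))
```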
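For the sixth check, the arithmetic behind the 1.2× → 1.05× claim is just Amdahl's law: if the kernel is fraction p of wall-clock time and you speed it up by s, the end-to-end speedup is 1 / ((1 - p) + p / s).

```python
def end_to_end_speedup(p: float, s: float) -> float:
    """Amdahl's law: p = fraction of wall-clock time spent in the kernel,
    s = kernel-level speedup."""
    return 1.0 / ((1.0 - p) + p / s)

print(end_to_end_speedup(0.05, 100.0))  # ~1.05x: even a 100x win on a 5% kernel barely registers
print(end_to_end_speedup(0.30, 1.2))    # ~1.05x: the 1.2x kernel from check six
print(end_to_end_speedup(0.60, 4.0))    # ~1.82x: a genuinely hot kernel with a big win
```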
If at least four of these check out, custom kernel work might be the right call. If only two or three, we redirect the engineering investment elsewhere.
# Cases where custom kernels were worth it
Sparse MoE routing. Hardware-vendor kernels for MoE routing assume a uniform distribution of routing decisions. One customer's workload had a heavy long-tail bias: a few experts were saturated, most were idle. A custom routing kernel that handled the imbalance produced a 2.3× MoE-step speedup and a 1.4× end-to-end speedup.
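Spotting that kind of imbalance is cheap and worth doing before any kernel work. A minimal sketch; the helper name and the top-2 routing are our assumptions, not from any library:

```python
import torch

def expert_load_imbalance(router_logits: torch.Tensor, top_k: int = 2) -> float:
    """Ratio of the hottest expert's load to the mean load.
    ~1.0 means the uniform assumption behind stock MoE kernels holds;
    much larger values mean it does not.
    router_logits: (num_tokens, num_experts) raw router outputs."""
    num_experts = router_logits.shape[-1]
    topk_experts = router_logits.topk(top_k, dim=-1).indices   # (num_tokens, top_k)
    loads = torch.bincount(topk_experts.flatten(), minlength=num_experts).float()
    return (loads.max() / loads.mean().clamp(min=1e-6)).item()
```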
Custom KV cache layout for long-context. A customer running long-context (32K+ tokens) on a workload where most queries shared a long system prompt was duplicating a lot of KV cache work. We wrote a paged-attention kernel that exploited the sharing pattern. End-to-end speedup: 2.8× on that specific workload.
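The sharing pattern itself is simple to state in code. A minimal sketch, assuming a block-granular paged KV cache; the class and method names are ours, not from vLLM or any other runtime:

```python
from hashlib import blake2b

class PrefixKVIndex:
    """Maps a token-id prefix (e.g. a shared system prompt) to the ids of
    KV cache blocks already computed for it, so later requests with the
    same prefix skip recomputing that KV entirely."""

    def __init__(self) -> None:
        self._blocks: dict[bytes, list[int]] = {}

    def _key(self, prefix_token_ids: list[int]) -> bytes:
        raw = b",".join(str(t).encode() for t in prefix_token_ids)
        return blake2b(raw, digest_size=16).digest()

    def lookup(self, prefix_token_ids: list[int]) -> list[int] | None:
        return self._blocks.get(self._key(prefix_token_ids))

    def insert(self, prefix_token_ids: list[int], block_ids: list[int]) -> None:
        self._blocks[self._key(prefix_token_ids)] = block_ids
```

The kernel's job is then to read those shared blocks in place: the prefix's KV is computed once and read many times.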
Quantization-aware fused ops. Per-channel INT8 quantization with the dequant fused into the matmul. Available in TensorRT-LLM but not, at the time, in the customer's preferred runtime. We wrote it in Triton in about six weeks. End-to-end speedup: 1.9×.
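Not the customer's kernel, but a stripped-down Triton sketch of the shape of the idea: an INT8 matmul that applies per-output-channel scales to the INT32 accumulator, so the weights never round-trip through memory as FP16.

```python
import triton
import triton.language as tl

@triton.jit
def int8_matmul_dequant_kernel(
    a_ptr, b_ptr, c_ptr, scale_ptr,   # A: (M, K) int8, B: (K, N) int8, scales: (N,) fp32
    M, N, K,
    stride_am, stride_ak,
    stride_bk, stride_bn,
    stride_cm, stride_cn,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr,
):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.int32)
    for k in range(0, K, BLOCK_K):
        a_mask = (offs_m[:, None] < M) & ((k + offs_k)[None, :] < K)
        b_mask = ((k + offs_k)[:, None] < K) & (offs_n[None, :] < N)
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + (k + offs_k)[None, :] * stride_ak,
                    mask=a_mask, other=0)
        b = tl.load(b_ptr + (k + offs_k)[:, None] * stride_bk + offs_n[None, :] * stride_bn,
                    mask=b_mask, other=0)
        acc += tl.dot(a, b)            # int8 x int8, accumulated in int32

    # The fusion that pays: per-output-channel dequant applied to the int32
    # accumulator, so the int8 weights never materialize in memory as fp16.
    scales = tl.load(scale_ptr + offs_n, mask=offs_n < N, other=0.0)
    c = acc.to(tl.float32) * scales[None, :]
    c_mask = (offs_m[:, None] < M) & (offs_n[None, :] < N)
    tl.store(c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn,
             c, mask=c_mask)
```

Grid launch, autotuning, and the activation-quantization path are omitted; filling those in for every GPU generation is exactly the maintenance cost check five warns about.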
# Cases where custom kernels were not worth it
Custom attention variant for a niche application. The customer wanted us to implement a sliding-window-with-skip-connection attention variant for a long-document workload. We implemented it, it worked, and the end-to-end speedup over FlashAttention-4 was 1.08×. Six weeks of work for an 8% gain on a single product surface. We told them not to ship it.
Fused FFN for a model the customer was about to deprecate. The customer was migrating from a custom in-house architecture to Llama-3.3 within the next quarter. Writing a kernel for the in-house model would have produced a ~1.5× speedup that would have lasted 8 weeks before becoming irrelevant.
Generic FlashAttention rewrite. "Could you write us a faster FlashAttention?" The answer is no. Tri Dao and the maintainers have hand-tuned that kernel against every modern GPU architecture. Beating it requires either a niche use case (see above) or a compute-cost-no-object research effort, and a consulting engagement is the right vehicle for neither.
# What we suggest instead, when kernel work fails the checks
The interventions that produce more value than custom kernels for most workloads:
- Better batching policies. Continuous batching is built into vLLM, TensorRT-LLM, and SGLang. Tuning the batch policy for your specific traffic shape often matches what kernel work would have given (see the config sketch after this list).
- Better scheduling. KV-aware request routing across GPUs, prefix-cache-aware scheduling, mixed prefill/decode disaggregation.
- Better quantization. Going from FP16 to FP8 or INT8 is a real 1.4–1.8× speedup with off-the-shelf tooling and no kernel writing.
- Better topology. NVLink vs InfiniBand vs RoCE for tensor-parallel shards has a larger impact than most kernel work.
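As a concrete example of the batching and quantization items, this is roughly what the tuning surface looks like in vLLM. The knob values are placeholders to sweep against your own traffic, not recommendations:

```python
from vllm import LLM

# All values below are hypothetical starting points, not recommendations.
llm = LLM(
    model="meta-llama/Llama-3.3-70B-Instruct",  # stand-in model name
    quantization="fp8",             # off-the-shelf quantization: no kernel writing
    enable_prefix_caching=True,     # prefix-cache reuse for shared system prompts
    max_num_seqs=256,               # batch-policy knob: concurrent sequences per step
    max_num_batched_tokens=8192,    # batch-policy knob: token budget per step
)
```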
# When we offer it as a service
The custom kernel engagement is one we explicitly time-box and price. 3–10 weeks, fixed-fee, with a measurable speedup target written into the contract. If we cannot beat the off-the-shelf alternative by 2× end-to-end on the agreed benchmark, the engagement converts to a refund minus the audit cost.
This is not us being heroic. It is us being honest that custom kernel work has a real failure mode (the speedup does not show up) and the customer should not bear the cost when it does.
# The summary
Custom kernels are real, valuable engineering. They are also one of the easiest things in inference to oversell. The six checks above are how we keep ourselves honest about which engagements deserve them. If you are evaluating someone else's kernel pitch, the checks work just as well in the other direction.