Pricing transparency: what we charge and why
Our pricing model, plus the assumptions baked into each engagement type. Posted publicly because procurement teams should not have to extract this on a call.
Most consultancy websites are deliberately vague about pricing. The reasoning is usually that pricing depends on the engagement and a generic number would mislead. We agree with the principle. We disagree with the conclusion. The honest answer is to publish bands, with the assumptions visible, and discuss specifics on the discovery call.
Below is what we charge, with the reasoning. None of this is final, because your number depends on the work, but the bands are accurate within ±20% for engagements we have shipped.
#The cost audit
Fixed-fee in the low-five-figures USD. One week of engineering work. Credited against any engagement that follows.
What you get: a 1–2 page written report (see /blog/cost-audit-one-pager for the format) plus a 30-minute walkthrough call to discuss it. Includes the volume measurement, the modeled self-hosted cost, the savings band, the migration cost estimate, and a direct PROCEED / WAIT / DECLINE recommendation.
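As a rough illustration of how the audit's numbers could feed the recommendation, here is a minimal sketch. The field names, the 20% modeling uncertainty, and the 12-month payback threshold are all hypothetical values chosen for the example. They are not the thresholds used in an actual audit.

```python
from dataclasses import dataclass


@dataclass
class AuditInputs:
    # All three fields are hypothetical names for the audit's measured and modeled figures.
    monthly_api_spend_usd: float    # derived from the volume measurement
    monthly_self_hosted_usd: float  # modeled self-hosted cost
    migration_cost_usd: float       # one-off migration cost estimate


def savings_band(inputs: AuditInputs, uncertainty: float = 0.20) -> tuple[float, float]:
    """Return (low, high) monthly savings.

    The modeled self-hosted cost is widened by +/- `uncertainty`
    (an assumed figure) to turn a point estimate into a band.
    """
    low = inputs.monthly_api_spend_usd - inputs.monthly_self_hosted_usd * (1 + uncertainty)
    high = inputs.monthly_api_spend_usd - inputs.monthly_self_hosted_usd * (1 - uncertainty)
    return low, high


def recommend(inputs: AuditInputs, max_payback_months: float = 12.0) -> str:
    """Map the savings band and migration cost to PROCEED / WAIT / DECLINE.

    The decision rule is an assumption made for this sketch:
    - DECLINE if even the optimistic end of the band saves nothing.
    - PROCEED if the pessimistic end pays back the migration in time.
    - WAIT otherwise.
    """
    low, high = savings_band(inputs)
    if high <= 0:
        return "DECLINE"
    if low > 0 and inputs.migration_cost_usd / low <= max_payback_months:
        return "PROCEED"
    return "WAIT"


if __name__ == "__main__":
    example = AuditInputs(
        monthly_api_spend_usd=100_000,
        monthly_self_hosted_usd=40_000,
        migration_cost_usd=300_000,
    )
    print(savings_band(example), recommend(example))
```

Using the pessimistic end of the band for the payback test is a deliberately conservative choice in this sketch: a migration only gets a PROCEED if it still pays off when the modeled cost turns out to be too low.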
Why this exists: it removes the largest source of waste in early-stage consulting engagements, which is a customer paying for a migration that should not happen, or skipping one that should. The audit is the gate.
#Inference migration
Fixed-fee in the mid-to-high five-figures USD, depending on scope. 6–14 weeks. Includes the runtime swap, the eval-parity work, the cutover plan, the production rollout, and the post-cutover handover. 30 days of post-engagement support included.
Cost band drivers:
- Number of models being migrated (cost scales sublinearly: a second model adds cost but does not double it)
- Whether eval infrastructure exists or has to be built
- Whether the customer has an existing GPU footprint or we are sizing from scratch
- Compliance constraints (HIPAA, EU residency, air-gapped)
- Whether rollback infrastructure exists
What pushes price up: novel architectures, a need for custom kernels, multi-region deployment from scratch, an eval suite that needs to be built rather than extended.
What pushes price down: existing self-hosted infrastructure, a mature CI/CD path, a clean baseline eval suite, a single model on standard NVIDIA hardware.
#Inference optimization
Fixed-fee in the mid-five-figures USD. 4–8 weeks. For teams that are already self-hosted but the cost or latency is wrong.
Includes profiling, runtime selection, a quantization study, batching strategy, and deployment of the optimization to production. We commit to a target (p95 latency, tokens/sec, $/M tokens) and ship until we hit it. If we cannot, we refund part of the fee.
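For readers who want to pin down what those three target metrics mean, here is a minimal sketch of how each could be computed from benchmark data. The function names, the nearest-rank percentile method, and the GPU-hourly cost model are assumptions made for illustration. A real engagement would agree the measurement method up front.

```python
def p95_latency_ms(latencies_ms: list[float]) -> float:
    """95th-percentile latency using the nearest-rank method.

    Nearest-rank is one of several percentile definitions; it is used
    here because it always returns an observed value.
    """
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def tokens_per_second(total_tokens: int, wall_clock_seconds: float) -> float:
    """Aggregate throughput over a benchmark window."""
    return total_tokens / wall_clock_seconds


def usd_per_million_tokens(gpu_hourly_usd: float, gpu_count: int, tokens_per_sec: float) -> float:
    """Serving cost per million tokens.

    Assumes GPU rental is the only cost and that throughput is
    sustained for the whole hour, which overstates real utilization.
    """
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_hourly_usd * gpu_count / tokens_per_hour * 1_000_000


if __name__ == "__main__":
    samples = [float(ms) for ms in range(1, 101)]  # synthetic latencies, 1..100 ms
    throughput = tokens_per_second(total_tokens=1_200_000, wall_clock_seconds=600)
    print(p95_latency_ms(samples))
    print(throughput)
    print(usd_per_million_tokens(gpu_hourly_usd=2.0, gpu_count=4, tokens_per_sec=throughput))
```

The three numbers trade off against each other: larger batches usually raise tokens/sec and lower $/M tokens while pushing p95 latency up, which is why a target normally names the metric that matters most and treats the others as constraints.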
#Custom kernels
Quoted per-engagement based on the kernel scope. Typically high-five-figures to low-six-figures USD for 3–10 weeks.
Comes with the six-checks decision (see /blog/custom-kernels-worth-it). If the kernel work fails the checks, we will tell you on the discovery call and not propose the engagement. Most kernel pitches we hear do not pass the checks.
#Hardware advisory
Fixed-fee in the low-five-figures USD. 2–3 weeks. What to buy, when to rent, when to colo. Vendor-agnostic: we have no resale agreements with any GPU vendor or cloud.
Output: a written workload sizing, a SKU comparison, a power and networking spec, and a procurement timeline. Suitable for procurement and finance to act on.
#SRE bridge
Fixed-fee monthly in the mid-five-figures USD, with a hard 12-week cap. Three-month bridge contracts only. We share on-call rotation while your team learns the system, then we leave on schedule.
This is the engagement we accept the fewest of, because the success criterion is your team being able to operate the system without us. We do not extend these.
#What we do not do
- Revenue share or percentage-of-savings billing. The incentive is misaligned. We do better work on a fixed fee.
- Token markup or hosted-platform billing. We do not run hosted infrastructure for clients. Your bill is for the engineering, not the inference.
- Long-term retainers. We are a project shop. Your team should be able to operate what we build after we leave.
- Resale. We have no agreements with NVIDIA, AMD, RunPod, Lambda, CoreWeave, AWS, GCP, Azure, or any other vendor. We pick what fits the workload and we tell you when we have a preference and why.
#Why fixed-fee
Time-and-materials billing creates an incentive to be slow. Revenue-share creates an incentive to oversell. Fixed-fee with a clear scope creates an incentive to ship and walk away, which is what most clients actually want.
The asymmetric outcome is that we occasionally underbid the work and absorb the difference. That is on us. It happens. The alternative (bid generously, charge by the hour, slow-roll) is worse for the client.
#The discovery call
The discovery call is free, and we use it to figure out three things: is your problem in our scope, is the math worth it, and do we have capacity. If the answer to all three is yes, the next step is the cost audit. If the answer to any of them is no, we will say so on the call, and refer you to someone better suited if we know one.
Most discovery calls end with a clear next step within 30 minutes. When one does not, it is usually because the customer is still in early discovery and the migration question is not yet specific enough to scope. That is fine; we tell them and follow up in a quarter.