FastPriors field notes · 7 min

The cost-audit one-pager we send to every prospect

Open-sourcing our audit format. Use it. Steal it. Send it back to us.

March 30, 2026 · by Abhimanyu Singh

Every fastpriors engagement starts with a written cost audit. The audit is a single page (sometimes two) that answers a small number of questions in numbers, not opinions. We are publishing the format because it works, and because the AI infra industry has too much vendor copy and not enough decision-grade artefacts.

You can use this as-is to evaluate your own situation before you talk to us, or any other consultancy.

#The questions the audit answers

  1. What does the customer pay today? Per-month total, per-model breakdown, per-1M-tokens normalized.
  2. What would they pay self-hosted? Modeled per-1M-tokens cost on their workload, on their preferred GPU tier, at realistic utilization.
  3. What is the savings band? Best case (high utilization, optimized stack), expected case (mid utilization, untuned), and risk case (low utilization, ops overhead).
  4. What is the migration cost? Engineering effort in weeks, capex if any, ramp risk.
  5. What is the payback period? Months until the savings exceed the migration cost.
  6. Should they do it? A direct recommendation. Sometimes the answer is no, and we say so.

#What the one-pager looks like

Two columns, ten data points each. No prose paragraphs. Numbers with the assumptions visible. One graph if the data warrants it; usually it does not.

fastpriors cost audit, Acme, Inc.   2026-04-12

current state                       projected state
─────────────────────────────────────────────────────
provider:    OpenAI gpt-4.1-mini    runtime:    vLLM, FP8
volume:      4.2B output tok / mo   model:      Llama-3.3-70B
$ / 1M out:  $1.60                  hardware:   8× H100 SXM (RunPod)
monthly:     $6,720                 utilization: 65% modeled
input cost:  $1,820                 $ / 1M out:  $0.41 modeled
TOTAL:       $8,540 / mo            ops overhead: +25%
                                    monthly:    ~$2,200

savings band                        migration
─────────────────────────────────────────────────────
expected:    -74% / mo              engineering: 8 weeks
worst case:  -52% / mo              eval parity: 95% CI < 1.5% delta
best case:   -82% / mo              prod risk:   shadow + 30-day fallback
annual:      ~$76K                  one-time:    $XX,XXX (audit-refundable)
                                    payback:    ~3.5 months

recommendation: PROCEED
─────────────────────────────────────────────────────
the math justifies the migration. throughput-shaped traffic
(consistent through-day) and shareable system prompts make
this an above-median candidate. expected outcome: 60-75% per-
token cost reduction at unchanged eval parity.

caveats:
- tone-sensitive customer support use case requires mixed-
  precision FP8 (see fp8-inference-h100 writeup).
- 6-week shadow phase, no user-facing flip until eval parity
  signed off in writing.
- openai fallback retained for 30 days post-cutover.
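If you want to sanity-check the one-pager, the arithmetic reduces to a few lines. A minimal sketch using the Acme numbers above (we apply the 25% ops overhead to the modeled compute cost; the one-time migration figure is elided on the page, so payback is left as a function of it):

```python
# Check the Acme audit arithmetic from the one-pager above.
OUT_TOKENS_PER_MO = 4.2e9    # 4.2B output tokens / month
HOSTED_PER_1M = 1.60         # $ / 1M output tokens, gpt-4.1-mini
HOSTED_INPUT = 1820.0        # $ / month on the input side
SELF_PER_1M = 0.41           # modeled $ / 1M output tokens, self-hosted
OPS_OVERHEAD = 0.25          # +25% for an established Kubernetes team

hosted_out = OUT_TOKENS_PER_MO / 1e6 * HOSTED_PER_1M    # $6,720 / mo
hosted_total = hosted_out + HOSTED_INPUT                # $8,540 / mo

# Modeled compute cost plus ops overhead: 2,152.5, quoted as ~$2,200.
self_total = OUT_TOKENS_PER_MO / 1e6 * SELF_PER_1M * (1 + OPS_OVERHEAD)

savings_pct = 1 - self_total / hosted_total   # ~0.75; -74% off the rounded ~$2,200
annual = (hosted_total - self_total) * 12     # ~$76.6K, quoted as ~$76K

def payback_months(one_time_cost: float) -> float:
    # Payback = one-time migration cost / monthly savings.
    # The one-pager's one-time figure is deliberately elided here.
    return one_time_cost / (hosted_total - self_total)
```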

#What is in the rows, in detail

Volume, measured directly from the customer's usage logs, not estimated from a survey. We need a representative 7-day sample minimum, ideally a 30-day sample.

$ / 1M tokens, separate input and output prices for the hosted model in use. For self-hosted, the per-token rate is computed as (GPU-hour cost × stacks needed) ÷ (achievable tokens per hour).
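That formula can be sketched directly. The $2.50/GPU-hour rate and the 75M-output-tokens-per-hour peak throughput below are illustrative placeholders, not benchmark numbers:

```python
def self_hosted_cost_per_1m(gpu_hour_cost: float, gpus_per_stack: int,
                            stacks: int, peak_tokens_per_hour: float,
                            utilization: float) -> float:
    """Modeled $ per 1M output tokens for a self-hosted stack.

    peak_tokens_per_hour is the throughput of ONE stack at full load;
    utilization derates it to what the traffic shape actually achieves.
    """
    hourly_spend = gpu_hour_cost * gpus_per_stack * stacks
    achieved = peak_tokens_per_hour * stacks * utilization
    return hourly_spend / achieved * 1e6

# Illustrative: one 8x H100 stack at $2.50/GPU-hr, ~75M output
# tok/hr at peak, derated to the 65% modeled utilization.
rate = self_hosted_cost_per_1m(2.50, 8, 1, 75e6, 0.65)  # ~$0.41 / 1M
```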

Achievable utilization, pulled from public vLLM/TensorRT-LLM benchmarks for the candidate hardware, derated based on the customer's traffic shape. Continuous traffic gets a higher number; spiky traffic gets a lower one.
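One simple way to model the derate, as a sketch (this headroom-over-peak scheme is an illustration, not our production model): capacity has to cover the peak hour plus some headroom, so achieved utilization is mean load over provisioned capacity.

```python
def modeled_utilization(hourly_volumes: list[float],
                        headroom: float = 0.15) -> float:
    """Derate utilization from traffic shape: provision for the peak
    hour plus headroom, then utilization = mean load / provisioned."""
    provisioned = max(hourly_volumes) * (1 + headroom)
    mean = sum(hourly_volumes) / len(hourly_volumes)
    return mean / provisioned

flat = [100.0] * 24                # consistent through-day
spiky = [20.0] * 16 + [300.0] * 8  # evening-heavy

u_flat = modeled_utilization(flat)    # ~0.87
u_spiky = modeled_utilization(spiky)  # ~0.33
```

Continuous traffic models out near the top of the band; spiky traffic pays for idle peak capacity, which is exactly why it gets the lower number.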

Ops overhead, flat 25% for an established team that already runs Kubernetes; 35% for a team starting from scratch. This covers monitoring, on-call, eval drift detection, and hardware failure response.

Eval parity tolerance, agreed before the audit ships. For most customers it is <1.5% per-task delta with 95% CI on their existing eval suite. We will recommend changes to the eval suite if it is too thin to detect realistic regressions.
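A parity gate in that shape might look like the following sketch. It assumes per-task pass/fail counts and an unpaired normal-approximation CI (a paired test on the same items is tighter); the item counts are made up, and they also show why a thin suite cannot sign off a 1.5% tolerance:

```python
import math

def parity_holds(base_correct: int, cand_correct: int, n: int,
                 tolerance: float = 0.015, z: float = 1.96) -> bool:
    """True if the candidate model's regression vs. the baseline is
    within tolerance at ~95% confidence: the upper bound of the CI
    on the delta must sit inside the tolerance."""
    p_base = base_correct / n
    p_cand = cand_correct / n
    delta = p_base - p_cand  # positive = candidate regressed
    se = math.sqrt(p_base * (1 - p_base) / n + p_cand * (1 - p_cand) / n)
    return delta + z * se < tolerance

# A 2,000-item suite cannot resolve a 1.5% tolerance even when the
# observed gap is only 0.4% (91.0% vs 90.6%) -- the suite is too thin:
parity_holds(1820, 1812, 2000)      # False
# At 20,000 items the same observed gap clears the gate:
parity_holds(18200, 18120, 20000)   # True
```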

Prod risk, what we will do during the migration to keep the user-facing product unaffected. Shadow traffic is the default; for some workloads we replay archived traffic instead.

#What "recommendation" can say

  • PROCEED, savings clear the threshold, no structural blockers.
  • PROCEED, scoped, savings clear the threshold for part of the workload only. Hybrid recommended.
  • WAIT, savings are real but the threshold is too close. Revisit when volume grows or tooling matures.
  • DECLINE, savings do not justify migration. Stay hosted. We have written this and walked away from engagements.
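The four outcomes reduce to a small decision function. The 30% savings floor and 12-month payback ceiling below are illustrative thresholds, not our actual cutoffs:

```python
def recommend(expected_savings_pct: float, payback_months: float,
              scoped_only: bool = False) -> str:
    """Map audit numbers to one of the four recommendation outcomes.
    Thresholds here are illustrative, not fastpriors' real cutoffs."""
    if expected_savings_pct < 0.30 or payback_months > 12:
        return "DECLINE"          # savings do not justify migration
    if expected_savings_pct < 0.40:
        return "WAIT"             # real but too close to the threshold
    if scoped_only:
        return "PROCEED, scoped"  # hybrid: migrate part of the workload
    return "PROCEED"

recommend(0.74, 3.5)  # the Acme numbers clear comfortably
```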

The DECLINE outcome happens roughly 1 in 6 audits. We charge for the audit either way. That is not us being mercenary; it is what makes the audit worth reading. If we only ever said PROCEED, the recommendation would carry no information.

#Why we ship this format

A two-page deck would have more graphs and less signal. A 30-page report would have more reasoning and not be read. The one-page format forces us to be specific: every cell on the page is a load-bearing number that we have to defend. The customer reads it in five minutes, sends it to their CFO, and gets a decision back the same day.

If you want one of these for your workload, we run them weekly. If you want to run one yourself first, the format above is yours to use.

Want this on your stack?

The cost audit lands at a number, not an opinion. Refundable.

Talk to an engineer → · Try the calculator