resources
Benchmarks. Playbooks. Receipts.
Everything we've learned, written down. Including the things that didn't work, with enough detail that you can avoid making the same mistakes.
in preparation
Field notes.
Benchmarks and playbooks we're packaging up for public release. Need one before it ships? Tell us which.
benchmark · Q3 2025
Llama 3.1 70B inference: hosted vs self-hosted on 8× H100
benchmark · Q3 2025
FP8 vs INT8 quantization across 14 production workloads
writeup · Q2 2025
Speculative decoding in practice: when 4× speedups vanish
playbook · Q2 2025
The migration runbook: hosted API → owned VPC, step by step
tool · github
fp-eval: an open eval-parity harness for migrations
calculator · interactive
Hosted-vs-owned spend calculator (with assumptions)
tool · github
kv-layout: KV cache layout explorer for custom kernels
writeup · Q4 2025
What we learned from 14 production migrations in 2025
playbook · Q3 2025
Sub-50ms RAG: a latency budget worked example
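To give a flavor of the comparison the spend calculator makes, here is a minimal sketch. All function names, prices, and token volumes are illustrative assumptions, not the calculator's actual model, which ships with its own documented assumptions.

```python
def hosted_monthly_cost(tokens_per_month, price_per_million_tokens):
    """Hosted API spend: you pay per token served."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens

def owned_monthly_cost(gpu_count, gpu_hourly_rate):
    """Owned/reserved GPU spend: you pay for the hardware
    around the clock, whether or not it is busy."""
    hours_per_month = 24 * 30
    return gpu_count * gpu_hourly_rate * hours_per_month

# Illustrative assumptions: 5B tokens/month at $3 per 1M tokens hosted,
# versus 8 reserved GPUs at $2.50/hr.
hosted = hosted_monthly_cost(5_000_000_000, 3.0)
owned = owned_monthly_cost(8, 2.50)
print(f"hosted ${hosted:,.0f}/mo vs owned ${owned:,.0f}/mo")
```

The real decision also hinges on utilization and engineering time, which is exactly why the calculator publishes its assumptions alongside the numbers.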
methodology
How we measure.
Every benchmark on this site comes with assumptions, hardware, and a reproducible script.
Real workloads, real traffic
Numbers from actual production rollouts, sampled across the day. Not synthetic stress tests.
Eval parity before perf
We don't quote a speedup unless the model passes the same eval suite as before, within the agreed tolerance.
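The gate described above amounts to a per-metric tolerance check. A minimal sketch, where the function name, metric names, and the 0.005 tolerance are illustrative assumptions rather than our actual harness:

```python
def passes_eval_parity(baseline_scores, candidate_scores, tolerance=0.005):
    """Return True only if the candidate model matches the baseline
    on every eval metric, within the agreed absolute tolerance."""
    return all(
        candidate_scores[metric] >= baseline_scores[metric] - tolerance
        for metric in baseline_scores
    )

baseline = {"mmlu": 0.792, "gsm8k": 0.881}
after_quantization = {"mmlu": 0.790, "gsm8k": 0.879}
print(passes_eval_parity(baseline, after_quantization))  # True: both drops < 0.005
```

Only once this gate passes do we measure and quote the speedup.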
Hardware disclosed
Every benchmark says exactly which SKU, how many, what interconnect, and what the host OS was running.
Failures included
We publish the cases where it didn't work, too. The interesting margin is in what doesn't optimize.