resources
Benchmarks. Playbooks. Receipts.
Everything we've learned, written down. Including the things that didn't work, with enough detail that you can avoid making the same mistakes.
in preparation
Field notes.
Benchmarks and playbooks we're packaging up for public release. Need one before it ships? Tell us which.
benchmark · Q3 2025
Llama 3.1 70B inference: hosted vs self-hosted on 8× H100
benchmark · Q3 2025
FP8 vs INT8 quantization across 14 production workloads
writeup · Q2 2025
Speculative decoding in practice: when 4× speedups vanish
playbook · Q2 2025
The migration runbook: hosted API → owned VPC, step by step
tool · github
fp-eval: an open eval-parity harness for migrations
calculator · interactive
Hosted-vs-owned spend calculator (with assumptions)
tool · github
kv-layout: KV cache layout explorer for custom kernels
writeup · Q4 2025
What we learned from 14 production migrations in 2025
playbook · Q3 2025
Sub-50ms RAG: a latency budget worked example
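To give a flavor of the comparison the spend calculator makes, here is a minimal sketch. All function names, prices, and token volumes are illustrative assumptions, not the calculator's actual model, which ships with its own documented assumptions.

```python
def hosted_monthly_cost(tokens_per_month, price_per_million_tokens):
    """Hosted API spend: you pay per token served."""
    return tokens_per_month / 1_000_000 * price_per_million_tokens

def owned_monthly_cost(gpu_count, gpu_hourly_rate):
    """Owned/reserved GPU spend: you pay for the hardware
    around the clock, whether or not it is busy."""
    hours_per_month = 24 * 30
    return gpu_count * gpu_hourly_rate * hours_per_month

# Illustrative assumptions: 5B tokens/month at $3 per 1M tokens hosted,
# versus 8 reserved GPUs at $2.50/hr.
hosted = hosted_monthly_cost(5_000_000_000, 3.0)
owned = owned_monthly_cost(8, 2.50)
print(f"hosted ${hosted:,.0f}/mo vs owned ${owned:,.0f}/mo")
```

The real decision also hinges on utilization and engineering time, which is exactly why the calculator publishes its assumptions alongside the numbers.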
methodology
How we measure.
Every benchmark on this site comes with assumptions, hardware, and a reproducible script.
Real workloads, real traffic
Numbers from actual production rollouts, sampled across the day. Not synthetic stress tests.
Eval parity before perf
We don't quote a speedup unless the model passes the same eval suite as before, within the agreed tolerance.
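The gate described above amounts to a per-metric tolerance check. A minimal sketch, where the function name, metric names, and the 0.005 tolerance are illustrative assumptions rather than our actual harness:

```python
def passes_eval_parity(baseline_scores, candidate_scores, tolerance=0.005):
    """Return True only if the candidate model matches the baseline
    on every eval metric, within the agreed absolute tolerance."""
    return all(
        candidate_scores[metric] >= baseline_scores[metric] - tolerance
        for metric in baseline_scores
    )

baseline = {"mmlu": 0.792, "gsm8k": 0.881}
after_quantization = {"mmlu": 0.790, "gsm8k": 0.879}
print(passes_eval_parity(baseline, after_quantization))  # True: both drops < 0.005
```

Only once this gate passes do we measure and quote the speedup.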
Hardware disclosed
Every benchmark says exactly which SKU, how many, what interconnect, and what the host OS was running.
Failures included
We publish the cases where it didn't work, too. The interesting margin is in what doesn't optimize.