Field notes from the inference trenches.
We write a few times a month. Mostly engineering writeups, occasionally a longer essay, sometimes a postmortem on something we got wrong.
Why your inference bill is your moat (and how it's also a leash)
A long argument that the next durable advantage in AI products is owning the cost curve. Hosted APIs are great — until your gross margin is somebody else's revenue line. Here's how we think about the tradeoff, and the inflection point at which it pays to migrate.
What we're writing.
Posts in progress. We publish when they're ready, not on a schedule. Want one of these on a deadline? Tell us.
Speculative decoding doesn't help if your batch size is wrong
Three case studies where the same trick gave a 4× speedup in one workload and a 0.94× regression in another.
drafting →
FP8 inference on H100: a worked migration
Hands-on log of moving a Llama 3.1 70B workload from BF16 to FP8 on H100.
drafting →
The cost-audit one-pager we send to every prospect
Open-sourcing our audit format. Use it. Steal it. Send it back to us.
drafting →
Sub-50ms RAG with a $14K monthly budget
Anonymized: how a Series A team cut their retrieval latency in half on the same hardware.
drafting →
When custom kernels are worth it (and when they're not)
A decision tree we use internally. The short version: don't, until you've checked these six things.
drafting →
What 14 migrations taught us about eval parity
Eval parity is the part everyone underestimates. Here's how we structure it now.
drafting →
Pricing transparency: what we charge and why
Our full price list, plus the assumptions baked into each engagement type.
drafting →
A quiet defense of bare metal
Cloud is great. Bare metal is also great. Here's when we recommend each.
drafting →