Field notes from the inference trenches.
We write a few times a month. Mostly engineering writeups, occasionally a longer essay, sometimes a postmortem on something we got wrong.
Why your inference bill is your moat (and how it's also a leash)
A long argument that the next durable advantage in AI products is owning the cost curve. Hosted APIs are great — until your gross margin is somebody else's revenue line. Here's how we think about the tradeoff, and the inflection point at which it pays to migrate.
What we're writing.
Posts in progress. We publish when they're ready, not on a schedule. Want one of these on a deadline? Tell us.
Speculative decoding doesn't help if your batch size is wrong
Three case studies where the same trick gave a 4× speedup in one workload and a 0.94× regression in another.
drafting →
FP8 inference on H100: a worked migration
Hands-on log of moving a Llama 3.1 70B workload from BF16 to FP8 on H100.
drafting →
The cost-audit one-pager we send to every prospect
Open-sourcing our audit format. Use it. Steal it. Send it back to us.
drafting →
Sub-50ms RAG with a $14K monthly budget
Anonymized: how a Series A team cut their retrieval latency in half on the same hardware.
drafting →
When custom kernels are worth it (and when they're not)
A decision tree we use internally. The short version: don't, until you've checked these six things.
drafting →
What 14 migrations taught us about eval parity
Eval parity is the part everyone underestimates. Here's how we structure it now.
drafting →
Pricing transparency: what we charge and why
Our full price list, plus the assumptions baked into each engagement type.
drafting →
A quiet defense of bare metal
Cloud is great. Bare metal is also great. Here's when we recommend each.
drafting →