Why your inference bill is your moat (and how it's also a leash)
A long argument that the next durable advantage in AI products is owning the cost curve. Hosted APIs are great, until your gross margin is somebody else's revenue line.
The first version of every AI product is built on a hosted API. That is correct. The cost of being wrong about your product hypothesis is so much higher than the cost of paying retail for tokens that no rational founder builds their own inference stack in week one. You ship the prototype on OpenAI, you find the people who will pay for it, you iterate on the product, and the money you saved on infrastructure becomes the runway you need to find product-market fit.
This is also why most companies pay too much for inference for too long.
The decision to migrate off a hosted API is not a technical decision. It is a moat decision. Every dollar of inference you pay to a hosted provider is a dollar of margin someone else captures. At low volumes that capture is fine: you are buying convenience and frontier capability you cannot replicate. At high volumes it is the entire shape of your business.
# The math nobody publishes
Hosted inference carries gross margins between 40% and 70%, depending on the model and the provider. We know this because we have seen the numbers from both sides: the customer side as a six-figure monthly bill, and the operator side as the GPU rental cost the provider actually pays. Token markup is real. It funds R&D, sales, and the model training itself. None of those things make your product better in a way you can quote on a customer call.
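You can reproduce the shape of that arithmetic yourself. Here is a back-of-the-envelope sketch; every number in it is an illustrative assumption, not a quote from any real provider's price list or from our audits:

```python
# Back-of-the-envelope estimate of a hosted provider's serving margin.
# Every constant below is an illustrative assumption, not a real quote.

TOKENS_PER_MONTH = 5_000_000_000    # 5B tokens/month of customer traffic
RETAIL_PER_1M_TOKENS = 2.00         # assumed blended retail price, $/1M tokens

GPU_HOURLY_RATE = 2.50              # assumed GPU rental, $/GPU-hour
TOKENS_PER_SECOND_PER_GPU = 800     # assumed sustained throughput per GPU

revenue = TOKENS_PER_MONTH / 1_000_000 * RETAIL_PER_1M_TOKENS
gpu_hours = TOKENS_PER_MONTH / (TOKENS_PER_SECOND_PER_GPU * 3_600)
compute_cost = gpu_hours * GPU_HOURLY_RATE

print(f"retail revenue:       ${revenue:>9,.0f}/month")
print(f"GPU rental cost:      ${compute_cost:>9,.0f}/month")
print(f"implied gross margin: {(revenue - compute_cost) / revenue:.0%}")
```

The point is not the specific output (roughly a 57% margin under these assumptions) but that the gap between retail token pricing and raw GPU cost is wide enough to be a line item worth attacking.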
When your product reaches the point where every percentage point of gross margin matters, and that point arrives faster than people expect, the hosted bill becomes the largest line item that you can compress without breaking anything. Compressing it is engineering work. It is not magic. It is what we do.
# The leash
The other half of the moat argument is the leash. Hosted APIs are convenient until they are not. Pricing changes, retired models, rate limits, regional outages, capacity drops during product launches: these are not theoretical risks. The customers we have helped most urgently in the last two years all came to us during a hosted-side incident, when their product was degraded for hours because their entire inference stack was at the mercy of someone else's pager.
You do not need to be paranoid about this to take it seriously. You only need to think about what your product looks like during a four-hour OpenAI outage on your busiest day of the year, and decide whether the version of you with a self-hosted fallback path or a fully migrated stack is meaningfully more resilient than the version without one.
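For concreteness, a minimal sketch of what that fallback path can look like. The endpoint URLs, model names, and timeout below are assumptions; both backends are assumed to speak an OpenAI-compatible chat completions API (the hosted provider natively, the self-hosted side via something like a vLLM server), and a production version would add retries, health checks, and budget caps:

```python
import requests

# Backend definitions. URLs, model names, and keys are illustrative
# placeholders, not real endpoints or credentials.
HOSTED = {"url": "https://api.openai.com/v1/chat/completions",
          "model": "gpt-4.1", "key": "sk-..."}
FALLBACK = {"url": "http://inference.internal:8000/v1/chat/completions",
            "model": "llama-3.3-70b-instruct", "key": "local"}

def complete(messages, timeout_s=10):
    """Try the hosted API first; fail over to the self-hosted stack."""
    for backend in (HOSTED, FALLBACK):
        try:
            resp = requests.post(
                backend["url"],
                headers={"Authorization": f"Bearer {backend['key']}"},
                json={"model": backend["model"], "messages": messages},
                timeout=timeout_s,
            )
            resp.raise_for_status()
            return resp.json()["choices"][0]["message"]["content"]
        except requests.RequestException:
            continue  # outage, rate limit, or timeout: try the next backend
    raise RuntimeError("all inference backends failed")
```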
# The inflection point
The break-even where self-hosted starts beating hosted on cost sits around $10K/month of consistent inference traffic. Below that, the operational overhead of running your own GPUs eats the savings. Above that, the savings compound: at $50K/month, self-hosted runs at roughly half the hosted cost; at $250K/month, roughly a third; at $1M/month, roughly a quarter, and the savings pay for the engineers who run it.
This curve is not a secret. It is on every public pricing page if you are willing to do the arithmetic. We do the arithmetic for you in the cost audit, and what comes out the other side is a written number that says: at your current volume, here is what you save, here is what it costs to get there, here is the payback period.
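The same arithmetic, written down. The savings fractions encode the rough curve from above, and the one-time migration cost is a placeholder; substitute your own bill and quotes:

```python
# Payback-period arithmetic for the curve above. The savings fractions and
# the one-time migration cost are illustrative assumptions, not quotes.

def payback_months(hosted_bill, self_hosted_fraction, migration_cost):
    """Months until the migration cost is recovered from monthly savings."""
    monthly_savings = hosted_bill * (1 - self_hosted_fraction)
    return migration_cost / monthly_savings

# Rough curve from the text: self-hosted costs ~1/2 of hosted at $50K/mo,
# ~1/3 at $250K/mo, ~1/4 at $1M/mo.
for bill, fraction in [(50_000, 1 / 2), (250_000, 1 / 3), (1_000_000, 1 / 4)]:
    months = payback_months(bill, fraction, migration_cost=150_000)  # assumed
    print(f"${bill:>9,}/mo hosted -> save ${bill * (1 - fraction):>9,.0f}/mo, "
          f"payback {months:4.1f} months")
```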
# What you do not lose
The most common objection is that self-hosting means giving up frontier capability. This is a misunderstanding of what frontier means. Frontier in 2026 means GPT-4.1, Claude Opus 4.6, Gemini Ultra, and yes, you cannot replicate those at home. But the median product use case does not need frontier. It needs a 70B-class open-weights model that runs at acceptable latency, with predictable cost, on infrastructure you control. Llama-3.3, Mistral-Small, and Qwen-32B all clear that bar for a wide range of products. The cases where they do not (hard reasoning, long-context coding, multimodal input) you keep on the hosted side, behind a router that calls the open model first and the hosted model only when the open one says "I am not sure."
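A minimal sketch of that router. The confidence heuristic here (mean per-token probability derived from log-probabilities) and the threshold are assumptions; real routers range from logprob heuristics like this one to small trained classifiers, and the two `*_complete` functions are stand-ins for your actual clients:

```python
import math

CONFIDENCE_THRESHOLD = 0.80  # assumed cutoff; tune against your own eval set

def open_model_complete(prompt: str) -> tuple[str, list[float]]:
    """Stand-in for the self-hosted endpoint. Assumed to return the completion
    text plus per-token log-probabilities (vLLM and most OpenAI-compatible
    servers can return logprobs)."""
    raise NotImplementedError("wire this to your self-hosted server")

def hosted_model_complete(prompt: str) -> str:
    """Stand-in for the hosted frontier API."""
    raise NotImplementedError("wire this to your hosted provider")

def route(prompt: str) -> str:
    """Answer with the open model; escalate to hosted only on low confidence."""
    text, token_logprobs = open_model_complete(prompt)
    # Mean per-token probability as a crude self-confidence signal.
    confidence = math.exp(sum(token_logprobs) / max(len(token_logprobs), 1))
    if confidence >= CONFIDENCE_THRESHOLD:
        return text  # hot path: the request never leaves your infrastructure
    return hosted_model_complete(prompt)  # long tail: pay retail for frontier
```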
# The hybrid is the realistic outcome
Most teams we work with end up at 70/30 or 80/20: most traffic on owned GPUs running open-weights models, the long tail on hosted APIs for the cases that actually need frontier capability. The cost saving on the 80% bulk is large enough that you can afford to pay retail for the 20%, and the resulting product is faster than the all-hosted version because the hot path no longer crosses the public internet to a vendor data center.
This hybrid is the version of self-sovereign that actually ships. It does not require religion about open source. It does not require giving up the things hosted APIs are still genuinely good at. It just requires owning the parts of the cost curve that you can.
# What this means in practice
If you are running <$10K/month: stay hosted. Spend your engineering on the product. Come back when the bill grows.
If you are running $10K–$50K/month: do the audit. The savings may or may not justify the migration depending on your traffic shape. We have walked away from engagements at this size where the math did not work; we have run them where it did.
If you are running $50K+/month: the savings almost certainly justify the migration. The question is timing: when your team has the bandwidth to handle the cutover, and what "done" looks like for the team that will operate the system after we leave.
The inference bill is not a fixed cost. It is something you can compress, and the compression is structurally larger than almost any other infrastructure cost in your stack. Owning that compression is the moat. Paying retail forever is the leash. Most companies, somewhere along their growth curve, decide which one they want.