A quiet defense of bare metal
Cloud is great. Bare metal is also great. Here's when we recommend each, and why the conventional wisdom is wrong on both sides.
The default answer in 2026 to "where should we run our inference?" is the cloud: AWS, GCP, Azure, or the GPU specialty providers (RunPod, Lambda, CoreWeave). The honest secondary answer, in many cases, is bare metal, and the conventional wisdom against it is mostly out of date.
This is a defense of bare metal as a serious option for serious AI inference workloads, together with the cases where it actually wins. It is not a religion; most teams should still pick cloud. But the set of cases where bare metal beats cloud is larger than people realize.
#The argument against bare metal, in conventional form
- Up-front capex is large.
- You have to find a colo, deal with hardware procurement, manage hardware failures.
- You lose elasticity, cannot scale to zero, cannot burst.
- You take on operational complexity that cloud abstracts.
All of these are true. None of them are decisive at scale.
#The math
Cost per H100-equivalent GPU-hour, May 2026:
- RunPod community: ~$2.00
- RunPod secure: ~$2.40
- Lambda Cloud: ~$2.99
- CoreWeave: ~$4.25
- AWS p5 on-demand: ~$8.00 (effective per-GPU rate for H100 SXM)
- Bare metal, 3-year amortization (purchase + colo + power): ~$1.10–$1.30
The bare-metal number assumes you actually use the GPU for 80%+ of its lifetime. That is the real catch. If you can keep H100s saturated for 3 years, bare metal is roughly half the price of the cheapest cloud and about a sixth the price of hyperscaler retail.
Capex is large but financeable. A single 8× H100 SXM box runs around $400K all-in (server + GPUs + initial setup). Most colos will rack-and-stack it for ~$1,500/month including power, cooling, and 100Gbps networking. A 3-year payback is achievable for any team running steady inference at $50K+/month of cloud-equivalent volume.
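If you want to sanity-check that table against your own quotes, the arithmetic fits in a few lines. The sketch below is a back-of-the-envelope model, not a procurement tool: every input is a placeholder to replace with real numbers, and it ignores financing costs, resale value, and staff time.

```python
HOURS_PER_YEAR = 365 * 24

def cost_per_used_gpu_hour(capex: float, colo_monthly: float, years: float,
                           n_gpus: int, utilization: float) -> float:
    """All-in cost per GPU-hour you actually consume."""
    total = capex + colo_monthly * 12 * years
    return total / (n_gpus * HOURS_PER_YEAR * years * utilization)

def breakeven_utilization(capex: float, colo_monthly: float, years: float,
                          n_gpus: int, cloud_rate: float) -> float:
    """Utilization at which owning matches renting at cloud_rate $/GPU-hour."""
    total = capex + colo_monthly * 12 * years
    return total / (n_gpus * HOURS_PER_YEAR * years * cloud_rate)

if __name__ == "__main__":
    # Illustrative inputs from this section: $400K 8-GPU box, $1,500/mo colo, 3 years.
    for rate, label in [(8.00, "hyperscaler on-demand"), (2.40, "specialty cloud")]:
        u = breakeven_utilization(400_000, 1_500, 3, 8, rate)
        print(f"break-even vs {label} (${rate:.2f}/hr): {u:.0%} utilization")
```

On these inputs, owning beats hyperscaler retail at quite modest utilization and beats the specialty clouds only when the boxes stay close to saturated, which is exactly the catch described above.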
#The cases where bare metal wins
Steady high-volume inference. If your traffic is predictable through the day and you keep GPUs saturated, bare metal's economics dominate.
Compliance-driven workloads. Air-gapped, ITAR, EU data residency where the cloud option is restricted to a specific zone. Bare metal in your own colo gives you the most control and the cleanest audit story.
Long-context workloads with large KV cache. Cloud GPUs come with default networking that is fine for most things and bad for tensor-parallel inference at scale. Bare metal lets you spec the fabric properly, e.g. NDR InfiniBand between nodes.
Multi-tenant SaaS with predictable load. If you are serving 1,000 enterprises and your peak is 2× your trough, bare metal's lack of elasticity is not a real problem and the cost saving funds your engineering.
Cases where data egress costs matter. Cloud GPUs that pull data from your on-prem stores or push results to your CDN incur egress charges. Bare metal in a colo with peering arrangements eliminates this layer entirely.
#The cases where cloud still wins
Variable load, especially burstable. If your workload has a 5× peak-to-trough ratio, paying for bare metal provisioned at peak means most of your spend is idle most of the time; there is a short sketch of this arithmetic at the end of this section. Cloud's elasticity is real money.
Early-stage product iteration. Before you know what your inference workload looks like in steady state, bare metal commits you to capacity assumptions you cannot yet defend.
Multi-region serving. If you need a global footprint, running multiple bare-metal colos is a much harder operational problem than multi-region cloud.
Specialty hardware you only need occasionally. Need a few B200s for a one-week training job? Cloud rental is the right answer. Bare metal is for steady-state inference, not for elastic experimentation.
Teams without ops bandwidth. Bare metal has real operational overhead: hardware failures, BIOS updates, the occasional piece of kernel-level work. If your team cannot dedicate part of an SRE's time to physical infrastructure, cloud is the right call regardless of cost.
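To put a number on the variable-load case: if you provision fixed hardware for peak, your utilization is the ratio of average load to peak load. A minimal illustration with a made-up hourly trace (the numbers are hypothetical):

```python
def fixed_capacity_utilization(hourly_load: list[float]) -> float:
    """Utilization of a GPU fleet provisioned for the peak of this trace."""
    peak = max(hourly_load)
    return sum(hourly_load) / (len(hourly_load) * peak)

# Hypothetical day: 18 quiet hours at the trough, 6 busy hours at a 5x peak.
trace = [20.0] * 18 + [100.0] * 6
print(f"{fixed_capacity_utilization(trace):.0%}")  # 40% -- most capacity sits idle
```

At a 5× ratio with short peaks, less than half the hardware you paid for is doing work.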
#The hybrid pattern
The most common production answer for serious AI workloads in 2026 is hybrid: bare metal in a colo for the steady-state base load, cloud GPUs for burst capacity and overflow. The bare-metal base carries the bulk of the volume at the low amortized cost; the cloud overflow absorbs the variability.
This is operationally more work than picking one. It requires routing logic that decides where each request goes and a control plane that can move traffic when colo capacity hits a threshold. We have built three of these. The marginal complexity is real but the cost saving is large enough to fund the engineering, and once it is built it does not need much ongoing attention.
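For concreteness, here is a minimal sketch of the routing decision at the heart of such a control plane. The names and the counter-based threshold are hypothetical; a production version also needs health checks, queue-depth feedback from the colo pool, and hysteresis so traffic does not flap between targets.

```python
import threading
from dataclasses import dataclass, field

@dataclass
class HybridRouter:
    """Send requests to owned capacity first; spill to cloud when it is full.

    colo_capacity is the number of in-flight requests the colo pool can
    carry before latency degrades -- a measured number, not a guess.
    """
    colo_capacity: int
    _in_flight: int = 0
    _lock: threading.Lock = field(default_factory=threading.Lock)

    def acquire(self) -> str:
        """Pick a target for one request; pair every "colo" with release()."""
        with self._lock:
            if self._in_flight < self.colo_capacity:
                self._in_flight += 1
                return "colo"
        return "cloud"

    def release(self, target: str) -> None:
        if target == "colo":
            with self._lock:
                self._in_flight -= 1

router = HybridRouter(colo_capacity=512)
target = router.acquire()
try:
    ...  # dispatch the request to the chosen pool's inference endpoint
finally:
    router.release(target)
```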
#What people get wrong about bare metal
"You need a data center engineer." You need a colo provider, which is different. Equinix, Coresite, INAP, Volt, these companies will rack your hardware, run your power, and handle the physical layer. Your team operates the GPUs, not the building.
"Hardware failures will eat you alive." Modern enterprise GPUs have 4–6 year MTBF on the chip. Power supplies fail more often than that, and they are hot-swappable. A 32-GPU bare-metal deployment in 2026 will see roughly one component failure per quarter, all replaceable in < 4 hours by colo staff. Plan for it; it is not a crisis.
"You cannot scale to zero." Correct, but mostly irrelevant for inference. Inference workloads are usually serving real-time traffic; scaling to zero is a training pattern.
"Procurement takes 6 months." H100 lead times have come down in 2026 to 8–12 weeks for most quantities. B200s are on a similar curve. Plan for it; it is not a deal-breaker.
#What we recommend, by stage
- < $20K/month inference spend: stay cloud, almost always. Picking GPU-specialty providers (RunPod, Lambda) over hyperscalers will save you 50%+ at no operational cost.
- $20K–$100K/month: mostly cloud, but the math for bare metal starts to work. Worth modeling.
- $100K–$500K/month: hybrid is usually optimal. Bare metal for the predictable 70% of traffic, cloud for the spikes.
- $500K+/month with predictable volume: bare metal is the default unless compliance, geography, or product roadmap argues otherwise.
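The same table as code, for teams that want to drop it into a planning script. This is a heuristic, not a law, and "predictable" is doing real work in the last branch:

```python
def deployment_recommendation(monthly_spend_usd: float, predictable: bool) -> str:
    """Stage-based heuristic keyed on cloud-equivalent inference spend."""
    if monthly_spend_usd < 20_000:
        return "cloud (specialty providers over hyperscalers)"
    if monthly_spend_usd < 100_000:
        return "mostly cloud; start modeling bare metal"
    if monthly_spend_usd < 500_000:
        return "hybrid: bare metal for the predictable base, cloud for spikes"
    return "bare metal by default" if predictable else "hybrid"
```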
#The summary
Cloud is the right default. Bare metal is the right answer in cases that are more common than people in the industry suggest. The decision is not religious; it is arithmetic, with operational risk as a tiebreaker. Most teams that should consider bare metal never do, because the conventional wisdom against it is from 2018, when GPU specialty providers did not exist and colo was actually more expensive.
If you are at a scale where it might apply, run the math. If the math says yes, the rest is execution.