Migrating off OpenAI in 6 weeks without breaking your evals
A week-by-week breakdown. What can be parallelized, what cannot, and where the failure modes hide.
The minimum realistic time-to-cutover for a clean OpenAI-to-self-hosted migration is 6 weeks. Faster than that and you are skipping eval work that you will pay for later. Slower than 14 weeks usually means scope creep: the interventions may be right for that scope, but the scope itself should have been smaller.
This is the 6-week version, which works for a single mid-tier workload (one model, one product surface, no exotic compliance requirements). Most of our engagements run this shape; longer ones handle special cases at the edges.
#Week 0, pre-engagement (audit)
Before week 1, the audit has answered four questions: what is the volume, what is the candidate model, what is the eval bar, and what does success look like. Both teams have signed off on the parity tolerance. The cutover plan exists in writing.
If any of those are missing, week 1 starts with a delay.
#Week 1, baseline + infra
Customer team: capture a 7-day production traffic sample with prompt/response pairs and per-request metadata. Identify the 200–500 prompts that will form the workload-specific golden set. Confirm the GPU footprint (rented or owned) is provisioned and accessible.
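A minimal sketch of what that golden-set selection could look like, assuming the traffic capture is a JSONL file with `prompt`, `response`, and `route` fields (all hypothetical names); stratifying by product surface keeps the set representative:

```python
# Sketch: select a 200-500 prompt golden set from the 7-day capture,
# stratified by product surface so it mirrors the real distribution.
# Assumes traffic.jsonl with "prompt", "response", "route" fields
# (all hypothetical names).
import json
import random
from collections import defaultdict

TARGET_SIZE = 300
random.seed(0)  # reproducible selection, so the golden set is stable

by_route = defaultdict(list)
with open("traffic.jsonl") as f:
    for line in f:
        record = json.loads(line)
        by_route[record["route"]].append(record)

total = sum(len(records) for records in by_route.values())
golden = []
for route, records in by_route.items():
    k = max(1, round(TARGET_SIZE * len(records) / total))
    golden.extend(random.sample(records, min(k, len(records))))

with open("golden_set.jsonl", "w") as f:
    for record in golden:
        f.write(json.dumps(record) + "\n")
```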
fastpriors: set up the candidate runtime (vLLM by default) on the GPU footprint, smoke-test with a public benchmark, and establish that the operational basics work: health checks, auto-restart, log aggregation.
End of week 1 deliverable: baseline OpenAI eval numbers on the golden set + standard benchmarks. Working vLLM deployment that responds to /v1/completions with the right model loaded.
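A smoke test against that deliverable might look like the sketch below, assuming the server is reachable on localhost:8000 and was launched with something like `vllm serve <model> --port 8000` (exact flags depend on the vLLM version):

```python
# Sketch: smoke test the vLLM OpenAI-compatible endpoint.
import requests

BASE_URL = "http://localhost:8000"  # hypothetical host/port

# Confirm the right model is loaded before checking completions.
models = requests.get(f"{BASE_URL}/v1/models", timeout=10).json()
model_id = models["data"][0]["id"]
print("loaded model:", model_id)

resp = requests.post(
    f"{BASE_URL}/v1/completions",
    json={
        "model": model_id,
        "prompt": "Say 'ok' and nothing else.",
        "max_tokens": 5,
        "temperature": 0,
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```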
#Week 2, eval infrastructure + parity test 1
Customer team: instrument the eval pipeline to run against both endpoints. Do the customer-side rubric work for any non-standard quality dimensions (tone, style, calibration).
fastpriors: first end-to-end eval-parity run between OpenAI baseline and the candidate model running on vLLM. Identify regressions; categorize them as (a) within tolerance, (b) over tolerance fixable via tuning, (c) over tolerance requiring scope adjustment.
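A sketch of how that parity run can be wired, assuming both endpoints speak the OpenAI chat-completions API; the model names, tolerance value, and stand-in scorer are placeholders for what was agreed in week 0:

```python
# Sketch: run the golden set through both endpoints and report the mean
# score delta against the agreed tolerance. Model names and tolerance are
# placeholders; score() is a stand-in for the workload-specific rubric.
import json
from openai import OpenAI  # openai>=1.0 client works with any compatible server

baseline = OpenAI()  # api.openai.com, key from OPENAI_API_KEY
candidate = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

TOLERANCE = 0.02  # hypothetical parity tolerance on the mean score delta

def complete(client, model, prompt):
    out = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=256,
    )
    return out.choices[0].message.content

def score(response, reference):
    # Stand-in scorer (token overlap vs the captured production response).
    # Replace with the week-2 rubric.
    resp, ref = set(response.lower().split()), set(reference.lower().split())
    return len(resp & ref) / max(len(ref), 1)

deltas = []
with open("golden_set.jsonl") as f:
    for line in f:
        item = json.loads(line)
        s_base = score(complete(baseline, "gpt-4o-mini", item["prompt"]), item["response"])
        s_cand = score(complete(candidate, "candidate-model", item["prompt"]), item["response"])
        deltas.append(s_base - s_cand)

mean_delta = sum(deltas) / len(deltas)
verdict = "within tolerance" if mean_delta <= TOLERANCE else "over tolerance"
print(f"mean regression vs baseline: {mean_delta:.4f} ({verdict})")
```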
End of week 2 deliverable: first parity report. If category (c) regressions exist, this is where the engagement scope is renegotiated. Most engagements don't hit (c), but it happens.
#Week 3, quantization + optimization 1
fastpriors: apply the highest-leverage quantization (typically FP8 on H100, INT8 on A100). Re-run the eval suite to confirm parity is maintained. Set up continuous batching parameters for the customer's prompt distribution.
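For illustration, here is roughly what the quantized configuration plus batching knobs look like with vLLM's offline Python API; argument names track recent vLLM releases and may shift between versions, and the values are placeholders to tune against the customer's prompt distribution:

```python
# Sketch: quantized config plus continuous-batching knobs via vLLM's offline
# Python API, re-scoring a golden prompt on the quantized model.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/models/candidate",         # hypothetical local checkpoint path
    quantization="fp8",                # on A100, serve a pre-quantized INT8
                                       # checkpoint instead of this flag
    max_num_seqs=256,                  # continuous-batching concurrency cap
    max_num_batched_tokens=8192,       # per-step token budget for the scheduler
    gpu_memory_utilization=0.90,       # leave headroom for KV-cache spikes
)

params = SamplingParams(temperature=0, max_tokens=256)
outputs = llm.generate(["Say 'ok' and nothing else."], params)
print(outputs[0].outputs[0].text)
```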
Customer team: implement a feature flag that can route per-request traffic between OpenAI and the new endpoint. This is the cutover mechanism.
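A minimal sketch of that routing flag, assuming a percentage knob in an environment variable and placeholder endpoint and model names; hashing the request id keeps routing deterministic per request:

```python
# Sketch: per-request routing between OpenAI and the self-hosted endpoint.
# The env-var knob, endpoint URL, and model names are hypothetical; the point
# is that routing is decided per request and can be reverted instantly.
import hashlib
import os
from openai import OpenAI

openai_client = OpenAI()
selfhosted_client = OpenAI(base_url="http://inference.internal/v1", api_key="unused")

def pick_client(request_id: str):
    rollout_pct = int(os.environ.get("SELFHOSTED_ROLLOUT_PCT", "0"))
    # Hash the request id so each request gets a stable bucket and the
    # split stays consistent across processes and retries.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    if bucket < rollout_pct:
        return selfhosted_client, "candidate-model"
    return openai_client, "gpt-4o-mini"
```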
End of week 3 deliverable: quantized model running with eval parity. Feature flag wired in production code (still defaulting to OpenAI). Per-request routing capability tested.
#Week 4, shadow traffic
Both teams: keep 100% of production traffic on OpenAI (no behavior change for users). In parallel, mirror that traffic to the new endpoint, drop the responses, and capture metrics.
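One way to implement the mirror, sketched with `httpx` and a plain list standing in for the metric sink; the candidate URL is hypothetical:

```python
# Sketch: mirror a production request to the candidate endpoint without
# touching the user-facing path. The response body is dropped; only latency
# and status are recorded.
import asyncio
import time
import httpx

CANDIDATE_URL = "http://inference.internal/v1/completions"

async def mirror(payload: dict, metrics: list) -> None:
    start = time.perf_counter()
    try:
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.post(CANDIDATE_URL, json=payload)
        status = resp.status_code
    except Exception as exc:
        status = f"error: {exc!r}"
    # The candidate's response is intentionally discarded.
    metrics.append({"latency_s": time.perf_counter() - start, "status": status})

async def handle_request(payload: dict, metrics: list):
    # Serve the user from OpenAI exactly as before (omitted), then fire the
    # mirror without awaiting it so shadow traffic never adds user latency.
    asyncio.create_task(mirror(payload, metrics))
```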
What we are watching: tail latency under real load, error rate, and eval drift on the real-distribution traffic (vs the static golden set). This is when the things that did not show up offline surface: long tails, distribution shift, edge cases.
End of week 4 deliverable: 7 days of shadow traffic data. Comparison of new endpoint metrics vs OpenAI baseline. Decision point: are we ready to flip 5% of real traffic to the new endpoint?
#Week 5, staged rollout
5% rollout for 48 hours, watching the live drift dashboard, error rates, and the customer-side product metrics that depend on inference quality (NPS, completion rates, escalation rates).
Then 25% for 48 hours. If clean, 50%. If clean, 100%.
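The ramp can be expressed as data plus a guard check, roughly as sketched below, assuming the rollout percentage is the same knob the week-3 feature flag reads; metric names and thresholds are placeholders for the customer's real dashboard queries:

```python
# Sketch: the week-5 ramp as data plus a guard check.
RAMP = [5, 25, 50, 100]  # % of traffic, each held for ~48 clean hours
HOLD_HOURS = 48

def clean(window: dict) -> bool:
    # Placeholder thresholds; use the values agreed in the cutover plan.
    return (window["error_rate"] <= 0.001
            and window["p99_latency_s"] <= 2.0
            and abs(window["eval_drift"]) <= 0.02)

def next_stage(current_pct: int, window: dict) -> int:
    if not clean(window):
        return 0  # flip back to OpenAI, investigate, resume the ramp later
    i = RAMP.index(current_pct)
    return RAMP[min(i + 1, len(RAMP) - 1)]
```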
OpenAI stays warm as a fallback. If anything goes wrong, the feature flag flips back instantly. Most rollouts do not flip back; some have to, and the warm fallback is what lets you do it without panic.
End of week 5 deliverable: 100% of new traffic on the self-hosted endpoint. OpenAI fallback retained. Production drift dashboard live.
#Week 6, handover
fastpriors: document the runbooks: on-call playbook, eval drift response, common incidents, dependency upgrade procedure. Train the customer team through a runbook walk-through and a deliberate fault-injection drill.
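The fault-injection drill can be as simple as the sketch below: take the candidate down on purpose, confirm the fallback path keeps serving, then bring it back. URLs and payload are placeholders, and the alerting check stays manual:

```python
# Sketch: a minimal fault-injection drill for handover week.
import requests

def probe(url: str, payload: dict) -> bool:
    try:
        return requests.post(url, json=payload, timeout=10).ok
    except requests.RequestException:
        return False

def run_drill(candidate_url: str, fallback_url: str, payload: dict) -> None:
    assert probe(candidate_url, payload), "candidate should be healthy before the drill"
    input("Stop the candidate service now, then press Enter...")
    assert not probe(candidate_url, payload), "candidate should be down mid-drill"
    assert probe(fallback_url, payload), "fallback path must keep serving traffic"
    print("Drill passed: fallback served traffic while the candidate was down.")
```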
Customer team: first on-call rotation including the new endpoint. We pair on-call for weeks 6 and 7, then they own it.
End of week 6 deliverable: handover document signed. Customer team has shipped one configuration change to the new endpoint themselves. We are reachable for clarification questions for 90 days post-engagement at no charge.
#Where the failure modes hide
The 6-week schedule has three high-risk points:
- End of week 2: the parity test reveals a regression that does not fit in the planned tuning effort. Scope renegotiation needed. We catch this early because the parity work is week 2, not week 5.
- Middle of week 4: shadow traffic reveals an edge case that the static eval did not catch. Usually fixable with a tuning change; occasionally requires scope renegotiation.
- Beginning of week 5: the customer team realizes they are not yet comfortable owning the new system. The schedule extends by 1–2 weeks of additional pairing. We plan for this implicitly via the 90-day post-engagement support window.
#What does NOT belong in 6 weeks
- Multi-region deployment from scratch.
- Custom kernel work.
- Compliance-heavy deployment (HIPAA, FedRAMP, ITAR): add 4–6 weeks for the compliance work.
- Migrating multiple models simultaneously: do them in series.
- Net-new evaluation infrastructure if the customer has none: add 2–3 weeks.
If your situation has any of those, the migration schedule is 8–14 weeks instead of 6. The shape is the same; the phases are longer.
#The single biggest piece of advice
Do the eval-parity work in week 2. Not week 5. The teams that fail at this kind of migration almost universally underestimate the eval phase, start it too late, and end up either shipping a regression or rolling back the cutover.
If you only remember one thing from this playbook, it is that the eval suite is the gate. Build it, measure against it, do not flip traffic without it. Everything else is detail.