
Sub-50ms RAG with a $14K monthly budget

Anonymized: how a Series A team cut their retrieval latency in half on the same hardware. The architecture, the surprises, the parts we got wrong.

March 18, 2026 · by Abhimanyu Singh

Anonymized case study from a Series A B2B SaaS shipping a RAG-driven feature against a 30M-document corpus. They came to us with a $14K/month inference bill, a 110ms p95 retrieval latency, and a product team that needed both numbers cut roughly in half. We hit 31ms p95 at $13.6K/month on the same hardware, with a different retrieval stack.

Here is how, with the parts that were not obvious.

#Starting state

Retrieval pipeline (a rough code sketch follows the list):

  1. Query embedding via OpenAI text-embedding-3-small
  2. Vector search in Pinecone (top-100)
  3. Re-rank via OpenAI gpt-4.1-mini using a structured prompt
  4. Top-5 results stitched into the LLM prompt for generation
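
Sketched in Python for reference. The names, index layout, payload fields, and rerank prompt below are illustrative, not the client's actual code:

```python
from openai import OpenAI
from pinecone import Pinecone

openai_client = OpenAI()
pinecone_index = Pinecone(api_key="...").Index("docs")  # hypothetical index name

def retrieve(query: str) -> list[dict]:
    # 1. Query embedding via OpenAI text-embedding-3-small (~18ms p95)
    emb = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding

    # 2. Vector search in Pinecone, top-100 (~22ms p95)
    matches = pinecone_index.query(
        vector=emb, top_k=100, include_metadata=True
    ).matches

    # 3. LLM rerank with a structured prompt (~58ms p95, and the big cost line)
    numbered = "\n".join(
        f"[{i}] {m.metadata['text'][:500]}" for i, m in enumerate(matches)
    )
    resp = openai_client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system", "content": "Return the indices of the 5 passages "
                                          "most relevant to the query, comma-separated."},
            {"role": "user", "content": f"Query: {query}\n\nPassages:\n{numbered}"},
        ],
    )
    # Naive parse for the sketch; the real prompt returned structured output
    keep = [int(i) for i in resp.choices[0].message.content.split(",")]

    # 4. Top-5 stitched into the generation prompt downstream
    return [matches[i].metadata for i in keep[:5]]
```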

Latency split was approximately:

  • Embedding: 18ms p95
  • Vector search: 22ms p95
  • LLM rerank: 58ms p95
  • Generation prefill: 12ms p95

The rerank was the bottleneck, but it was also expensive ($4.8K/month at the rerank step alone). Killing the rerank entirely would have improved both numbers, but at a measurable quality cost; they had already run that experiment and rolled it back.

#What we changed

The migration was three things, in order:

1. Move the embedding to a domain-tuned bi-encoder

OpenAI's text-embedding-3-small is good but generic. We swapped it for a small bi-encoder fine-tuned on their domain, using their existing labeled data plus a 50K weak-label set we generated from production logs. The fine-tune ran in 6 hours on 4× H100. Embedding latency dropped from 18ms to 4ms because we self-hosted it in their VPC, and recall@100 improved by 12% on their eval set.
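
A minimal sketch of that fine-tune with sentence-transformers, assuming the training data is (query, relevant passage) pairs. The base model, batch size, and epochs are placeholders, not the exact configuration we ran:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Base model is a placeholder; pick a small bi-encoder you can serve at ~4ms p95
model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# (query, relevant_passage) tuples from the labeled data plus the weak-label set
pairs = [("example query", "example relevant passage")]

train_examples = [InputExample(texts=[q, passage]) for q, passage in pairs]
loader = DataLoader(train_examples, shuffle=True, batch_size=256)

# In-batch negatives: every other passage in the batch is treated as a negative
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(loader, loss)],
    epochs=2,
    warmup_steps=1000,
)
model.save("domain-biencoder-v1")
```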

2. Replace LLM rerank with a cross-encoder reranker

This was the highest-leverage change. We trained a small cross-encoder reranker on the same labeled data, deployed it self-hosted, and retired the LLM rerank step entirely. Quality on their eval set was within 1.8% of the LLM rerank. Latency: 58ms → 9ms. Cost: $4.8K/month → ~$200/month of GPU.
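
On the inference side the swap is small. A sketch with sentence-transformers' CrossEncoder; the model path is hypothetical:

```python
from sentence_transformers import CrossEncoder

# Self-hosted fine-tuned cross-encoder; path is a placeholder
reranker = CrossEncoder("models/domain-cross-encoder-v1", max_length=512)

def rerank(query: str, passages: list[str], top_k: int = 5) -> list[int]:
    # One forward pass per (query, passage) pair, batched on the GPU
    scores = reranker.predict([(query, p) for p in passages], batch_size=64)
    # Highest score first; return the indices of the top_k passages
    return scores.argsort()[::-1][:top_k].tolist()
```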

The trick to making this work was the eval. Without their existing labeled rerank data, this swap would have been a quality bet. With the data, it became a measurement.
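
The measurement itself is cheap once graded labels exist. A minimal sketch, assuming the eval set maps each query to per-document relevance judgments:

```python
import math

def ndcg_at_k(ranked_grades: list[float], k: int = 5) -> float:
    # ranked_grades: graded relevance of docs in the order a reranker put them
    dcg = sum(g / math.log2(i + 2) for i, g in enumerate(ranked_grades[:k]))
    ideal = sorted(ranked_grades, reverse=True)
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def mean_ndcg(rank_fn, eval_set: dict, k: int = 5) -> float:
    # eval_set: {query: {doc_id: graded_relevance}}
    # rank_fn(query, doc_ids) returns doc_ids in the order the reranker prefers
    per_query = []
    for query, judgments in eval_set.items():
        ranked_ids = rank_fn(query, list(judgments))
        per_query.append(ndcg_at_k([judgments[d] for d in ranked_ids], k))
    return sum(per_query) / len(per_query)

# Run once for the LLM reranker, once for the cross-encoder, compare the two numbers.
```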

3. Co-locate the vector store

Their Pinecone index was served out of us-east-1; their inference ran in us-west-2. The cross-region hop was eating ~6ms per call. We stood up a self-hosted Qdrant in their VPC, replicated the corpus, and routed retrieval there. Latency: 22ms → 13ms (mostly network reduction; the actual search is ~7ms either way).
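
Routing retrieval to the in-VPC cluster is a small change at the call site. Sketched with qdrant-client; the internal hostname, collection name, and payload fields are hypothetical:

```python
from qdrant_client import QdrantClient

# In-VPC endpoint, so the hop stays inside us-west-2
qdrant = QdrantClient(url="http://qdrant.internal:6333")

def vector_search(query_embedding: list[float], top_k: int = 100):
    hits = qdrant.search(
        collection_name="docs",
        query_vector=query_embedding,
        limit=top_k,
        with_payload=True,
    )
    return [(hit.id, hit.payload) for hit in hits]
```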

The Qdrant cluster runs $1.6K/month, more than covered by the retired Pinecone bill plus another ~$2K/month in egress they no longer pay.

#Final pipeline

  1. Query embedding via in-VPC bi-encoder: 4ms p95
  2. Vector search in in-VPC Qdrant: 13ms p95
  3. Cross-encoder rerank in VPC: 9ms p95
  4. LLM prefill (now using the existing self-hosted 8B model): 5ms p95

Total: 31ms p95. Down from 110ms.
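
The per-stage numbers come from timing each stage at the call site. A minimal version of that instrumentation; embed, vector_search, and rerank stand in for the stage functions sketched earlier:

```python
import time
from collections import defaultdict
from contextlib import contextmanager

stage_ms = defaultdict(list)

@contextmanager
def timed(stage: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_ms[stage].append((time.perf_counter() - start) * 1000.0)

def p95(samples: list[float]) -> float:
    ordered = sorted(samples)
    return ordered[min(len(ordered) - 1, int(0.95 * len(ordered)))]

def retrieve(query: str):
    with timed("embed"):
        vec = embed(query)                    # in-VPC bi-encoder
    with timed("search"):
        hits = vector_search(vec)             # in-VPC Qdrant, top-100
    with timed("rerank"):
        top5 = rerank(query, [p["text"] for _, p in hits])  # cross-encoder
    return top5
```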

Per-month spend:

  • Embedding: ~$400 (was $2,200 OpenAI)
  • Vector store: $1,600 (was $3,800 Pinecone)
  • Reranker: ~$200 (was $4,800 OpenAI rerank)
  • LLM: $11,400 (unchanged from prior self-hosted setup)

Total: $13,600/month. Saved $410 a month, plus all the operational pain of cross-account egress and rate-limit retries.

#What we got wrong

The first attempt at the cross-encoder reranker failed. We trained on the wrong loss function (pairwise hinge loss when the data was actually graded relevance with multiple ordinal levels). Quality was 6% below the LLM rerank, not acceptable. Retrained with a listwise loss (LambdaRank), got within tolerance.
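
For the record, the shape of the fix rather than the exact training code: pairwise hinge flattens graded labels into binary better/worse pairs, while a listwise objective scores a query's whole candidate list against the grades. A toy contrast in PyTorch; the production retrain used LambdaRank, and the listwise loss below is a simpler ListNet-style stand-in:

```python
import torch
import torch.nn.functional as F

def pairwise_hinge_loss(scores_pos: torch.Tensor,
                        scores_neg: torch.Tensor,
                        margin: float = 1.0) -> torch.Tensor:
    # What we trained first: treats labels as binary relevant/irrelevant,
    # which throws away the ordering information in graded (ordinal) labels.
    return F.relu(margin - (scores_pos - scores_neg)).mean()

def listwise_softmax_loss(scores: torch.Tensor,
                          graded_labels: torch.Tensor) -> torch.Tensor:
    # Listwise alternative: the whole candidate list for one query is scored
    # at once, and the target distribution comes from the graded labels.
    # scores: (n_candidates,), graded_labels: (n_candidates,) ordinal grades.
    target = F.softmax(graded_labels.float(), dim=0)
    log_pred = F.log_softmax(scores, dim=0)
    return -(target * log_pred).sum()
```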

The second mistake was on the embedding model: we initially shipped a 768-dim embedding to match the original. We later realized their search workload was not bottlenecked on retrieval recall, and a 384-dim embedding would have been fine. The 768-dim version cost an extra ~$120/month in vector store memory we did not need. We left it in place because the saving was small and the change would have required reindexing.

#What did not change

The LLM stack itself, the model serving the final generation step, was untouched. They had already self-hosted that on 4× H100 SXM and the latency was acceptable. The retrieval-side pipeline was the bottleneck, and that is where the engineering went.

This is a recurring pattern. Teams optimizing RAG often start with the LLM, where there is the most public information about how to optimize, but the actual leverage is usually in the retrieval pipeline (embedding, store, reranker), which gets less attention.

#What this generalizes to

Replacing an LLM-as-reranker with a cross-encoder is the highest-leverage RAG optimization most teams have not done yet. It only works if you have labeled rerank data, but most production RAG products generate that data implicitly (via click-through, dwell time, explicit feedback); they just have not surfaced it.
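
Surfacing it usually looks like a small log-mining job. A hedged sketch with hypothetical field names; the thresholds are a judgment call per product:

```python
# Map implicit feedback events to graded relevance labels for reranker training.
def graded_label(event: dict) -> int:
    if event.get("explicit_feedback") == "helpful":
        return 3
    if event.get("clicked") and event.get("dwell_seconds", 0) >= 30:
        return 2
    if event.get("clicked"):
        return 1
    return 0  # shown but ignored

def build_training_rows(log_events: list[dict]) -> list[tuple[str, str, int]]:
    # (query, document_id, graded relevance) rows for the reranker fine-tune
    return [
        (e["query"], e["doc_id"], graded_label(e))
        for e in log_events
        if "query" in e and "doc_id" in e
    ]
```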

The other generalizable lesson: co-location matters. A vector store in a different region from your LLM serving costs a constant 5–15ms per call. That is half the latency budget for a sub-50ms RAG product.

Want this on your stack?

The cost audit lands at a number, not a recommendation. Refundable.

Talk to an engineer →
Try the calculator