RAG in production: retrieval reliability, not just retrieval
In short: In a demo, a RAG system is graded on the queries you tried. In production, it must retrieve well on queries you did not try, keep doing so as the underlying documents change, and decline to answer when the evidence is not there. The failure mode almost never originates in the language model — it originates in retrieval. Measuring retrieval quality and answer-groundedness continuously, and gating on those numbers, is the discipline that separates a RAG demo from a RAG product.
NewGenApps ships AI into production and proves it works — instrumented, monitored, evaluated against ground truth, including systems answering to real users in regulated settings. That posture is applied here to retrieval. What follows reflects how retrieval reliability is actually held in production, not how it looks in a notebook.
If you are newer to what RAG is, how it differs from fine-tuning, and where it fits in an AI architecture, start with our overview of RAG, RAG vs fine-tuning, and common failure modes. This page picks up where that one ends: at the point where a retrieval system meets real users.
What changes when RAG goes to production?
In a demo, retrieval quality is graded on the handful of queries the team tried; in production, retrieval quality is the variable that determines whether the system is trustworthy, and it degrades silently as the corpus changes underneath it.
In a demo, the corpus is small and fixed, queries are hand-picked, and the verdict — "that looks right" — is reached by eye on a sample of perhaps five questions. The system passes because no one tried the questions that would have exposed its retrieval gaps.
In production, the picture is different in every dimension. The query distribution is set by real users, not by the team that built the system. Documents are added, updated, and deprecated continuously. The correctness bar is held by users and, in regulated contexts, auditors. And crucially: the failures that were invisible in the demo now surface — not as retrieval failures, which would be diagnosable, but as complaints that "the AI is hallucinating."
Most of those hallucination complaints are retrieval failures in a costume. If the passage that contains the answer never reaches the model, no model can answer correctly; it will instead produce a fluent, plausible, ungrounded guess. The production move is not to get a better model. It is to measure retrieval quality and answer-groundedness continuously, and gate on them: the same evaluated-reliability posture that applies across this cluster of pages, pointed at retrieval.
Why do production RAG systems fail?
Production RAG failures almost always trace back to retrieval — the right passage never reached the model — not to the model generating something false from good evidence.
Here is the diagnostic taxonomy, ordered by how a practitioner should triage:
1. Retrieval recall failure, misread as hallucination. When the right passage is absent from the retrieved set, the model has nothing to ground on and fills the gap. The fix is not the model; it is recall. A hallucination complaint should trigger a retrieval audit first, a model investigation second.
2. Chunking that splits the answer. If a chunk boundary falls through the middle of the fact, no single retrieved chunk contains the whole answer, and similarity search returns half-answers. Chunk size and overlap determine whether an answer is retrievable in one piece at all. This is not a cosmetic preprocessing decision. (Barnett et al., Seven Failure Points When Engineering a Retrieval Augmented Generation System, IEEE/ACM CAIN 2024 / arXiv:2401.05856 — failure point "Not in Context," where the passage carrying the answer fails to survive into the consolidated context.)
3. No reranker. First-pass retrieval returns passages ranked by approximate similarity, not by whether they actually answer the specific question. Without a reranker, the most-relevant evidence can sit outside the position the model uses best. Skipping rerank is one of the most common reasons a demo that "retrieves the right thing" degrades on the long tail.
4. Dense-only retrieval fails out-of-domain. BM25 (lexical) retrieval remains a strong baseline and frequently outperforms dense retrievers on out-of-domain queries — in-domain evaluation is not a reliable predictor of out-of-domain performance (Thakur et al., BEIR, NeurIPS 2021 / arXiv:2104.08663). Hybrid retrieval — semantic plus keyword, then rerank — is the production default: keyword catches codes, proper nouns, and acronyms that embeddings blur; semantic catches the paraphrases keyword misses.
5. Position failure — the right chunk reached the model, and still wasn't used. Performance on multi-document QA degrades by more than 20 percentage points when the relevant passage sits in the middle of a long context, a U-shaped curve that persists in long-context models (Liu et al., Lost in the Middle, TACL 2023 / arXiv:2307.03172). Adding more chunks to a long context window is not a substitute for precise retrieval and rerank: Databricks AI Research confirmed that most models' RAG performance declines past a context-length threshold (Leng et al., arXiv:2411.03538, November 2024). Retrieval hit-rate is not answer-faithfulness.
6. No abstention. A system not instructed to say "I don't know" when retrieved evidence is absent will generate anyway, from thin or absent context. A confident wrong answer is the expensive failure in production; "the documents don't cover this" is often the more valuable behavior. Abstention is a designed capability — a retrieval-score threshold below which the system declines — not a default.
7. Citations that don't survive the click. An answer can name a source that doesn't support the attributed claim. The discipline that catches this is a faithfulness metric: decompose the answer into atomic claims and check each against the retrieved evidence. RAG reduces hallucination; it does not eliminate it. A corpus of nearly 18,000 naturally-generated RAG responses, span-annotated at the word level, found that even when relevant context was retrieved and placed in front of the model, LLMs still presented unsupported or contradictory claims relative to the retrieved content (Niu et al., RAGTruth, ACL 2024 / arXiv:2401.00396). Faithfulness is a measured number, not a vibe. Stale or ungoverned source data is a retrieval-reliability failure before it is a model failure.
8. No eval harness, and ignored access control. Without a standing harness, every change to the chunker, embedding model, top-k, reranker, or prompt is an unmeasured bet. And access control is a retrieval-reliability failure, not just a security concern: if the retriever surfaces documents a given user is not permitted to see, the system is reliably retrieving the wrong thing. The permission model must be enforced in the retriever — before ranking — not assumed by the language model.
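To make item 2 concrete, here is a minimal sketch of how overlap decides whether a fact is retrievable in one piece. The word-window chunker, the `chunk_words` and `retrievable_in_one_piece` names, and the sample sentence are all illustrative assumptions; a production chunker would more likely split on tokens, sentences, or document structure.

```python
def chunk_words(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def retrievable_in_one_piece(fact: str, chunks: list[str]) -> bool:
    """True if at least one chunk contains the whole fact (toy substring check)."""
    return any(fact in chunk for chunk in chunks)


# Hypothetical document and fact, for illustration only.
doc = "The refund window is thirty days from delivery for all retail orders placed online"
fact = "thirty days from delivery"

no_overlap = chunk_words(doc, size=6, overlap=0)    # a boundary falls inside the fact
with_overlap = chunk_words(doc, size=6, overlap=3)  # one window contains the whole fact

print(retrievable_in_one_piece(fact, no_overlap))
print(retrievable_in_one_piece(fact, with_overlap))
```

With no overlap, the boundary after the sixth word splits the fact across two chunks; with an overlap of three words, one window carries it whole. Overlap has a cost in index size and duplicate hits, so the setting is worth measuring on the evaluation set rather than guessing.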
The synthesis: in production RAG, the failure is almost never the model. It is recall, chunking, position, faithfulness, abstention, drift, or access. Every one is measurable. "The AI is hallucinating" is the symptom; the diagnosis is in the retrieval pipeline.
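The hybrid retrieval described in item 4 needs some way to merge a keyword ranking with a semantic ranking before the reranker sees the candidates. The sketch below uses reciprocal rank fusion, which is one common choice and not something this page prescribes. The document IDs and the two input rankings are made up, and `k=60` is the conventional constant rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs; each list contributes 1 / (k + rank) per doc."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical first-pass results for one query.
keyword_ranking = ["sku-4471", "d2", "d5"]   # lexical search catches the exact code
semantic_ranking = ["d2", "d9", "sku-4471"]  # dense search catches the paraphrase

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(fused)  # candidates to hand to the reranker
```

Documents that appear in both lists rise to the top, while a document found by only one retriever still survives as a candidate. The fused list is an input to reranking, not a replacement for it.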
How do you measure RAG reliability?
RAG reliability requires two measurement tracks — retrieval quality (did the right passages come back?) and answer-groundedness (is the response supported by what was actually retrieved?) — measured continuously against a held-out evaluation set, not once at launch.
A RAG system has two failure surfaces with two different fixes:
- If retrieval recall is low, the evidence never arrived. No generation improvement will help. Fix chunking, go hybrid, add rerank, raise top-k, fix index freshness.
- If retrieval recall is high but faithfulness is low, the right evidence reached the model and the generation drifted from it. Fix the prompt, address position effects, or add a faithfulness gate. A better embedding model will not help.
A single blended accuracy score hides which surface is broken. The production move is two dashboards, two gates. This is the same measure-don't-vibe-check discipline applied to agent outputs and to AI systems more broadly via the standing eval harness — here, pointed at retrieval.
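The two-surface triage can be written down as a small decision rule. This is a sketch only: the `diagnose` name and both threshold values are hypothetical placeholders, and real floors should come from your own evaluation set.

```python
def diagnose(
    recall_at_k: float,
    faithfulness: float,
    recall_floor: float = 0.85,        # hypothetical threshold
    faithfulness_floor: float = 0.90,  # hypothetical threshold
) -> str:
    """Name the failure surface to investigate first."""
    if recall_at_k < recall_floor:
        # The evidence never arrived; generation fixes will not help.
        return "retrieval"
    if faithfulness < faithfulness_floor:
        # The evidence arrived but the answer drifted from it.
        return "generation"
    return "ok"


print(diagnose(recall_at_k=0.60, faithfulness=0.95))
print(diagnose(recall_at_k=0.95, faithfulness=0.70))
```

Retrieval is checked first on purpose: when recall is low, the faithfulness number is measured against incomplete evidence and says little about the generator.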
Retrieval quality metrics:
- Recall@K: of the passages that contain the answer, how many appeared in the top K retrieved?
- Context precision: of the top K retrieved, how many are actually relevant? Low precision means the context is noisy, which degrades answer quality even when recall is high.
- MRR (Mean Reciprocal Rank): is the first relevant passage near the top of the retrieved set?
- Rerank quality: does re-ranking the first-pass candidates move the most-relevant passage to the top position, measured against the evaluation set?
(Recall@K, MRR, and rerank quality are standard IR metrics; context precision is one of the core metrics in the RAGAS framework, alongside faithfulness, answer relevancy, and context recall. The framework is introduced in Es et al., RAGAS, EACL 2024 / arXiv:2309.15217.)
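As a rough illustration, the first three metrics can be computed from a ranked list of retrieved passage IDs and the set of IDs labeled relevant for that question. The function names and sample IDs are assumptions. The precision function is the plain unweighted form described above; some frameworks compute a rank-weighted variant under the name context precision.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k retrieved."""
    if not relevant:
        return 0.0
    return len(relevant.intersection(retrieved[:k])) / len(relevant)


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top k retrieved passages that are relevant."""
    top = retrieved[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """Average reciprocal rank over (retrieved, relevant) pairs, one per question."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(ret, rel) for ret, rel in runs) / len(runs)


# Hypothetical evaluation rows: (ranked retrieved IDs, labeled relevant IDs).
runs = [
    (["d3", "d1", "d7", "d2"], {"d1", "d2"}),
    (["d9", "d8"], {"d9"}),
]
retrieved, relevant = runs[0]
print(recall_at_k(retrieved, relevant, k=3))     # one of two relevant passages in the top 3
print(precision_at_k(retrieved, relevant, k=3))  # one of three retrieved passages is relevant
print(mean_reciprocal_rank(runs))                # first hits at rank 2 and rank 1
```

Rerank quality can be measured with the same functions by computing them once on the first-pass ranking and once on the reranked list, then comparing.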
Answer-groundedness (faithfulness):
- Does each claim in the generated response trace to a retrieved passage?
- A system grounded by default generates only from what was retrieved, and declines when that retrieved evidence is insufficient.
- Citation integrity: does the cited passage actually contain the claim attributed to it?
Vectara's faithfulness work, benchmarking across leading models, confirms that LLMs still frequently introduce unsupported claims even when provided with relevant context (Tamber et al., Benchmarking LLM Faithfulness in RAG with Evolving Leaderboards, EMNLP Industry 2025 / arXiv:2505.04847). This is the reason faithfulness requires its own metric, measured independently of retrieval recall.
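One way to turn faithfulness into a number is the fraction of an answer's atomic claims that the retrieved passages support. The sketch below assumes the claims have already been extracted and takes the support check as a pluggable function. The `substring_supported` verifier is a deliberately naive stand-in for whatever entailment model or LLM judge you would actually use, and all names and sample text are illustrative.

```python
from typing import Callable

Verifier = Callable[[str, list[str]], bool]


def faithfulness(claims: list[str], passages: list[str], is_supported: Verifier) -> float:
    """Fraction of claims that the verifier finds supported by the retrieved passages."""
    if not claims:
        # Assumption: an answer that makes no claims (e.g. an abstention) counts as faithful.
        return 1.0
    supported = sum(1 for claim in claims if is_supported(claim, passages))
    return supported / len(claims)


def substring_supported(claim: str, passages: list[str]) -> bool:
    """Toy verifier: a claim is 'supported' only if it appears verbatim in some passage."""
    return any(claim.lower() in passage.lower() for passage in passages)


# Hypothetical retrieved evidence and claims decomposed from one generated answer.
passages = ["Refunds are accepted within thirty days of delivery."]
claims = [
    "refunds are accepted within thirty days",  # present in the evidence
    "refunds require a receipt",                # not present in the evidence
]

score = faithfulness(claims, passages, substring_supported)
print(score)
```

Keeping the verifier as a parameter lets the same scoring function be reused when the judge changes, so a score shift can be attributed to the system rather than to the ruler.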
Groundedness as an outcome signal — as distinct from a green pipeline — is the same distinction covered in production AI liveness vs outcome.
How to measure RAG reliability: a working checklist
- Build a held-out question set from real, production-shaped questions — including the messy, the long-tail, and the questions no one tried in the demo — each paired with the passage(s) that should answer it. Freeze and version it. The set is the ruler; if it drifts with the system, the score is meaningless.
- Measure retrieval first, in isolation. Recall@K, context precision, and MRR on the evaluation set. This is the number that catches the hallucination that is really a retrieval-recall failure.
- Measure answer faithfulness separately. Decompose each answer into claims; score the fraction the retrieved evidence supports. Automate this check using a grounding-verification step.
- Measure abstention behavior. On questions whose answer is not in the corpus, does the system decline rather than confabulate? Track the false-answer rate on out-of-corpus questions as its own metric.
- Set thresholds and gate on them in CI. A number with no release gate is a dashboard, not a control. Every change to chunker, embedding model, top-k, reranker, or prompt re-runs retrieval recall, faithfulness, and abstention — and blocks the release on regression. This is the eval-harness mechanics applied to retrieval.
- Re-validate as the corpus changes. Documents are added, edited, deprecated; the retriever's behavior shifts. Re-sample the evaluation set from current traffic, re-check faithfulness against current sources, and re-freeze. Stale or ungoverned source data is a retrieval-reliability failure upstream of the model.
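The gating step in the checklist can be sketched as a function a CI job runs after the evaluation completes. Metric names and threshold values here are hypothetical. Abstention is expressed as the share of out-of-corpus questions the system correctly declined (one minus the false-answer rate), so that every metric is "higher is better".

```python
# Hypothetical release floors; derive real ones from your own baseline runs.
THRESHOLDS = {
    "recall_at_5": 0.85,
    "faithfulness": 0.90,
    "out_of_corpus_abstention": 0.95,
}


def gate(metrics: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return one message per metric that is missing or below its floor."""
    failures = []
    for name, floor in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: not measured (floor {floor:.2f})")
        elif value < floor:
            failures.append(f"{name}: {value:.2f} is below floor {floor:.2f}")
    return failures


# Hypothetical results from one evaluation run.
run_metrics = {"recall_at_5": 0.91, "faithfulness": 0.84, "out_of_corpus_abstention": 0.97}

failures = gate(run_metrics, THRESHOLDS)
for message in failures:
    print("GATE FAILED:", message)
```

In a CI job, a non-empty `failures` list would translate into a non-zero exit code so the release is blocked. Treating a missing metric as a failure keeps a skipped evaluation from passing silently.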
How do you keep a RAG system reliable over time?
The primary threat to a production RAG system is corpus drift — documents are updated, added, and deleted, and retrieval quality measured at launch no longer reflects current behavior — so the discipline is a standing eval harness, not a one-time green build.
A knowledge base that was clean and well-chunked at launch accumulates stale, conflicting, and missing documents. Retrieval quality degrades without any change to the model or retrieval code. Barnett et al., studying RAG failure points across three domains, found that "validation of a RAG system is only feasible during operation" and that "robustness evolves rather than is designed in at the start" (Seven Failure Points When Engineering a Retrieval Augmented Generation System, IEEE/ACM CAIN 2024 / arXiv:2401.05856). The implication is that a launched RAG system requires ongoing calibration — not a one-time assessment.
Re-evaluation is triggered by:
- Bulk document updates or index refreshes
- Shifts in the live query distribution (new user question types that the frozen set no longer represents)
- Time-based schedule regardless of corpus changes
Grounded-by-default as a reliability property. A system that abstains when evidence is absent degrades gracefully — it surfaces the gap for remediation rather than generating confidently from stale context. An abstention is a quality signal, not a failure.
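A minimal sketch of abstention as a designed capability: the system only generates when at least one retrieved passage clears a score threshold. The `Hit` type, the threshold value, and the `generate` callable are assumptions; the score is taken to be a "higher is better" relevance score such as a reranker output.

```python
from dataclasses import dataclass
from typing import Callable

ABSTAIN_MESSAGE = "The documents don't cover this."


@dataclass
class Hit:
    passage: str
    score: float  # assumed relevance score, higher is better


def select_evidence(hits: list[Hit], min_score: float, max_passages: int = 5) -> list[Hit]:
    """Keep only hits at or above the threshold, strongest first."""
    strong = [hit for hit in hits if hit.score >= min_score]
    strong.sort(key=lambda hit: hit.score, reverse=True)
    return strong[:max_passages]


def answer_or_abstain(
    hits: list[Hit],
    min_score: float,
    generate: Callable[[list[str]], str],
) -> str:
    """Generate only from evidence that cleared the threshold; otherwise decline."""
    evidence = select_evidence(hits, min_score)
    if not evidence:
        return ABSTAIN_MESSAGE  # the model is never called without evidence
    return generate([hit.passage for hit in evidence])


# Hypothetical generator stub standing in for the model call.
def echo_generate(passages: list[str]) -> str:
    return f"Answer grounded in {len(passages)} passage(s)."


print(answer_or_abstain([Hit("weak match", 0.21)], min_score=0.50, generate=echo_generate))
print(answer_or_abstain([Hit("strong match", 0.88)], min_score=0.50, generate=echo_generate))
```

The threshold is itself something to tune on the evaluation set: too low and the system answers from thin evidence, too high and it declines questions the corpus does cover. Logging each abstention gives the remediation queue the page describes.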
Access control as a reliability concern. If the permission model changes — users added, documents reclassified — the retrieval layer must reflect those changes. Evaluation should include permission-boundary test cases: can a user retrieve documents they are not entitled to? A retrieval layer that ignores entitlements will, reliably, surface the wrong thing for some users.
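The sketch below shows the ordering that matters: entitlements are applied before ranking and before the top-k cut, so a restricted document can neither be returned nor crowd out permitted ones. The group-based permission model, the `Doc` type, and the keyword-overlap scorer are illustrative assumptions standing in for a real retriever and a real entitlement system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    text: str
    allowed_groups: frozenset[str]


def retrieve(query_terms: set[str], corpus: list[Doc], user_groups: set[str], k: int) -> list[Doc]:
    """Filter by entitlement first, then rank the permitted documents and take the top k."""
    permitted = [doc for doc in corpus if doc.allowed_groups & user_groups]

    def overlap(doc: Doc) -> int:
        # Toy relevance score: number of query terms that appear in the document.
        return len(query_terms & set(doc.text.lower().split()))

    return sorted(permitted, key=overlap, reverse=True)[:k]


# Hypothetical corpus with one restricted and one widely readable document.
corpus = [
    Doc("hr-1", "salary bands for engineering", frozenset({"hr"})),
    Doc("pub-1", "engineering onboarding guide", frozenset({"staff", "hr"})),
]
query = {"salary", "engineering"}

staff_ids = [doc.doc_id for doc in retrieve(query, corpus, {"staff"}, k=2)]
hr_ids = [doc.doc_id for doc in retrieve(query, corpus, {"hr"}, k=2)]
print(staff_ids)  # the restricted document must not appear here
print(hr_ids)
```

A permission-boundary test case is then an ordinary assertion in the evaluation set: for a user outside the entitled group, the restricted document ID must never appear in the results, even when it is the best lexical or semantic match.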
The signal being watched. The metric is not "is the pipeline green" — that is liveness, not outcome. It is: are retrieval recall, context precision, and answer-groundedness above threshold on the current corpus against the current evaluation set? See the standing eval-harness infrastructure and production AI liveness vs outcome for how this posture is applied across the delivery method described in how we work.
RAG in a demo vs RAG in production
The gap between a RAG demo and a production RAG system is not model capability — it is the engineering disciplines that make retrieval measurably reliable at the scale and diversity of real use.
This table is distinct from the RAG vs fine-tuning vs long-context comparison on our overview page. That table is about approach selection; this one is about production readiness.
| Dimension | RAG in a demo | RAG in production |
|---|---|---|
| Retrieval quality | Graded on the handful of queries the team tried | Measured recall and context precision on a held-out evaluation set covering the real query distribution, including the long tail |
| Faithfulness | "Looks grounded" — assessed by eye | Scored: each claim traced to a retrieved passage; citations that survive the click; grounding check automated |
| Abstention | System answers every query, including those beyond the available evidence | System declines when retrieved evidence is absent or below a confidence threshold |
| Corpus drift | Static at demo time; evaluation not revisited | Corpus changes trigger re-evaluation; reliability is a continuously watched signal |
| Access control | Single user or open corpus; permissions not tested | Retrieval respects the permission model; evaluation includes permission-boundary cases |
| Reliability signal | A green pipeline — it ran without error | Retrieval recall, context precision, and faithfulness scores above a defined threshold, measured on schedule |
The last row is the page's positioning in miniature: "ran without error" is liveness; "the retrieval and faithfulness numbers held" is outcome.
The missing layer most pilots don't have
If your "chat with your documents" pilot answers confidently but you cannot put a number on how often it is grounded, that is the missing layer. If a retrieval-based system has stalled before reaching production, that is a signal worth diagnosing before rebuilding.
See how NewGenApps ships and proves RAG in how we work, or if a pilot has already stalled, see our AI rescue work. Book a 30-minute working session to talk through what is missing in your specific setup.
NewGenApps — production AI, proven. Stay a step ahead, always.