
RAG: Chat With Your Documents, With Receipts
In short: Retrieval-augmented generation (RAG) lets a language model answer from your documents instead of its training data — retrieving the relevant passages at query time and answering grounded in that evidence.
Done right, every answer carries a citation back to the source. Done wrong, you get a confident chatbot that quietly makes things up. NewGenApps builds the first kind, on Anthropic's Claude, and proves it in a three-week POC.
Most enterprises don't have an AI problem. They have a "the answer is in one of these 40,000 PDFs and nobody can find it" problem. RAG turns that pile into something you can ask and trust.
What RAG actually is
The model is handed information retrieved from your knowledge source (documents, database, wiki) and answers using only that material. Retrieval happens when the question is asked, not during training.
That one choice makes RAG the default for "chat with your documents": the model stays current, you control what it sees, and you can show the exact passage each claim came from. The alternative — asking a model cold and hoping its training data holds the right answer — is a non-starter for anything proprietary, time-sensitive, or auditable.
RAG vs fine-tuning vs long context
These three get confused constantly, and choosing wrong is the most expensive mistake we see.
| Approach | What it changes | Best when | Watch out for |
|---|---|---|---|
| RAG | What the model knows at query time | Knowledge changes often, must be cited | Retrieval quality is the whole game |
| Fine-tuning | How the model behaves (tone, format) | You need consistent style and the knowledge is stable | A poor way to add fresh facts |
| Long context | How much you paste in at once | Corpus fits the window, rarely changes | Cost scales with every token |
The best systems combine them: RAG for the facts, fine-tuning or a tuned prompt for behavior, long context for generous evidence. Claude's large context window often makes a lightweight "stuff the docs in" approach viable for a focused corpus, and sometimes that beats a heavyweight RAG stack you didn't need.
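A quick way to sanity-check the "stuff the docs in" option is to estimate whether the corpus fits the window at all. The sketch below is a rough heuristic, not a tokenizer: the four-characters-per-token ratio, the 200,000-token window, and the reserved budget for instructions and the answer are all assumptions to replace with the figures for the model you actually use.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. ~4 characters per token is a heuristic for English, not a tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(documents, context_window_tokens=200_000, reserved_tokens=20_000):
    """Return (fits, estimated_total, budget) for pasting every document into one prompt.

    context_window_tokens and reserved_tokens are assumed values; check the
    documentation of the model you deploy.
    """
    budget = context_window_tokens - reserved_tokens
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= budget, total, budget


# Ten short policy documents of about 4,000 characters each.
small_corpus = ["a" * 4000] * 10
# Three large manuals of about 400,000 characters each.
large_corpus = ["a" * 400_000] * 3

small_fits, small_total, budget = fits_in_context(small_corpus)
large_fits, large_total, _ = fits_in_context(large_corpus)
```

If the estimate lands comfortably under the budget and the corpus rarely changes, long context alone may be enough. If it lands near or over the budget, that is a signal to look at retrieval.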
Knowing when not to build RAG is part of the job. Our Claude-native AI practice makes that call with you.
The architecture of a grounded system
A production RAG system is a pipeline, not a single API call:
- Ingestion & parsing. Pull documents from their sources; extract clean text, including scanned PDFs, tables, and slides.
- Chunking. Split into self-contained passages. Chunk boundaries decide whether an answer arrives whole or severed from its context, so chunking drives a large share of answer quality.
- Embedding & indexing. Convert chunks into vectors for semantic search, and build a keyword index alongside them.
- Retrieval. Blend semantic and keyword search (hybrid), then rerank so the strongest evidence rises to the top.
- Generation. Hand the question plus passages to Claude, instructed to answer only from the evidence and cite each claim.
- Citation & guardrails. Return links to sources, and say "I don't know" when the evidence isn't there.
- Evaluation. Measure retrieval accuracy and answer faithfulness continuously — quality degrades silently as documents change.
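The retrieval and generation steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production stack: the keyword scorer is a crude stand-in for BM25, the vectors are toy values rather than real embeddings, and the two rankings are blended with reciprocal rank fusion, which is one common fusion method and only one of several options. A dedicated reranker model would normally run after the fusion step and is left out here. All function names are hypothetical.

```python
import math
import re


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_rank(query, chunks):
    """Rank chunk ids by how many chunk tokens match a query term (crude stand-in for BM25)."""
    terms = set(tokenize(query))
    scores = {cid: sum(t in terms for t in tokenize(text)) for cid, text in chunks.items()}
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_rank(query_vec, chunk_vecs):
    """Rank chunk ids by cosine similarity to the query vector."""
    return sorted(chunk_vecs, key=lambda cid: (-cosine(query_vec, chunk_vecs[cid]), cid))


def reciprocal_rank_fusion(rankings, k=60):
    """Blend several rankings: each list contributes 1 / (k + rank) per chunk id."""
    fused = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            fused[cid] = fused.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=lambda cid: (-fused[cid], cid))


def build_prompt(question, chunks, ranked_ids, top_n=2):
    """Assemble a grounded prompt: evidence first, then the question."""
    evidence = "\n".join(f"[{cid}] {chunks[cid]}" for cid in ranked_ids[:top_n])
    return (
        "Answer using only the passages below and cite the passage id for each claim. "
        "If the passages do not contain the answer, reply \"I don't know\".\n\n"
        f"Passages:\n{evidence}\n\nQuestion: {question}"
    )


# Toy corpus and toy vectors (assumed values, not real embeddings).
chunks = {
    "a": "Invoice for account number 4471 is overdue.",
    "b": "Late fees apply to overdue invoices.",
    "c": "Office holiday schedule.",
}
chunk_vecs = {"a": [0.0, 1.0], "b": [1.0, 0.0], "c": [1.0, 1.0]}

question = "account number 4471"
keyword_order = keyword_rank(question, chunks)
semantic_order = semantic_rank([1.0, 0.0], chunk_vecs)
fused_order = reciprocal_rank_fusion([keyword_order, semantic_order])
prompt = build_prompt(question, chunks, fused_order)
```

The point of the fusion step is that neither ranking wins outright: a chunk that places well in both lists can outrank one that tops only a single list.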
The strongest systems aren't AI magic; they're data engineering. We've built data pipelines and search since before "RAG" had a name — see our big-data foundations.
Most "the AI is hallucinating" complaints are retrieval failures wearing a costume. If the right passage never reaches Claude, no model can answer correctly.
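One way to tell a retrieval failure from a generation failure is to score retrieval on its own, before the model is involved. The sketch below computes recall@k over a small labeled set. The test-case shape, the `retrieve` callable, and the chunk ids are hypothetical, and in practice a person who knows the documents has to label which chunks are relevant.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top k retrieved ids."""
    if not relevant_ids:
        raise ValueError("each test case needs at least one relevant chunk id")
    hits = len(set(retrieved_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate_retrieval(test_cases, retrieve, k=5):
    """Mean recall@k across labeled questions. `retrieve` maps a question to ranked chunk ids."""
    scores = [
        recall_at_k(retrieve(case["question"]), case["relevant_ids"], k)
        for case in test_cases
    ]
    return sum(scores) / len(scores)


# Hypothetical labeled questions and a canned retriever standing in for the real one.
test_cases = [
    {"question": "What is the refund window?", "relevant_ids": {"c1"}},
    {"question": "Who approves travel?", "relevant_ids": {"c2", "c3"}},
]
canned_results = {
    "What is the refund window?": ["c1", "c9"],
    "Who approves travel?": ["c7", "c2"],
}
mean_recall = evaluate_retrieval(test_cases, canned_results.get, k=2)
```

A low score here means the right passage is not reaching the model, and no amount of prompt work will fix that.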
Common failure modes
When a "chat with your documents" project disappoints, it's almost always one of these — and almost never the model:
- Bad retrieval, blamed on the model. The right passage never reaches Claude.
- Chunking that splits the answer. A table severed from its header — indexed, but useless.
- No reranking. Without a rerank pass, vector search can leave the best evidence below the top-k cutoff.
- No "I don't know." A system that always answers will confidently answer what your docs don't cover.
- Citations that don't check out. A citation the user can't click and verify is theater.
- No evaluation harness. Quality erodes invisibly; you find out from an angry user, not a dashboard.
- Ignored access control. Surfacing a passage the asker shouldn't see is a data-leak incident.
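Two of the failure modes above, the missing "I don't know" and ignored access control, can be handled in the same place: the step that decides which passages reach the model. The sketch below is an illustration only. The `Passage` shape, the group-based permission model, and the 0.5 score threshold are assumptions, and a real threshold has to be tuned against an evaluation set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    chunk_id: str
    text: str
    score: float               # retrieval or rerank score, assumed to be in [0, 1]
    allowed_groups: frozenset  # groups permitted to read the source document


DECLINE = "I don't know: the documents I can access don't cover this."


def select_evidence(passages, user_groups, min_score=0.5, top_n=3):
    """Drop passages the asker may not see, then drop weak matches, then keep the best few."""
    visible = [p for p in passages if p.allowed_groups & user_groups]
    strong = [p for p in visible if p.score >= min_score]
    return sorted(strong, key=lambda p: -p.score)[:top_n]


def answer_or_decline(passages, user_groups, generate):
    """Call the model only when permitted, sufficiently strong evidence exists."""
    evidence = select_evidence(passages, user_groups)
    if not evidence:
        return DECLINE
    return generate(evidence)


passages = [
    Passage("hr-12", "Salary bands for 2024.", 0.92, frozenset({"hr"})),
    Passage("kb-03", "How to reset a password.", 0.21, frozenset({"hr", "sales"})),
]


def fake_generate(evidence):
    # Placeholder for the real model call.
    return f"Answer citing [{evidence[0].chunk_id}]"


sales_answer = answer_or_decline(passages, frozenset({"sales"}), fake_generate)
hr_answer = answer_or_decline(passages, frozenset({"hr"}), fake_generate)
```

Filtering on permissions before scoring matters: the sales user never sees the salary passage, and because the only passage left is a weak match, the system declines instead of guessing.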
How NewGenApps ships RAG on Claude
We don't do "plug in a vector DB and see." We ship RAG the way you'd ship anything that faces real users and auditors:
- Grounded by default. Every answer cites its sources and declines when the documents don't cover it. Trust is the product.
- Hybrid retrieval, reranked. Semantic plus keyword with a rerank pass — the exact account number and the fuzzy concept both matter.
- Built on Claude. Its context window passes generous evidence; its instruction-following holds the guardrails. (To be clear: we're an independent firm building toward formal Anthropic credentials, not an official partner.)
- Evaluated, not vibes-checked. A harness scores retrieval accuracy and faithfulness from day one. Quality is a number you watch.
- Access-aware. Retrieval respects your permission model.
- POC to production in weeks. Our Claude POC in 3 Weeks turns one real document-search problem into a working, cited assistant — then we take it to production.
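"Every answer cites its sources" is only worth something if the citations can be checked automatically. As a minimal sketch, assume the model is asked to return each citation as a chunk id plus the exact quote it relied on. That output format is an assumption for illustration, not a description of any specific product. The check then confirms that the quote really appears in the cited chunk.

```python
import re


def normalize(text):
    """Collapse whitespace and lowercase so trivial formatting differences don't fail the check."""
    return re.sub(r"\s+", " ", text).strip().lower()


def verify_citations(citations, chunks):
    """Return (chunk_id, reason) for every citation that does not check out.

    citations: list of (chunk_id, quote) pairs taken from the model's answer.
    chunks: mapping of chunk_id to the source text that was shown to the model.
    """
    problems = []
    for chunk_id, quote in citations:
        source = chunks.get(chunk_id)
        if source is None:
            problems.append((chunk_id, "unknown source"))
        elif normalize(quote) not in normalize(source):
            problems.append((chunk_id, "quote not found in source"))
    return problems


source_chunks = {"policy-7": "Refunds are accepted within 30 days\nof the delivery date."}
good = verify_citations([("policy-7", "within 30 days of the delivery date")], source_chunks)
bad = verify_citations(
    [("policy-7", "within 90 days"), ("policy-99", "no receipt needed")], source_chunks
)
```

An exact-match check like this catches invented quotes and invented sources. It does not catch a real quote used to support the wrong claim, which is what a faithfulness score in the evaluation harness is for.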
This is the same builder-not-pundit posture that's run through our work since 2008: pick the wave early, then actually ship it.
Ready to chat with your documents and trust the answers? Book a 30-minute AI working session — no deck, no pitch, just where RAG would actually pay off.