NewGenApps

Retrieval-augmented generation over enterprise data

RAG: Chat With Your Documents, With Receipts

In short: Retrieval-augmented generation (RAG) lets a language model answer from your documents instead of its training data — retrieving the relevant passages at query time and answering grounded in that evidence.

Done right, every answer carries a citation back to the source. Done wrong, you get a confident chatbot that quietly makes things up. NewGenApps builds the first kind, on Anthropic's Claude, and proves it in a three-week POC.

Most enterprises don't have an AI problem. They have a "the answer is in one of these 40,000 PDFs and nobody can find it" problem. RAG turns that pile into something you can ask and trust.

What RAG actually is

The model is handed information retrieved from your knowledge source (documents, a database, a wiki) and answers using only that material. Retrieval happens at the moment of the question, not during training.

That one choice makes RAG the default for "chat with your documents": the model stays current, you control what it sees, and you can show the exact passage each claim came from. The alternative — asking a model cold and hoping its training data holds the right answer — is a non-starter for anything proprietary, time-sensitive, or auditable.
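To make "show the exact passage each claim came from" concrete, here is a minimal sketch of one way to do it: number the retrieved passages in the prompt, ask for `[n]` citations, and map those markers back to sources afterwards. The function names, the passage dictionary shape (`source`, `text`), and the prompt wording are all illustrative assumptions, not a description of any particular production system or of Claude's API.

```python
import re


def build_grounded_prompt(question, passages):
    """Number each retrieved passage so the model can cite it as [n].

    `passages` is assumed to be a list of dicts with "source" and "text" keys.
    """
    evidence = "\n\n".join(
        f"[{i}] (source: {p['source']})\n{p['text']}"
        for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered passages below. "
        "Cite each claim as [n]. If the passages do not contain the answer, "
        "reply exactly: I don't know.\n\n"
        f"Passages:\n{evidence}\n\nQuestion: {question}"
    )


def cited_sources(answer, passages):
    """Map [n] markers in an answer back to source names.

    Out-of-range numbers are ignored; each source is listed once,
    in order of first citation.
    """
    sources = []
    for n in (int(m) for m in re.findall(r"\[(\d+)\]", answer)):
        if 1 <= n <= len(passages):
            src = passages[n - 1]["source"]
            if src not in sources:
                sources.append(src)
    return sources
```

The prompt string would be sent to the model by whatever client you use; `cited_sources` then turns the reply into links you can show next to the answer. A citation number that points at no passage is silently dropped here, though in practice you may prefer to flag it.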

RAG vs fine-tuning vs long context

These three get confused constantly, and choosing wrong is the most expensive mistake we see.

| Approach | What it changes | Best when | Watch out for |
| --- | --- | --- | --- |
| RAG | What the model knows at query time | Knowledge changes often, must be cited | Retrieval quality is the whole game |
| Fine-tuning | How the model behaves (tone, format) | You need consistent style, stable knowledge | Adds no fresh facts |
| Long context | How much you paste in at once | Corpus fits the window, rarely changes | Cost scales with every token |

The best systems combine them: RAG for the facts, a tuned prompt for behavior, long context for generous evidence. Claude's large context window often makes a lightweight "stuff the docs in" approach viable for a focused corpus — sometimes that beats a heavyweight RAG stack you didn't need.
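A quick back-of-the-envelope check helps decide whether "stuff the docs in" is even on the table. The sketch below is an assumption-heavy estimate: the four-characters-per-token ratio is a rough heuristic for English text, not a real tokenizer, and `window_tokens` and `reserve_tokens` are hypothetical parameters you would set from your model's documented limits and your expected question and answer sizes.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate; a heuristic, not a real tokenizer."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(documents, window_tokens, reserve_tokens=8000):
    """Return (fits, total_tokens) for a corpus pasted in whole.

    `reserve_tokens` leaves headroom for instructions, the question,
    and the model's answer.
    """
    total = sum(estimate_tokens(d) for d in documents)
    return total <= window_tokens - reserve_tokens, total
```

If the corpus fits with room to spare and rarely changes, long context may be enough. Because cost scales with every token, the same total also gives a first estimate of what each question would cost to send.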

Knowing when not to build RAG is part of the job. Our Claude-native AI practice makes that call with you.

The architecture of a grounded system

A production RAG system is a pipeline, not a single API call:

  1. Ingestion & parsing. Pull documents from their sources; extract clean text, including scanned PDFs, tables, and slides.
  2. Chunking. Split into self-contained passages. Chunk boundaries have an outsized effect on answer quality: a passage cut mid-thought cannot be retrieved whole.
  3. Embedding & indexing. Convert chunks into vectors for fast semantic search, alongside keyword search.
  4. Retrieval. Blend semantic and keyword search (hybrid), then rerank so the strongest evidence rises to the top.
  5. Generation. Hand the question plus passages to Claude, instructed to answer only from the evidence and cite each claim.
  6. Citation & guardrails. Return links to sources, and say "I don't know" when the evidence isn't there.
  7. Evaluation. Measure retrieval accuracy and answer faithfulness continuously — quality degrades silently as documents change.
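Steps 2 through 4 above can be sketched in a few dozen lines. This is a toy illustration under stated assumptions, not a production pipeline: chunking is a plain overlapping word window, the "semantic" side uses character-trigram cosine similarity as a stand-in for a real embedding model, and the two rankings are blended with reciprocal rank fusion (one common fusion method, chosen here for simplicity; `k=60` is a conventional default). A dedicated reranker would normally follow and is omitted. All function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text, max_words=60, overlap=15):
    """Split text into overlapping word windows (requires overlap < max_words)."""
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_rank(query, chunks):
    """Rank chunk indices by how many of their words appear in the query."""
    q = set(tokens(query))
    scores = [sum(1 for t in tokens(c) if t in q) for c in chunks]
    return sorted(range(len(chunks)), key=lambda i: -scores[i])


def trigram_vector(text):
    """Character-trigram counts: a stand-in for a real embedding (assumption)."""
    s = " ".join(tokens(text))
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def vector_rank(query, chunks):
    """Rank chunk indices by similarity to the query vector."""
    qv = trigram_vector(query)
    scores = [cosine(qv, trigram_vector(c)) for c in chunks]
    return sorted(range(len(chunks)), key=lambda i: -scores[i])


def rrf(rankings, k=60):
    """Reciprocal rank fusion: sum 1 / (k + rank) across the rankings."""
    fused = Counter()
    for ranking in rankings:
        for rank, idx in enumerate(ranking, start=1):
            fused[idx] += 1.0 / (k + rank)
    return [idx for idx, _ in sorted(fused.items(), key=lambda kv: -kv[1])]


def retrieve(query, chunks, top_k=3):
    """Hybrid retrieval: fuse the keyword and vector rankings, keep the top_k."""
    order = rrf([keyword_rank(query, chunks), vector_rank(query, chunks)])
    return [chunks[i] for i in order[:top_k]]
```

For example, `retrieve("How many days of annual leave do employees get?", passages, top_k=1)` would return whichever passage ranks best across both lists. Fusing by rank rather than by raw score sidesteps the problem that keyword and vector scores live on different scales.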

The strongest systems aren't AI magic; they're data engineering. We've built data pipelines and search since before "RAG" had a name — see our big-data foundations.

Most "the AI is hallucinating" complaints are retrieval failures wearing a costume. If the right passage never reaches Claude, no model can answer correctly.
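One way to test that claim on your own system is to score retrieval separately from generation. The sketch below assumes a small hand-labeled evaluation set where each question lists the ids of passages known to contain the answer. The field names (`question`, `relevant_ids`), the `retrieve_ids` callable, and the failure labels are illustrative assumptions.

```python
def hit_rate_at_k(eval_set, retrieve_ids, k=5):
    """Fraction of questions with at least one relevant passage in the top k.

    `retrieve_ids(question)` is assumed to return passage ids, best first.
    """
    if not eval_set:
        return 0.0
    hits = 0
    for item in eval_set:
        retrieved = retrieve_ids(item["question"])[:k]
        if any(pid in item["relevant_ids"] for pid in retrieved):
            hits += 1
    return hits / len(eval_set)


def classify_failure(retrieved_ids, relevant_ids, answer_correct):
    """Label a single outcome so wrong answers can be triaged."""
    if answer_correct:
        return "ok"
    if not set(retrieved_ids) & set(relevant_ids):
        return "retrieval_failure"  # the evidence never reached the model
    return "generation_failure"  # the evidence was there; the answer was still wrong
```

Tallying wrong answers by label shows where to spend effort: a pile of `retrieval_failure` cases points at parsing, chunking, or search, not at the model.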

Common failure modes

When a "chat with your documents" project disappoints, it's almost always one of these — and almost never the model:


How NewGenApps ships RAG on Claude

We don't do "plug in a vector DB and see." We ship RAG the way you'd ship anything that faces real users and auditors:

This is the same builder-not-pundit posture that's run through our work since 2008: pick the wave early, then actually ship it.

Ready to chat with your documents and trust the answers? Book a 30-minute AI working session — no deck, no pitch, just where RAG would actually pay off.
