Retrieval-augmented generation over enterprise data

RAG: Chat With Your Documents, With Receipts

In short: Retrieval-augmented generation (RAG) lets a language model answer from your documents instead of its training data — retrieving the relevant passages at query time and answering grounded in that evidence.

Done right, every answer carries a citation back to the source. Done wrong, you get a confident chatbot that quietly makes things up. NewGenApps builds the first kind and proves it in a three-week POC — we don't just say it, we show it.

Most enterprises don't have an AI problem. They have a "the answer is in one of these 40,000 PDFs and nobody can find it" problem. RAG turns that pile into something you can ask and trust.

What RAG actually is

The model is handed information retrieved from your knowledge source — documents, database, wiki — and answers using only that. Retrieval happens at the moment of the question, not during training.

That one choice makes RAG the default for "chat with your documents": the model stays current, you control what it sees, and you can show the exact passage each claim came from. The alternative — asking a model cold and hoping its training data holds the right answer — is a non-starter for anything proprietary, time-sensitive, or auditable.
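
To make "answers using only that" concrete, here is a minimal sketch of the prompt-assembly step in Python. The `Passage` type, the `build_grounded_prompt` function, the file names, and the instruction wording are all hypothetical and chosen for illustration. Retrieval is assumed to have already happened, and the call to an actual model is left out.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # hypothetical identifier, e.g. a file name plus page
    text: str


def build_grounded_prompt(question: str, passages: list[Passage]) -> str:
    """Assemble a prompt that restricts the model to the retrieved evidence."""
    # Number each passage so the model can cite it as [1], [2], ...
    evidence = "\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer using only the numbered passages below. "
        "Cite the passage number after each claim, like [1]. "
        "If the passages do not contain the answer, reply: I don't know.\n\n"
        f"Passages:\n{evidence}\n\n"
        f"Question: {question}"
    )


# Illustrative passages; in a real system these come from the retriever.
passages = [
    Passage("travel-policy.pdf p.4", "Economy class is required for flights under six hours."),
    Passage("travel-policy.pdf p.7", "Hotel bookings above 250 USD per night need manager approval."),
]
question = "Can I fly business class to a 3-hour meeting?"
prompt = build_grounded_prompt(question, passages)
print(prompt)
```

Because each passage carries its source label into the prompt, a citation like [1] in the answer can be mapped back to a document and page the user can open.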

RAG vs fine-tuning vs long context

These three get confused constantly, and choosing wrong is the most expensive mistake we see.

Approach	What it changes	Best when	Watch out for
RAG	What the model knows at query time	Knowledge changes often, must be cited	Retrieval quality is the whole game
Fine-tuning	How the model behaves — tone, format	You need consistent style, stable knowledge	Adds no fresh facts
Long context	How much you paste in at once	Corpus fits the window, rarely changes	Cost scales with every token

The best systems combine them: RAG for the facts, fine-tuning or a tuned prompt for behavior, long context for generous evidence. A large context window often makes a lightweight "stuff the docs in" approach viable for a focused corpus, and sometimes that beats a heavyweight RAG stack you didn't need.
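
A back-of-the-envelope check can help with the "does the corpus fit the window" question. The Python sketch below is illustrative only. The four-characters-per-token ratio is a rough rule of thumb for English text, not a property of any particular model, and the window size, reserved budget, and corpus sizes are hypothetical numbers.

```python
def estimated_tokens(corpus_chars: int, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; chars_per_token is an assumed heuristic."""
    return round(corpus_chars / chars_per_token)


def fits_in_context(
    corpus_chars: int,
    context_window_tokens: int,
    reserved_tokens: int = 4_000,  # assumed room for instructions, question, answer
    chars_per_token: float = 4.0,
) -> bool:
    """True if the whole corpus plausibly fits in a single prompt."""
    budget = context_window_tokens - reserved_tokens
    return estimated_tokens(corpus_chars, chars_per_token) <= budget


window = 200_000                  # hypothetical context window, in tokens
handbook_chars = 600_000          # one focused corpus, e.g. a policy handbook
archive_chars = 40_000 * 20_000   # 40,000 PDFs at an assumed 20,000 characters each

print(fits_in_context(handbook_chars, window))   # True: about 150,000 tokens
print(fits_in_context(archive_chars, window))    # False: about 200 million tokens

# Long context resends the corpus on every question, so input volume scales linearly.
print(estimated_tokens(handbook_chars) * 1_000)  # input tokens for 1,000 questions
```

Under these assumptions the handbook fits in one prompt, but every question pays for roughly 150,000 input tokens. The full archive is far beyond any window, which is where retrieval becomes necessary.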

Knowing when not to build RAG is part of the job. See how we build — we make that call with you.

The architecture of a grounded system

A production RAG system is a pipeline, not a single API call:

Ingestion & parsing. Pull documents from their sources; extract clean text, including scanned PDFs, tables, and slides.
Chunking. Split into self-contained passages. Chunking quietly determines half your answer quality.
Embedding & indexing. Convert chunks into vectors for fast semantic search, and build a keyword index alongside them.
Retrieval. Blend semantic and keyword search (hybrid), then rerank so the strongest evidence rises to the top.
Generation. Hand the question plus passages to the generation model, instructed to answer only from the evidence and cite each claim.
Citation & guardrails. Return links to sources, and say "I don't know" when the evidence isn't there.
Evaluation. Measure retrieval accuracy and answer faithfulness continuously — quality degrades silently as documents change.
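
As an illustration of the hybrid retrieval step, the sketch below blends a semantic ranking and a keyword ranking with reciprocal rank fusion (RRF). RRF is one common fusion method, named here as an example and not as the method this pipeline necessarily uses. The chunk IDs and the two input rankings are made up, and the later rerank pass (typically a separate scoring model) is not shown.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one fused ranking.

    Each chunk scores 1 / (k + rank) per list it appears in; k=60 is a
    commonly used default that dampens the influence of top ranks.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID for stable output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from the two searches, best match first.
semantic = ["c3", "c1", "c7"]
keyword = ["c1", "c9", "c3"]

fused = reciprocal_rank_fusion([semantic, keyword])
print(fused)  # ['c1', 'c3', 'c9', 'c7']
```

Chunks that both searches agree on (c1 and c3) rise above chunks found by only one of them, which is the behavior you want when an exact account number and a fuzzy concept both matter.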

The strongest systems aren't AI magic; they're data engineering. We've built data pipelines and search since before "RAG" had a name — see our big-data foundations.

Most "the AI is hallucinating" complaints are retrieval failures wearing a costume. If the right passage never reaches the generation model, no model can answer correctly.
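
One way to tell a retrieval failure from a generation failure is to score retrieval on its own. The sketch below computes recall@k over a small labeled set. The questions, chunk IDs, and relevance labels are hypothetical, and a real harness would also score answer faithfulness, which is not shown here.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top k results."""
    if not relevant:
        raise ValueError("each evaluation case needs at least one relevant chunk")
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


# Hypothetical evaluation cases: what the retriever returned versus
# the chunks a human marked as containing the answer.
eval_set = [
    {"question": "What is the hotel approval limit?",
     "retrieved": ["c1", "c4", "c2"], "relevant": {"c1"}},
    {"question": "Who signs off on vendor contracts?",
     "retrieved": ["c8", "c5", "c6"], "relevant": {"c9"}},
    {"question": "What are the parental leave terms?",
     "retrieved": ["c2", "c3", "c7"], "relevant": {"c3", "c5"}},
]

scores = [recall_at_k(case["retrieved"], case["relevant"], k=3) for case in eval_set]
mean_recall = sum(scores) / len(scores)
print(scores)       # [1.0, 0.0, 0.5]
print(mean_recall)  # 0.5
```

In the second case the relevant chunk never reached the top three, so no generation model could have answered correctly from the evidence it was given.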

Common failure modes

When a "chat with your documents" project disappoints, it's almost always one of these — and almost never the model:

Bad retrieval, blamed on the model. The right passage never reaches the generation model.
Chunking that splits the answer. A table severed from its header — indexed, but useless.
No reranking. Without a rerank pass, vector search alone can leave the best evidence below the top-k cutoff.
No "I don't know." A system that always answers will confidently answer what your docs don't cover.
Citations that don't check out. A citation the user can't click and verify is theater.
No evaluation harness. Quality erodes invisibly; you find out from an angry user, not a dashboard.
Ignored access control. Surfacing a passage the asker shouldn't see is a data-leak incident.
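
Two of the failure modes above, the missing "I don't know" and ignored access control, can be guarded against before generation ever runs. The sketch below is a minimal illustration under stated assumptions: the `ScoredChunk` type, the group names, and the 0.5 score threshold are hypothetical, and a real system would take permissions from your identity provider and tune the threshold against evaluation data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredChunk:
    chunk_id: str
    text: str
    score: float                    # retriever relevance score, higher is better
    allowed_groups: frozenset[str]  # groups permitted to read the source document


def select_evidence(
    candidates: list[ScoredChunk],
    user_groups: set[str],
    min_score: float = 0.5,  # assumed cutoff; tune against your own data
    top_k: int = 3,
) -> list[ScoredChunk]:
    """Keep only chunks the asker may see and that clear the relevance bar."""
    visible = [c for c in candidates if c.allowed_groups & user_groups]
    strong = [c for c in visible if c.score >= min_score]
    return sorted(strong, key=lambda c: c.score, reverse=True)[:top_k]


def answer_or_decline(evidence: list[ScoredChunk]) -> str:
    """Decline when nothing usable survived filtering; otherwise list the evidence."""
    if not evidence:
        return "I don't know: the documents available to you do not cover this."
    return "Evidence to pass to the model: " + ", ".join(c.chunk_id for c in evidence)


candidates = [
    ScoredChunk("hr-12", "Salary bands for 2024 ...", 0.9, frozenset({"hr"})),
    ScoredChunk("hb-03", "Office opening hours ...", 0.3, frozenset({"all-staff"})),
]

print(answer_or_decline(select_evidence(candidates, {"all-staff"})))  # declines
print(answer_or_decline(select_evidence(candidates, {"hr"})))         # uses hr-12
```

Filtering happens before the prompt is built, so a restricted passage never reaches the model, and the decline path is triggered by missing evidence and not by the model's own judgment.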

Book a 30-minute working session →

How NewGenApps ships RAG

We don't do "plug in a vector DB and see." We ship RAG the way you'd ship anything that faces real users and auditors:

Grounded by default. Every answer cites its sources and declines when the documents don't cover it. Trust is the product.
Hybrid retrieval, reranked. Semantic plus keyword with a rerank pass — the exact account number and the fuzzy concept both matter.
Built with Claude Code and our skills library. That's our actual engineering depth — the agent-driven workflow that lets a small senior team ship and verify a grounded system fast. The generation model stays swappable: we build on Amazon Bedrock, so a large-context, instruction-following model plugs in without a rewrite when the corpus calls for it. (To be clear: we're an independent firm building toward formal Anthropic credentials, not an official partner.)
Evaluated, not vibes-checked. A harness scores retrieval accuracy and faithfulness from day one. Quality is a number you watch.
Access-aware. Retrieval respects your permission model.
POC to production in weeks. Our POC in 3 Weeks turns one real document-search problem into a working, cited assistant — then we take it to production.

This is the same builder-not-pundit posture that's run through our work since 2008: pick the wave early, then actually ship it.

Ready to chat with your documents and trust the answers? Book a 30-minute AI working session — no deck, no pitch, just where RAG would actually pay off.
