theLLMs

Last checked: 2026-05-24

Scope: Global. RAG architecture guidance, provider embedding/retrieval docs, and eval framework documentation checked on 2026-05-24. Specific tool choices and provider APIs evolve rapidly.

AI draft model: deepseek-v4-flash

AI review model: llm-editor (deepseek-v4-pro)

Building a minimum viable RAG system without overengineering

Retrieval-augmented generation (RAG) is the most common way to make an LLM useful on your own data. It is also the most common way to accidentally build a slow, expensive, hard-to-debug system that returns worse results than just prompting the model.

The gap is not in the concept. The gap is that most RAG tutorials skip the boring parts: what happens when retrieval returns nothing, when the context window overflows, when the chunking strategy hides the answer across two chunks, or when the model cites a source that does not say what the model claims it says.

Editor’s Note: A working RAG prototype takes an afternoon. A production RAG system that actually beats zero-shot prompting on accuracy takes weeks of measurement and iteration. Do not confuse the first with the second.

Editor’s Note: The most common RAG failure is not bad retrieval; it is that the answer does not need retrieval at all. Always measure whether RAG improves accuracy over a well-prompted baseline before committing to the architecture.

Quick answer

Build RAG in six stages. Do not move to the next stage until the current one meets a measurable target, and do not ship until the whole pipeline beats a no-RAG baseline:

  1. Ingest — chunk your documents with a repeatable strategy
  2. Retrieve — return the most relevant chunks for a query
  3. Answer — generate a response grounded in those chunks
  4. Cite — show which chunk supports which claim
  5. Evaluate — measure whether retrieval+answer beats the baseline
  6. Monitor — track retrieval precision, citation accuracy, and user feedback in production

Most teams skip stages 5 and 6. That is why most RAG systems stay at demo quality.

What the tutorials skip

Chunking strategy is not trivial. Fixed-size chunks with overlap work for simple documents but fail for structured content, tables, code blocks, or documents where context spans chunk boundaries. Semantic chunking (splitting at natural boundaries) improves retrieval accuracy but adds complexity. The right choice depends on your document types, not on what is fashionable.
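As a rough illustration of splitting at natural boundaries, the sketch below packs whole paragraphs into chunks instead of cutting at a fixed size. The function name `split_on_paragraphs`, the blank-line paragraph rule, and the `max_words` limit are assumptions made for this example; real documents usually need a structure-aware parser.

```python
def split_on_paragraphs(text, max_words=200):
    """Pack whole paragraphs into chunks of at most max_words words.

    Assumptions: paragraphs are separated by blank lines, and word count
    stands in for token count. A single paragraph longer than max_words
    is kept whole rather than split mid-paragraph.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for paragraph in paragraphs:
        words = len(paragraph.split())
        # Close the current chunk if this paragraph would overflow it.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(paragraph)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

The trade-off is visible in the code: chunks vary in size, and one very long paragraph can still exceed the limit, which is part of the added complexity mentioned above.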

Retrieval quality is the bottleneck. A perfect LLM cannot fix bad retrieval. If the top-3 chunks do not contain the answer, the model will either hallucinate or say “I don’t know”. Measure retrieval recall (does the answer appear in the top-k chunks?) before adding a second retrieval stage, reranker, or hybrid search.

Chunk stuffing is real. If you retrieve too many chunks, the context window fills with irrelevant text and answer quality drops. If you retrieve too few, the answer may be incomplete. The sweet spot depends on chunk size, document complexity, and model context length — and it changes when you change any of those variables.
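One way to keep chunk stuffing in check is to select chunks against an explicit token budget rather than a fixed count. The sketch below is a minimal example under stated assumptions: `pack_chunks` is a hypothetical helper, chunks arrive ranked best-first, and a whitespace word count stands in for a real tokenizer.

```python
def pack_chunks(ranked_chunks, max_context_tokens, reserved_tokens=0, count_tokens=None):
    """Keep the highest-ranked chunks that fit within the token budget.

    reserved_tokens covers the instructions, the question, and the answer.
    """
    if count_tokens is None:
        # Assumption: word count is a rough stand-in for a real tokenizer.
        count_tokens = lambda text: len(text.split())
    budget = max_context_tokens - reserved_tokens
    selected, used = [], 0
    for chunk in ranked_chunks:  # assumed sorted best-first
        cost = count_tokens(chunk)
        if used + cost > budget:
            break  # stop at the first chunk that does not fit
        selected.append(chunk)
        used += cost
    return selected, used
```

Because the budget depends on chunk size and model context length, it needs re-tuning whenever either changes.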

Citation granularity matters. Citing the right chunk is not the same as citing the right sentence within the chunk. A chunk might contain the relevant fact alongside irrelevant or contradictory information. Citation accuracy (does the cited text actually support the claim?) is a separate metric from retrieval accuracy.

Where teams misuse RAG

RAG for everything. Many teams reach for RAG when a simpler solution would work: a well-structured system prompt with relevant context, a fine-tuned model, or a traditional search index. RAG adds latency, cost, and failure modes that are only justified when the knowledge base is too large or too dynamic to include in the prompt.

RAG without evaluation. Building a RAG pipeline without a test set is building blind. You need a set of query-answer pairs with known correct chunks to measure retrieval recall, answer accuracy, and citation precision. Without this, you cannot tell whether changes improve or degrade the system.

Production without monitoring. RAG systems drift: documents get added, removed, or updated; embedding models get deprecated; user queries change. Without production monitoring of retrieval quality and answer accuracy, you will not notice when the system degrades.
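A minimal sketch of drift detection is a rolling success rate with an alert threshold. Everything here is an assumption for illustration: the `RollingRate` class, the default window and threshold, and the idea that each production request yields a success signal (for example a thumbs-up or a sampled human judgment).

```python
from collections import deque


class RollingRate:
    """Track the success rate over the most recent outcomes."""

    def __init__(self, window=200, alert_below=0.7, min_samples=20):
        self.outcomes = deque(maxlen=window)  # old outcomes fall off automatically
        self.alert_below = alert_below
        self.min_samples = min_samples

    def record(self, success):
        self.outcomes.append(bool(success))

    def rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        # Avoid alerting on a handful of samples.
        if len(self.outcomes) < self.min_samples:
            return False
        return self.rate() < self.alert_below
```

One instance per metric (retrieval hits, answer ratings, citation checks) is usually enough to notice degradation after a document or embedding change.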

Practical build stages

Stage 1 — Ingest and chunk

Start with a simple strategy: split documents into chunks of 500–1000 tokens with 100–200 token overlap. Use a document parser that preserves structure (headings, lists, tables) where possible. Store each chunk with metadata: source document, section heading, chunk index, and character offset.
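The fixed-size strategy with overlap and metadata could be sketched as follows. This is an illustration, not a reference implementation: the `Chunk` dataclass and `chunk_text` function are hypothetical names, and whitespace-delimited words stand in for tokens. Swap in your model's tokenizer for real counts.

```python
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str       # source document
    heading: str      # section heading the text came from
    index: int        # chunk index within the document
    char_offset: int  # character offset of the chunk start
    text: str


def chunk_text(text, source, heading="", size=500, overlap=100):
    """Split text into windows of `size` words that share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    # Assumption: whitespace-delimited words approximate tokens.
    words = [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]
    chunks = []
    step = size - overlap
    for index, start in enumerate(range(0, len(words), step)):
        window = words[start:start + size]
        first_offset = window[0][1]
        last_word, last_offset = window[-1]
        end = last_offset + len(last_word)
        chunks.append(Chunk(source, heading, index, first_offset, text[first_offset:end]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

Keeping the character offset makes it possible to map a retrieved chunk back to its exact position in the source document when checking citations.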

Test on at least 10 representative documents before deciding on a chunking strategy. Measure what percentage of queries with known answers retrieve the correct chunk in the top 3.

Stage 2 — Retrieve

Choose an embedding model and vector store. Start with a hosted embedding model from a provider (text-embedding-3-small, voyage-2, or similar) and a simple cosine-similarity vector search. Add hybrid search (BM25 + vector) only if pure vector search fails on exact-match queries.
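For small collections, cosine-similarity search needs no vector store at all. The sketch below is a brute-force version in plain Python, assuming the embeddings have already been computed by whichever provider you chose; `top_k` and the `(chunk_id, vector)` index layout are assumptions for this example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def top_k(query_vector, index, k=3):
    """Return the k most similar (chunk_id, score) pairs.

    index is a list of (chunk_id, vector) pairs; this scans every entry,
    which is fine for a prototype but not for large collections.
    """
    scored = [(cosine_similarity(query_vector, vec), chunk_id) for chunk_id, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(chunk_id, score) for score, chunk_id in scored[:k]]
```

A brute-force scan like this also makes a useful correctness baseline when you later move to an approximate index.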

Measure top-1, top-3, and top-5 retrieval recall on your test set. If recall is below 70% on top-3, fix chunking or retrieval before moving to the answer stage.
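Retrieval recall as used here (does a correct chunk appear in the top k?) can be computed with a few lines. The function name and the dictionary layout below are assumptions for illustration.

```python
def recall_at_k(results, relevant, k):
    """Fraction of queries with at least one correct chunk in the top k.

    results:  {query: ranked list of retrieved chunk ids}
    relevant: {query: set of chunk ids known to contain the answer}
    """
    if not relevant:
        return 0.0
    hits = sum(
        1
        for query, gold in relevant.items()
        if gold & set(results.get(query, [])[:k])
    )
    return hits / len(relevant)
```

Running it for k = 1, 3, and 5 on the same test set shows whether the right chunk is being found but ranked too low, or not found at all.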

Stage 3 — Answer and cite

Construct the prompt to include the retrieved chunks and ask the model to cite which chunk supports each claim. A simple format works: “Based on the following sources, answer the question. For each claim, cite the source number in brackets.”
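A prompt in that format might be assembled like this. The `build_prompt` helper and the exact layout are assumptions; the instruction sentence is the one quoted above. Raising on empty retrieval is a design choice for this sketch, so the caller decides how to handle the "retrieval returned nothing" case.

```python
def build_prompt(question, chunks):
    """Number the retrieved chunks and ask for bracketed citations."""
    if not chunks:
        # Assumption: the caller handles empty retrieval (fallback or refusal).
        raise ValueError("no chunks retrieved")
    sources = "\n\n".join(f"[{i}] {text}" for i, text in enumerate(chunks, start=1))
    return (
        "Based on the following sources, answer the question. "
        "For each claim, cite the source number in brackets.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```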

Measure answer accuracy (does the response contain correct claims?) and citation precision (does the cited source actually support the claim?). Both should exceed 80% before considering production.
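Citation precision can be estimated by pulling the bracketed source numbers out of each sentence and checking them one by one. In the sketch below, `citation_precision` is a hypothetical helper and `supports` is a judgment function you supply (human rating or LLM-as-judge); the sentence splitting is deliberately naive.

```python
import re


def citation_precision(answer, sources, supports):
    """Fraction of citations whose source actually supports the sentence.

    supports(sentence, source_text) -> bool is supplied by the caller.
    Citations that point at a non-existent source count as wrong.
    """
    checked = correct = 0
    # Naive split: a sentence ends at ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        for number in re.findall(r"\[(\d+)\]", sentence):
            checked += 1
            idx = int(number) - 1  # citations are 1-based
            if 0 <= idx < len(sources) and supports(sentence, sources[idx]):
                correct += 1
    return correct / checked if checked else 0.0
```

An answer with no citations scores 0.0 here, which treats uncited answers as failures; adjust that if uncited answers are acceptable in your setting.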

Stage 4 — Evaluate and iterate

Build a regression test set of 50–100 query-answer-source triples. Run it after every change to chunking, embedding, retrieval, prompt, or model. Track:

  • Retrieval recall at top-3
  • Answer accuracy (human-rated or LLM-as-judge)
  • Citation precision
  • End-to-end latency

Decision framework

Question | If yes | If no
Is the knowledge base small enough to fit in a prompt? | Skip RAG, use system prompt | Build RAG
Do you have 50+ test query-answer-source triples? | Proceed to evaluate | Build test set first
Does top-3 retrieval recall exceed 70%? | Proceed to answer stage | Fix chunking or embedding
Does answer accuracy exceed 80%? | Consider production | Iterate on prompt or retrieval
Do you have production monitoring planned? | Go live | Plan monitoring first
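The same checks can be written as a short function that returns the first unmet gate. This is only a sketch of the framework above: the function name and parameter names are assumptions, and the thresholds are the ones stated in this guide.

```python
def next_step(fits_in_prompt, test_triples, recall_at_3, answer_accuracy, monitoring_planned):
    """Walk the decision framework top to bottom and return the next action."""
    if fits_in_prompt:
        return "Skip RAG, use system prompt"
    if test_triples < 50:
        return "Build test set first"
    if recall_at_3 <= 0.70:  # the framework asks for recall to exceed 70%
        return "Fix chunking or embedding"
    if answer_accuracy <= 0.80:  # and accuracy to exceed 80%
        return "Iterate on prompt or retrieval"
    if not monitoring_planned:
        return "Plan monitoring first"
    return "Go live"
```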

Methodology and sources

This guide draws on published RAG evaluation frameworks (RAGAS, ARES, RGB), operational guidance from teams running production RAG systems, provider embedding and retrieval documentation, and academic literature on chunking and retrieval strategies.

Change log

2026-05-24 — First published version.

Source list