Building a minimum viable RAG system without overengineering
Retrieval-augmented generation (RAG) is the most common way to make an LLM useful on your own data. It is also the most common way to accidentally build a slow, expensive, hard-to-debug system that returns worse results than just prompting the model.
The gap is not in the concept. The gap is that most RAG tutorials skip the boring parts: what happens when retrieval returns nothing, when the context window overflows, when the chunking strategy hides the answer across two chunks, or when the model cites a source that does not say what the model claims it says.
Editor’s Note: A working RAG prototype takes an afternoon. A production RAG system that actually beats zero-shot prompting on accuracy takes weeks of measurement and iteration. Do not confuse the first with the second.
Editor’s Note: The most common RAG failure is not bad retrieval; it is that the answer does not need retrieval at all. Always measure whether RAG improves accuracy over a well-prompted baseline before committing to the architecture.
Quick answer
Build RAG in six stages, and do not move to the next until the current one produces measurable improvement over a no-RAG baseline:
1. Ingest: chunk your documents with a repeatable strategy
2. Retrieve: return the most relevant chunks for a query
3. Answer: generate a response grounded in those chunks
4. Cite: show which chunk supports which claim
5. Evaluate: measure whether retrieval+answer beats the baseline
6. Monitor: track retrieval precision, citation accuracy, and user feedback in production
Most teams skip steps 5 and 6. That is why most RAG systems are demo-quality.
What the tutorials skip
Chunking strategy is not trivial. Fixed-size chunks with overlap work for simple documents but fail for structured content, tables, code blocks, or documents where context spans chunk boundaries. Semantic chunking (splitting at natural boundaries) improves retrieval accuracy but adds complexity. The right choice depends on your document types, not on what is fashionable.
Retrieval quality is the bottleneck. A perfect LLM cannot fix bad retrieval. If the top-3 chunks do not contain the answer, the model will either hallucinate or say “I don’t know”. Measure retrieval recall (does the answer appear in the top-k chunks?) before adding a second retrieval stage, reranker, or hybrid search.
Chunk stuffing is real. If you retrieve too many chunks, the context window fills with irrelevant text and answer quality drops. If you retrieve too few, the answer may be incomplete. The sweet spot depends on chunk size, document complexity, and model context length — and it changes when you change any of those variables.
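One way to keep chunk stuffing under control is to cap the retrieved context with an explicit token budget. The sketch below is illustrative only: `estimate_tokens` and `pack_chunks` are hypothetical helper names, and counting whitespace-separated words is a crude stand-in for the tokenizer of whatever model you use.

```python
from typing import List, Tuple


def estimate_tokens(text: str) -> int:
    """Rough token estimate: whitespace-separated words (an assumption, not a real tokenizer)."""
    return len(text.split())


def pack_chunks(ranked_chunks: List[str], budget_tokens: int) -> Tuple[List[str], int]:
    """Keep chunks in rank order until the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget_tokens:
            break  # stop at the first chunk that does not fit; lower-ranked chunks are dropped
        selected.append(chunk)
        used += cost
    return selected, used
```

Because the best budget shifts with chunk size and model context length, it is probably worth treating `budget_tokens` as a tuned parameter and re-measuring answer quality whenever either of those changes.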
Citation granularity matters. Citing the right chunk is not the same as citing the right sentence within the chunk. A chunk might contain the relevant fact alongside irrelevant or contradictory information. Citation accuracy (does the cited text actually support the claim?) is a separate metric from retrieval accuracy.
Where teams misuse RAG
RAG for everything. Many teams reach for RAG when a simpler solution would work: a well-structured system prompt with relevant context, a fine-tuned model, or a traditional search index. RAG adds latency, cost, and failure modes that are only justified when the knowledge base is too large or too dynamic to include in the prompt.
RAG without evaluation. Building a RAG pipeline without a test set is building blind. You need a set of query-answer pairs with known correct chunks to measure retrieval recall, answer accuracy, and citation precision. Without this, you cannot tell whether changes improve or degrade the system.
Production without monitoring. RAG systems drift: documents get added, removed, or updated; embedding models get deprecated; user queries change. Without production monitoring of retrieval quality and answer accuracy, you will not notice when the system degrades.
Practical build stages
Stage 1 — Ingest and chunk
Start with a simple strategy: split documents into chunks of 500–1000 tokens with 100–200 token overlap. Use a document parser that preserves structure (headings, lists, tables) where possible. Store each chunk with metadata: source document, section heading, chunk index, and character offset.
Test on at least 10 representative documents before committing to a chunking strategy. Measure the percentage of test queries with known answers for which the correct chunk appears in the top 3 retrieved results.
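A minimal sketch of the fixed-size strategy with overlap and per-chunk metadata might look like the following. The names `Chunk` and `chunk_text` are hypothetical, whitespace-separated words stand in for tokens (a real pipeline would count with the embedding model's tokenizer), and the section heading is assumed to be supplied by an upstream parser.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    source: str       # source document
    heading: str      # section heading, supplied by the caller
    index: int        # position of the chunk within the document
    char_offset: int  # character offset of the chunk start in the original text
    text: str


def chunk_text(text: str, source: str, heading: str = "",
               size: int = 500, overlap: int = 100) -> List[Chunk]:
    """Split text into windows of `size` words that overlap by `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    # Character spans of each word; words approximate tokens in this sketch.
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    chunks: List[Chunk] = []
    step = size - overlap
    start = 0
    while start < len(words):
        window = words[start:start + size]
        begin, end = window[0][0], window[-1][1]
        chunks.append(Chunk(source, heading, len(chunks), begin, text[begin:end]))
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
        start += step
    return chunks
```

Keeping `char_offset` makes it possible to map a retrieved chunk back to its exact location in the source document when checking citations later.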
Stage 2 — Retrieve
Choose an embedding model and vector store. Start with a hosted embedding model (text-embedding-3-small, voyage-2, or similar) and plain cosine-similarity vector search. Add hybrid search (BM25 + vector) only if pure vector search fails on exact-match queries.
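To make the baseline concrete, here is a sketch of brute-force cosine-similarity search in plain Python. It assumes the embeddings have already been computed by your provider and are held in memory as a dictionary keyed by chunk id; `cosine_similarity` and `top_k` are illustrative names, not part of any vector store's API.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: Sequence[float],
          chunk_vecs: Dict[str, Sequence[float]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Score every chunk against the query and return the k highest-scoring (id, score) pairs."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in chunk_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A linear scan like this is usually adequate for a test set of a few thousand chunks, which lets you measure recall before choosing a vector store.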
Measure top-1, top-3, and top-5 retrieval recall on your test set. If recall is below 70% on top-3, fix chunking or retrieval before moving to the answer stage.
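The recall measurement can be sketched as follows, using the definition given earlier (does a known-correct chunk appear in the top-k results?). The function name and the dictionary shapes are assumptions for illustration: `retrieved` maps each query id to its ranked chunk ids, and `relevant` maps each query id to the set of chunk ids known to contain the answer.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int) -> float:
    """Fraction of test queries whose top-k results include at least one known-correct chunk."""
    if not relevant:
        raise ValueError("the test set is empty")
    hits = 0
    for query_id, gold_chunks in relevant.items():
        top = set(retrieved.get(query_id, [])[:k])  # a query with no results counts as a miss
        if gold_chunks & top:
            hits += 1
    return hits / len(relevant)
```

Running it at k = 1, 3, and 5 on the same test set shows whether the correct chunk is missing entirely or merely ranked too low, which points to different fixes.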
Stage 3 — Answer and cite
Construct the prompt to include the retrieved chunks and ask the model to cite which chunk supports each claim. A simple format works: “Based on the following sources, answer the question. For each claim, cite the source number in brackets.”
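A minimal sketch of that prompt format is shown below. `build_prompt` is a hypothetical helper, and the layout (numbered sources, then the question) is one reasonable arrangement rather than a required one. It raises on an empty chunk list so that the "retrieval returned nothing" case has to be handled explicitly by the caller.

```python
from typing import List

INSTRUCTION = (
    "Based on the following sources, answer the question. "
    "For each claim, cite the source number in brackets."
)


def build_prompt(question: str, chunks: List[str]) -> str:
    """Number the retrieved chunks from 1 and place them ahead of the question."""
    if not chunks:
        raise ValueError("no chunks retrieved; handle the empty case before prompting")
    sources = "\n\n".join(f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1))
    return f"{INSTRUCTION}\n\nSources:\n{sources}\n\nQuestion: {question}\nAnswer:"
```

Numbering the sources in the prompt gives the model a stable label to cite and gives you something to parse when scoring citations.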
Measure answer accuracy (does the response contain correct claims?) and citation precision (does the cited source actually support the claim?). Both should exceed 80% before considering production.
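Citation checks can be partly automated. The sketch below assumes answers cite sources as bracketed numbers such as `[2]`, and that a human rater or LLM judge has already labeled each claim-citation pair as supported or not. All function names are illustrative.

```python
import re
from typing import List

CITATION = re.compile(r"\[(\d+)\]")


def cited_sources(answer: str) -> List[int]:
    """Source numbers cited in the answer, in order of appearance."""
    return [int(n) for n in CITATION.findall(answer)]


def invalid_citations(answer: str, num_sources: int) -> List[int]:
    """Citations that point at a source number that was never supplied in the prompt."""
    return [n for n in cited_sources(answer) if not 1 <= n <= num_sources]


def citation_precision(judgments: List[bool]) -> float:
    """Share of claim-citation pairs judged as supported; 0.0 when there are none."""
    if not judgments:
        return 0.0
    return sum(judgments) / len(judgments)
```

Out-of-range citations are cheap to detect and worth tracking separately, since they indicate a formatting or hallucination problem rather than a subtle support error.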
Stage 4 — Evaluate and iterate
Build a regression test set of 50–100 query-answer-source triples. Run it after every change to chunking, embedding, retrieval, prompt, or model. Track:
- Retrieval recall at top-3
- Answer accuracy (human-rated or LLM-as-judge)
- Citation precision
- End-to-end latency
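A regression check over the four tracked metrics could be sketched like this. The metric keys, the `find_regressions` name, and the tolerance values are all assumptions to adapt: quality metrics are compared with an absolute tolerance, latency with a relative one.

```python
from typing import Dict, List

# True means a higher value is better; latency is the only metric where lower is better.
HIGHER_IS_BETTER = {
    "recall_at_3": True,
    "answer_accuracy": True,
    "citation_precision": True,
    "latency_s": False,
}


def find_regressions(baseline: Dict[str, float],
                     current: Dict[str, float],
                     abs_tolerance: float = 0.02,
                     rel_tolerance: float = 0.02) -> List[str]:
    """Return the names of metrics that got worse by more than the allowed tolerance."""
    regressions: List[str] = []
    for name, higher_is_better in HIGHER_IS_BETTER.items():
        old, new = baseline[name], current[name]
        if higher_is_better:
            worse = new < old - abs_tolerance         # e.g. accuracy dropped by more than 2 points
        else:
            worse = new > old * (1 + rel_tolerance)   # e.g. latency grew by more than 2%
        if worse:
            regressions.append(name)
    return regressions
```

Running a check like this after each change makes it harder for a gain in one metric to hide a loss in another.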
Decision framework
| Question | If yes | If no |
|---|---|---|
| Is the knowledge base small enough to fit in a prompt? | Skip RAG, use system prompt | Build RAG |
| Do you have 50+ test query-answer-source triples? | Proceed to evaluate | Build test set first |
| Does top-3 retrieval recall exceed 70%? | Proceed to answer stage | Fix chunking or embedding |
| Does answer accuracy exceed 80%? | Consider production | Iterate on prompt or retrieval |
| Do you have production monitoring planned? | Go live | Plan monitoring first |
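For teams that want the table as an automated gate, the rows can be encoded directly. This is a sketch with a hypothetical function name and parameters; the thresholds and return strings are taken from the table above, checked top to bottom.

```python
def next_step(fits_in_prompt: bool,
              test_set_size: int,
              recall_at_3: float,
              answer_accuracy: float,
              monitoring_planned: bool) -> str:
    """Walk the decision table in order and return the first action that applies."""
    if fits_in_prompt:
        return "Skip RAG, use system prompt"
    if test_set_size < 50:
        return "Build test set first"
    if recall_at_3 <= 0.70:  # the table requires recall to exceed 70%
        return "Fix chunking or embedding"
    if answer_accuracy <= 0.80:  # the table requires accuracy to exceed 80%
        return "Iterate on prompt or retrieval"
    if not monitoring_planned:
        return "Plan monitoring first"
    return "Go live"
```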
Methodology and sources
This guide draws on published RAG evaluation frameworks (RAGAS, ARES, RGB), operational guidance from teams running production RAG systems, provider embedding and retrieval documentation, and academic literature on chunking and retrieval strategies.
- RAGAS evaluation framework: https://docs.ragas.io/ — checked 2026-05-24
- OpenAI text-embedding-3-small documentation: https://platform.openai.com/docs/guides/embeddings — checked 2026-05-24
- Anthropic context window guidance: https://docs.anthropic.com/en/docs/build-with-claude/context-windows — checked 2026-05-24
- Pinecone hybrid search documentation: https://docs.pinecone.io/reference/hybrid-search — checked 2026-05-24
- ARES evaluation framework paper: https://arxiv.org/abs/2311.09476 — checked 2026-05-24
Change log
2026-05-24 — First published version.