RAG evaluation: checking retrieval before blaming the model

RAG failures are often blamed on the model when the real problem is the retrieval layer.

That matters because retrieval, chunking, embeddings, reranking and prompt assembly can all fail before the model even has a fair shot. If the answer is bad, you need to know whether the wrong passages were retrieved, the right passages were not retrieved, or the model ignored the evidence.

A good RAG evaluation makes that visible.

TL;DR

If you are building RAG, test retrieval separately from generation.

Start by asking: did the system fetch the right passages, and did it fetch enough of them? Then ask whether the model used those passages correctly. If you skip the first question, you may spend weeks tuning prompts to fix a retrieval problem.

What to evaluate

Useful RAG evaluation usually covers three layers:

retrieval quality: did the system find relevant passages?
context quality: were the retrieved chunks readable, complete and not polluted with junk?
answer quality: did the final answer stay faithful to the retrieved evidence?

Those layers are different. A strong answer can hide weak retrieval. A weak answer can come from a weak retrieval set even if the model itself is fine.
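To make the retrieval layer measurable on its own, one common approach is to score the ranked passage IDs against a hand-labeled set of relevant IDs. The sketch below is a minimal illustration, not a prescribed method: the function names and the passage IDs are assumptions made up for this example, and it presumes you already have relevance labels for each question.

```python
from typing import Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant passages that appear in the top k results."""
    if not relevant:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for pid in top_k if pid in relevant) / len(top_k)


# Hypothetical example: these passage IDs are invented for illustration.
retrieved_ids = ["p7", "p2", "p9", "p4"]
relevant_ids = {"p2", "p4", "p5"}

# Recall answers "did it fetch enough of the right passages?"
print(recall_at_k(retrieved_ids, relevant_ids, k=3))
# Precision answers "how much of what it fetched was relevant?"
print(precision_at_k(retrieved_ids, relevant_ids, k=3))
```

Low recall points at missing evidence, while low precision points at polluted context. Neither number says anything about the generated answer, which is the point of scoring this layer separately.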

Typical failure modes

Common problems include:

the chunking is too coarse or too fine;
the embedding model does not fit the language or domain;
the retriever returns near-misses instead of the right evidence. Before evaluating, confirm that your retrieval architecture fits the problem: the guide "When semantic search is enough and when it is not" covers that decision, which shapes everything downstream;
reranking is missing or too weak;
the prompt gives too much irrelevant context;
the model answers from memory instead of the retrieved source;
the system returns a confident answer even when retrieval was poor.

A bad retrieval step can make even a very capable model look unreliable.
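One way to test whether reranking is missing or too weak is to compare a rank-sensitive metric on the same labeled questions before and after the reranker runs. The sketch below uses mean reciprocal rank (MRR) as that metric. The question IDs, passage IDs and rankings are hypothetical, and MRR is only one reasonable choice here.

```python
from typing import Dict, List, Set


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant passage, or 0.0 if none was returned."""
    for position, pid in enumerate(ranked, start=1):
        if pid in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    runs: Dict[str, List[str]], labels: Dict[str, Set[str]]
) -> float:
    """Average reciprocal rank over every labeled question."""
    if not labels:
        return 0.0
    total = sum(reciprocal_rank(runs.get(q, []), rel) for q, rel in labels.items())
    return total / len(labels)


# Hypothetical labels and rankings, invented for illustration.
labels = {"q1": {"p2"}, "q2": {"p8"}}
before_rerank = {"q1": ["p5", "p2", "p1"], "q2": ["p3", "p4", "p8"]}
after_rerank = {"q1": ["p2", "p5", "p1"], "q2": ["p4", "p3", "p8"]}

mrr_before = mean_reciprocal_rank(before_rerank, labels)
mrr_after = mean_reciprocal_rank(after_rerank, labels)
print(f"MRR before: {mrr_before:.3f}, after: {mrr_after:.3f}")
```

If the score does not improve after reranking, or gets worse, the reranker is a more likely suspect than the model.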

What to measure

A practical evaluation set should record:

the question;
the expected evidence source;
the retrieved passages;
whether the retrieved passages were relevant and included the expected evidence;
whether the answer was faithful to the evidence;
whether the final answer was complete enough for the user.

If you want one simple rule, keep this one: inspect the retrieved context before you inspect the answer.
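The fields listed above could be captured in a small record type, with a triage step that looks at retrieval before it looks at the answer. This is a minimal sketch under stated assumptions: `EvalRecord`, `triage` and the label strings are hypothetical names, the faithfulness and completeness flags are assumed to come from a human reviewer or a grader of your choice, and every record is assumed to have at least one expected source.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class EvalRecord:
    question: str
    expected_sources: Set[str]    # IDs of passages that should be retrieved
    retrieved_sources: List[str]  # IDs the retriever actually returned
    answer_faithful: bool         # judged outside this code (human or grader)
    answer_complete: bool         # judged outside this code (human or grader)


def triage(record: EvalRecord) -> str:
    """Label a record, checking retrieval before looking at the answer."""
    found = record.expected_sources & set(record.retrieved_sources)
    if not found:
        return "retrieval_miss"
    if len(found) < len(record.expected_sources):
        return "partial_retrieval"
    # Only once retrieval looks sound is the answer itself examined.
    if not record.answer_faithful:
        return "unfaithful_answer"
    if not record.answer_complete:
        return "incomplete_answer"
    return "ok"


# Hypothetical record, invented for illustration.
record = EvalRecord(
    question="What is the refund window?",
    expected_sources={"policy-12"},
    retrieved_sources=["faq-3", "policy-12"],
    answer_faithful=False,
    answer_complete=True,
)
print(triage(record))
```

Because the retrieval checks run first, a record is never blamed on generation when the expected evidence was absent from the context.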

Practical checks

Before you change the model, check:

Are the right passages being retrieved at all?
Are the retrieved chunks split in a way that preserves meaning?
Is the reranker helping or hurting?
Is the prompt too crowded with irrelevant context?
Are you measuring faithfulness, not just fluency?
Do you have examples where the system should refuse or say “not enough evidence”?

A RAG system that always answers is not necessarily a good RAG system.
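To check refusal behavior, an evaluation set can include questions the corpus cannot answer and count how often the system declines. The sketch below is a rough illustration only: it assumes refusals can be detected by phrase matching, which is a simplification, and the marker phrases and example answers are invented. A structured "no answer" flag or a grader would be more robust.

```python
from typing import List, Tuple

# Assumption: refusals are recognizable by these phrases. Adjust to your system.
REFUSAL_MARKERS = ("not enough evidence", "cannot answer", "no relevant information")


def is_refusal(answer: str) -> bool:
    """True if the answer text contains one of the assumed refusal phrases."""
    text = answer.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rates(cases: List[Tuple[str, bool]]) -> Tuple[float, float]:
    """Return (correct refusal rate, false refusal rate).

    Each case is (answer_text, has_evidence), where has_evidence says
    whether the corpus actually contains what is needed to answer.
    """
    should_refuse = [text for text, has_evidence in cases if not has_evidence]
    should_answer = [text for text, has_evidence in cases if has_evidence]
    correct = sum(is_refusal(text) for text in should_refuse)
    false = sum(is_refusal(text) for text in should_answer)
    correct_rate = correct / len(should_refuse) if should_refuse else 0.0
    false_rate = false / len(should_answer) if should_answer else 0.0
    return correct_rate, false_rate


# Hypothetical cases, invented for illustration.
cases = [
    ("The refund window is 30 days.", True),
    ("There is not enough evidence in the sources to answer.", False),
    ("The policy was updated in 2019.", False),  # answered without evidence
]
print(refusal_rates(cases))
```

A low correct refusal rate means the system answers when it should decline. A high false refusal rate means it declines when the evidence was available.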

What this page cannot tell you

This page cannot tell you which embedding model or vector store is best for your workload.

It cannot tell you:

how many chunks to retrieve in your specific case;
whether your chunk size is ideal;
whether your corpus needs metadata filters or reranking;
whether the model is hallucinating or simply answering a bad question;
whether your user actually needs RAG at all.

It can only help you separate retrieval problems from generation problems.

Methodology

Data checked: 2026-05-28
Sources consulted: RAG evaluation framework documentation (Ragas, DeepEval), OpenAI Evals repository, LangSmith evaluation guides, retrieval quality measurement literature
Assumptions: Evaluation metrics vary by tool and framework. Retrieval quality is domain-specific. This page reports no hands-on benchmark numbers.
Limitations: A faithful answer can still be unhelpful if the question is poorly defined. This guide covers evaluation design, not specific tool configuration or benchmark results.
Jurisdiction: Global. This is operational guidance applicable regardless of jurisdiction.

Source list

Ragas documentation — https://docs.ragas.io/ (accessed 2026-05-28)
DeepEval documentation — https://docs.confident-ai.com/ (accessed 2026-05-28)
OpenAI Evals repository — https://github.com/openai/evals (accessed 2026-05-28)
LangSmith RAG evaluation guidance — https://docs.smith.langchain.com/ (accessed 2026-05-28)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: Full editorial review against 16-gate checklist: added third Editor’s Note, slugified all H2 IDs, added Trust Stack and proper Methodology sections, added source access dates, fixed <aside> format with class attribute, corrected broken relative guide links, and moved jurisdiction into Methodology.
2026-05-22: First published. Initial draft with retrieval-first evaluation model and clear split between retrieval and generation failures.