theLLMs

Last checked: 2026-05-22

Scope: Global. RAG evaluation docs and related source references were checked on 2026-05-22; this is operational guidance, not a vendor ranking.

AI draft model: gpt-5.4-mini

AI review model: llm-editor (deepseek-v4-pro)

RAG evaluation: checking retrieval before blaming the model

RAG failures are often blamed on the model when the real problem is the retrieval layer.

That matters because chunking, embedding, retrieval, reranking and prompt assembly can each fail before the model even has a fair shot. If the answer is bad, you need to know which of three things happened: the wrong passages were retrieved, the right passages were missed, or the model ignored good evidence.

A good RAG evaluation makes that visible.

Quick answer

If you are building RAG, test retrieval separately from generation.

Start by asking: did the system fetch the right passages, and did it fetch enough of them? Then ask whether the model used those passages correctly. If you skip the first question, you may spend weeks tuning prompts to fix a retrieval problem.
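The two retrieval questions above can be scored without calling the model at all. The following is a minimal sketch, assuming each passage has a stable ID and that you have hand-labelled which passage IDs count as relevant for each question. The function names and the sample IDs are illustrative, not taken from any particular framework.

```python
from typing import Sequence, Set


def recall_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the labelled relevant passages found in the top k results.

    Answers "did it fetch enough of them?".
    """
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def hit_at_k(retrieved_ids: Sequence[str], relevant_ids: Set[str], k: int) -> bool:
    """True if at least one relevant passage appears in the top k results.

    Answers "did it fetch the right passages at all?".
    """
    return any(pid in relevant_ids for pid in retrieved_ids[:k])


# Hypothetical data: the retriever's ranked output and the labelled evidence.
retrieved = ["p7", "p2", "p9", "p4"]
relevant = {"p2", "p4"}

print(recall_at_k(retrieved, relevant, k=3))  # 0.5 (only p2 is in the top 3)
print(hit_at_k(retrieved, relevant, k=1))     # False (p7 is not relevant)
```

Both numbers come from the retriever's output alone, so a low score here points at retrieval before any prompt tuning starts.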

What to evaluate

Useful RAG evaluation usually covers three layers:

  • retrieval quality: did the system find relevant passages?
  • context quality: were the retrieved chunks readable, complete and not polluted with junk?
  • answer quality: did the final answer stay faithful to the retrieved evidence?

These layers can fail independently. A strong answer can hide weak retrieval, for example when the model answers from memory. A weak answer can come from a weak retrieval set even if the model itself is fine.
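One way to keep the layers apart is to label each test case with two independent judgements and derive the diagnosis from the pair. This is a sketch under the assumption that a reviewer (or a separate check) has already decided whether the expected evidence was retrieved and whether the answer was correct. The category names are hypothetical.

```python
from enum import Enum


class Diagnosis(Enum):
    OK = "evidence retrieved, answer correct"
    GENERATION_FAILURE = "evidence retrieved, answer wrong"
    RETRIEVAL_MISS = "evidence missing, answer wrong"
    LUCKY_ANSWER = "evidence missing, answer correct anyway"


def diagnose(evidence_retrieved: bool, answer_correct: bool) -> Diagnosis:
    """Attribute a test case to the layer that most likely failed."""
    if evidence_retrieved and answer_correct:
        return Diagnosis.OK
    if evidence_retrieved:
        # The model had the evidence and still got it wrong.
        return Diagnosis.GENERATION_FAILURE
    if answer_correct:
        # A strong answer hiding weak retrieval: do not count this as a pass.
        return Diagnosis.LUCKY_ANSWER
    return Diagnosis.RETRIEVAL_MISS


print(diagnose(evidence_retrieved=False, answer_correct=True).name)  # LUCKY_ANSWER
```

Counting the LUCKY_ANSWER cases separately keeps answer-only accuracy from overstating how well retrieval works.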

Typical failure modes

Common problems include:

  1. the chunks are too large or too small, so a passage either buries the relevant sentence or loses the context around it;
  2. the embedding model does not fit the language or domain;
  3. the retriever returns near-misses instead of the right evidence;
  4. reranking is missing or too weak;
  5. the prompt includes too much irrelevant context;
  6. the model answers from memory instead of the retrieved source;
  7. the system returns a confident answer even when retrieval was poor.

A bad retrieval step can make even a very capable model look unreliable.
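Failure modes 3 and 4 can be checked by looking at where the expected evidence lands in the ranking, before and after reranking. The sketch below uses mean reciprocal rank (MRR), a standard ranking metric. It assumes you can capture both the retriever's raw ordering and the reranked ordering for the same questions, and the passage IDs are made up for illustration.

```python
from typing import Sequence, Set


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / position of the first relevant passage, or 0.0 if none was returned."""
    for position, pid in enumerate(ranked_ids, start=1):
        if pid in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[str]], relevant: Sequence[Set[str]]
) -> float:
    """Average reciprocal rank over a set of questions."""
    if not rankings or len(rankings) != len(relevant):
        raise ValueError("need one non-empty ranking per relevance set")
    total = sum(reciprocal_rank(r, rel) for r, rel in zip(rankings, relevant))
    return total / len(rankings)


# Hypothetical orderings for two questions, before and after reranking.
relevant = [{"p1"}, {"q1"}]
before = [["p5", "p3", "p1"], ["q2", "q1"]]
after = [["p1", "p5", "p3"], ["q1", "q2"]]

delta = mean_reciprocal_rank(after, relevant) - mean_reciprocal_rank(before, relevant)
print(f"MRR change from reranking: {delta:+.3f}")  # positive means the reranker helped
```

A delta near zero or below zero on your own evaluation set suggests the reranker is not earning its latency.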

What to measure

A practical evaluation set should record:

  • the question;
  • the expected evidence source;
  • the retrieved passages;
  • whether the retrieved passages were relevant to the question;
  • whether the answer was faithful to the evidence;
  • whether the final answer was complete enough for the user.

If you want one simple rule, keep this one: inspect the retrieved context before you inspect the answer.
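The record described above maps naturally onto a small data structure. This is one possible shape, not a standard schema: the field and method names are assumptions, and the only logic it adds is refusing to store an answer judgement until the retrieved context has been reviewed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EvalRecord:
    question: str
    expected_source: str                       # where the evidence should come from
    retrieved_passages: List[str] = field(default_factory=list)
    evidence_relevant: Optional[bool] = None   # None means "not reviewed yet"
    answer_faithful: Optional[bool] = None
    answer_complete: Optional[bool] = None

    def record_retrieval_review(self, relevant: bool) -> None:
        """Store the reviewer's judgement of the retrieved passages."""
        self.evidence_relevant = relevant

    def record_answer_review(self, faithful: bool, complete: bool) -> None:
        """Store the answer judgement, but only after the context was inspected."""
        if self.evidence_relevant is None:
            raise RuntimeError("inspect the retrieved context before the answer")
        self.answer_faithful = faithful
        self.answer_complete = complete


# Hypothetical example record.
record = EvalRecord(
    question="What is the refund window?",
    expected_source="policies/refunds.md",
    retrieved_passages=["Refunds are accepted within 30 days of delivery."],
)
record.record_retrieval_review(relevant=True)
record.record_answer_review(faithful=True, complete=False)
```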

Practical checks

Before you change the model, check:

  • Are the right passages being retrieved at all?
  • Are the retrieved chunks split in a way that preserves meaning?
  • Is the reranker helping or hurting?
  • Is the prompt too crowded with irrelevant context?
  • Are you measuring faithfulness, not just fluency?
  • Do you have examples where the system should refuse or say “not enough evidence”?

A RAG system that always answers is not necessarily a good RAG system.
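A simple guard for the "not enough evidence" case is to look at retrieval scores before generating anything. The sketch below assumes the retriever returns similarity scores where higher means a closer match. The threshold values are placeholders that would need calibrating against your own refusal examples, and the function name is hypothetical.

```python
from typing import Sequence


def should_abstain(
    scores: Sequence[float], min_score: float = 0.5, min_hits: int = 1
) -> bool:
    """Return True when too few retrieved passages clear the score threshold.

    min_score and min_hits are illustrative defaults, not recommended values.
    """
    strong_hits = sum(1 for score in scores if score >= min_score)
    return strong_hits < min_hits


print(should_abstain([0.21, 0.18]))  # True: nothing clears the threshold
print(should_abstain([0.82, 0.30]))  # False: one strong passage was found
```

A score gate like this only catches the cases where retrieval is visibly weak. It does not replace test questions whose correct response is a refusal.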

What this page cannot tell you

This page cannot tell you which embedding model or vector store is best for your workload.

It cannot tell you:

  • how many chunks to retrieve in your specific case;
  • whether your chunk size is ideal;
  • whether your corpus needs metadata filters or reranking;
  • whether the model is hallucinating or simply answering a poorly defined question;
  • whether your user actually needs RAG at all.

It can only help you separate retrieval problems from generation problems.

Global applicability

This article applies globally. There is no regional split, such as UK, GB or Northern Ireland, to apply here.

The useful caution is universal: if retrieval is broken, generation metrics can mislead you.

Methodology and sources

Check date: 2026-05-22

What was checked: RAG evaluation documentation, retrieval quality references and evaluation tooling docs.

What the sources were used for:

  • the distinction between retrieval, context quality and answer faithfulness;
  • common failure modes in chunking, embeddings and reranking;
  • the need to inspect retrieved evidence before judging answer quality.

Assumptions and limits:

  • evaluation metrics vary by tool and framework;
  • retrieval quality is domain-specific;
  • this page does not claim hands-on benchmark numbers;
  • a faithful answer can still be unhelpful if the question is poorly defined.

Change log

  • 2026-05-22: first draft built from the llm-editor-approved brief, with a retrieval-first evaluation model and a clear split between retrieval and generation failures.

Source list