Hallucination testing: how to build a small regression set
A hallucination test set is a small collection of prompts and expected answers that helps you catch unsupported claims before they reach users.
The point is not to build a perfect benchmark. The point is to keep a repeatable set of cases that reflect the work the product actually does, so you can see when a change makes the model more confident, less faithful or more likely to invent details.
Quick answer
If your LLM output matters to users, keep a small regression set that you can run after prompt, model or retrieval changes.
The set should include real user tasks, known hard cases and examples where the correct answer is “I do not know” or “not enough evidence”. That makes it harder for a shiny new model to look better than it is.
What belongs in the set
A useful small set usually includes:
- common, ordinary user questions;
- edge cases that have failed before;
- prompts that tempt the model to guess;
- prompts with missing context;
- prompts where refusal or uncertainty is the correct outcome;
- prompts that need grounding in a source, policy or database record.
If every test case is easy, the set will not catch the regressions that matter.
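One way to check that the set is not made up only of easy cases is to tag each case with a category and list the categories that have no coverage. The sketch below is a minimal illustration: the category labels and the `missing_categories` helper are hypothetical names that mirror the list above, not part of any evaluation tool.

```python
# Hypothetical category labels mirroring the list above; rename to suit your product.
REQUIRED_CATEGORIES = {
    "common",
    "past_failure",
    "tempts_guess",
    "missing_context",
    "abstain_expected",
    "needs_grounding",
}


def missing_categories(cases):
    """Return the required categories that no test case covers, sorted by name."""
    seen = {case["category"] for case in cases}
    return sorted(REQUIRED_CATEGORIES - seen)


# Illustrative cases only; real cases would also carry a prompt and an expected answer.
cases = [
    {"id": "c1", "category": "common"},
    {"id": "c2", "category": "past_failure"},
    {"id": "c3", "category": "abstain_expected"},
]

print(missing_categories(cases))
# ['missing_context', 'needs_grounding', 'tempts_guess']
```

An empty result does not prove the set is strong. It only shows that each kind of case appears at least once.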
How to build it
A practical build process looks like this:
- collect real examples from support, logs or product reviews;
- remove sensitive data;
- write the expected answer or expected behaviour;
- mark which source or rule the answer depends on;
- decide what counts as a pass, a partial pass or a fail;
- keep the set small enough that people will actually run it.
A tiny set that is used regularly beats a huge set that sits untouched.
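The build steps above map naturally onto one record per test case. The following is a sketch under assumptions: the `RegressionCase` fields, the JSON Lines storage format and the sample refund-policy case are all hypothetical, chosen only to show where the expected behaviour, the source dependency and the pass rules might live.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class RegressionCase:
    """One test case; field names are illustrative, not a standard schema."""
    case_id: str
    prompt: str              # sanitised prompt, with sensitive data already removed
    expected_behaviour: str  # expected answer or expected behaviour
    depends_on: str          # source or rule the answer depends on
    pass_rule: str           # what counts as a pass
    partial_rule: str = ""   # what counts as a partial pass, if anything


def to_jsonl(cases):
    """Serialise cases to JSON Lines so the set can be reviewed and diffed as text."""
    return "\n".join(json.dumps(asdict(c), ensure_ascii=False) for c in cases)


def from_jsonl(text):
    """Load cases back from JSON Lines, skipping blank lines."""
    return [RegressionCase(**json.loads(line)) for line in text.splitlines() if line.strip()]


# Hypothetical example case; the policy name and wording are made up for illustration.
example = RegressionCase(
    case_id="refund-001",
    prompt="What is the refund window for annual plans?",
    expected_behaviour="States the window given in the refund policy and cites it.",
    depends_on="refund policy document",
    pass_rule="Window matches the policy and the policy is cited.",
    partial_rule="Window matches the policy but no citation is given.",
)

restored = from_jsonl(to_jsonl([example]))
print(restored[0].case_id)
```

Keeping the set in a plain-text format like this is one option among several. The aim is that a reviewer can read the pass rule next to the prompt it applies to.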
What to watch for
Hallucination testing should catch more than obvious factual errors.
Watch for:
- invented details;
- unsupported certainty;
- mixing up similar entities or dates;
- missing citations or source drift;
- unsafe advice where caution was required;
- confident answers to underspecified prompts.
A model can sound polished while still being wrong. The test set should reward correctness, not confidence.
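For cases where uncertainty is the correct outcome, a rough automated first pass can flag responses that answer confidently anyway. The sketch below is a crude heuristic, not a hallucination detector: the marker phrases and the `check_abstention` helper are assumptions, and string matching will miss paraphrases, so uncertain results should still go to a human reviewer.

```python
# Hypothetical marker phrases; tune them to how your product words uncertainty.
ABSTAIN_MARKERS = (
    "i do not know",
    "i don't know",
    "not enough evidence",
)


def abstained(response: str) -> bool:
    """Return True if the response contains one of the uncertainty markers."""
    # Lower-case and normalise curly apostrophes so "I don’t know" also matches.
    text = response.lower().replace("\u2019", "'")
    return any(marker in text for marker in ABSTAIN_MARKERS)


def check_abstention(expect_abstain: bool, response: str) -> str:
    """Grade only the cases where abstaining is the expected behaviour."""
    if not expect_abstain:
        # Other cases need a support check against their source, not this heuristic.
        return "not_applicable"
    return "pass" if abstained(response) else "fail"


print(check_abstention(True, "There is not enough evidence in the provided documents."))
print(check_abstention(True, "The warranty lasts exactly five years."))
```

The second call fails because the response gives a confident answer where the case expected uncertainty. This is the kind of result that rewards correctness over confidence.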
A simple pass metric
One useful planning metric is:
Pass rate = supported responses / total test cases
That number is only useful if the test cases are stable and the pass criteria, including how a partial pass is counted, are written down. Otherwise the score becomes theatre.
Treat the metric as a trend, not a verdict.
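As a worked example of the metric, the sketch below scores two runs of the same 20-case set and compares them. It assumes each case has already been graded as "pass", "partial" or "fail", and that only a full pass counts as a supported response. Both the labels and that counting rule are assumptions, and the result lists are made-up numbers for illustration.

```python
from collections import Counter


def pass_rate(results):
    """Supported responses divided by total test cases.

    Assumption: only "pass" counts as supported; "partial" and "fail" do not.
    """
    if not results:
        raise ValueError("no results to score")
    return Counter(results)["pass"] / len(results)


# Illustrative grades for the same 20 cases before and after a change.
baseline = ["pass"] * 17 + ["partial"] * 2 + ["fail"] * 1
candidate = ["pass"] * 15 + ["partial"] * 2 + ["fail"] * 3

print(f"baseline {pass_rate(baseline):.0%}, candidate {pass_rate(candidate):.0%}")
# baseline 85%, candidate 75%
```

A drop like this is a prompt to look at which cases changed, not a verdict on the model. Comparing per-case results between runs usually says more than the single number.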
Practical decision check
Before you rely on a regression set, ask:
- Are these real tasks rather than toy examples?
- Do we have a written pass/fail rule?
- Are the hardest cases actually represented?
- Can the set catch the failure modes we care about?
- Will the team run it after changes?
- Is there a human review path for uncertain cases?
If the answer to any of those is no, the set is probably too weak.
What this page cannot tell you
This page cannot tell you the exact right size for your regression set.
It cannot tell you:
- which prompts to include for your specific business;
- what pass threshold you should use;
- whether the model is safe enough for a legal, medical or financial workflow;
- whether your source of truth is clean enough;
- whether a failure is caused by the prompt, the retrieval layer or the model itself.
It can only help you build a small guardrail instead of hoping the model behaves.
Global applicability
This article is global. There is no UK, GB or Northern Ireland split to apply here.
The useful caution is universal: if the output can mislead users, test it before you ship changes.
Methodology and sources
Check date: 2026-05-22
What was checked: evaluation-tool documentation and benchmark methodology references.
What the sources were used for:
- building a small, repeatable regression set;
- separating supported answers from confident guesses;
- keeping pass/fail criteria explicit and stable.
Assumptions and limits:
- no regression set can cover every case;
- scores depend on the quality of the expected answer;
- this page is based on a documentation review, not on a hands-on local test run;
- safety-critical domains need additional review beyond a small set.
Change log
- 2026-05-22: first draft built from the llm-editor-approved brief, with a small regression-set workflow, a support-vs-guess framing and a simple pass metric.
Source list
- OpenAI Evals repository — https://github.com/openai/evals
- DeepEval documentation — https://docs.confident-ai.com/
- Ragas documentation — https://docs.ragas.io/
- lm-evaluation-harness repository — https://github.com/EleutherAI/lm-evaluation-harness