# Hallucination testing: how to build a small regression set

A hallucination test set is a small collection of prompts and expected answers that helps you catch unsupported claims before they reach users.

The point is not to build a perfect benchmark. The point is to keep a repeatable set of cases that reflect the work the product actually does, so you can see when a change makes the model more confident, less faithful or more likely to invent details.

Editor's Note: "Hallucination" is a useful shorthand, but the real issue is usually unsupported or ungrounded claims. If the same prompt starts failing after a change, the regression set has done its job: it caught drift before users did.

Editor's Note: If every test case in your set is easy, the set will not catch the regressions that matter. Include at least 20% of cases where the correct answer is "I don't know," "not enough evidence," or a refusal. A model that answers everything confidently will look good on easy cases and fail silently on the hard ones.
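The 20% guideline in the note above is easy to check mechanically. The following is a minimal sketch, assuming each case carries a hypothetical `expected_outcome` label; the label names and the `Case` type are illustrative, not part of any evaluation library.

```python
from dataclasses import dataclass

# Hypothetical labels for cases where the right move is to abstain or refuse.
ABSTAIN_OUTCOMES = {"dont_know", "not_enough_evidence", "refusal"}


@dataclass
class Case:
    prompt: str
    expected_outcome: str  # "answer" or one of ABSTAIN_OUTCOMES


def abstention_share(cases: list[Case]) -> float:
    """Fraction of cases whose correct outcome is to abstain or refuse."""
    if not cases:
        return 0.0
    abstain = sum(1 for c in cases if c.expected_outcome in ABSTAIN_OUTCOMES)
    return abstain / len(cases)


def has_enough_abstention_cases(cases: list[Case], minimum: float = 0.20) -> bool:
    """True when the set meets the minimum share of abstention cases."""
    return abstention_share(cases) >= minimum
```

Running a check like this whenever cases are added or removed keeps the share from eroding as the set grows.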

## TL;DR

If your LLM output matters to users, keep a small regression set that you can run after prompt, model or retrieval changes.

The set should include real user tasks, known hard cases and examples where the correct answer is “I do not know” or “not enough evidence”. That makes it harder for a shiny new model to look better than it is.

## What belongs in the set

A useful small set usually includes:

- common, ordinary user questions;
- edge cases that have failed before;
- prompts that tempt the model to guess;
- prompts with missing context;
- prompts where refusal or uncertainty is the correct outcome;
- prompts that need grounding in a source, policy or database record.

If every test case is easy, the set will not catch the regressions that matter.
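One way to keep that mix honest is to tag each case with the kind of case it is and count the tags. This is a sketch under assumptions: the tag names below are hypothetical shorthand for the six kinds listed above, and cases are plain dictionaries with a `category` key.

```python
from collections import Counter

# Hypothetical tags, one per kind of case listed above.
CATEGORIES = (
    "common",
    "past_failure",
    "tempts_guess",
    "missing_context",
    "abstain",
    "needs_grounding",
)


def category_coverage(cases: list[dict]) -> dict[str, int]:
    """Count how many cases carry each known category tag."""
    counts = Counter(case["category"] for case in cases)
    return {name: counts.get(name, 0) for name in CATEGORIES}


def missing_categories(cases: list[dict]) -> list[str]:
    """Category tags that no case in the set uses yet."""
    return [name for name, n in category_coverage(cases).items() if n == 0]
```

An empty result from `missing_categories` does not prove the set is strong, but a non-empty one points at a gap worth filling.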

## How to build it

A practical build process looks like this:

1. collect real examples from support, logs or product reviews;
2. remove sensitive data;
3. write the expected answer or expected behaviour;
4. mark which source or rule the answer depends on;
5. decide what counts as a pass, a partial pass or a fail;
6. keep the set small enough that people will actually run it.

A tiny set that is used regularly beats a huge set that sits untouched.
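Steps 2 to 5 can be captured in a small record type that is stored as one JSON object per line. The sketch below is illustrative only: the field names are hypothetical, and the email pattern stands in for real redaction, which usually needs to handle names, phone numbers, account IDs and more.

```python
import json
import re
from dataclasses import asdict, dataclass

# Deliberately simple email pattern; an assumption for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact(text: str) -> str:
    """Step 2: replace email addresses with a placeholder."""
    return EMAIL.sub("[EMAIL]", text)


@dataclass
class RegressionCase:
    case_id: str
    prompt: str
    expected_behaviour: str  # step 3: expected answer or behaviour
    depends_on: str          # step 4: source or rule the answer relies on
    pass_rule: str           # step 5: written pass / partial / fail rule


def to_jsonl(cases: list[RegressionCase]) -> str:
    """Serialise cases as JSON Lines so the set diffs cleanly in version control."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


case = RegressionCase(
    case_id="refund-001",
    prompt=redact("Customer jane.doe@example.com asks about the refund window."),
    expected_behaviour="States the refund window from the policy, nothing more.",
    depends_on="refund policy document",
    pass_rule="pass if the window matches the policy; fail if any other figure appears",
)
```

Keeping the pass rule next to the prompt means a reviewer can judge a borderline response without asking whoever wrote the case.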

## What to watch for

Hallucination testing should catch more than obvious factual errors.

Watch for:

- invented details;
- unsupported certainty;
- mixing up similar entities or dates;
- missing citations or source drift;
- unsafe advice where caution was required;
- confident answers to underspecified prompts.

A model can sound polished while still being wrong. The test set should reward correctness, not confidence.
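A grader that rewards correctness rather than confidence can start out very crude. The sketch below assumes each case lists phrases a supported answer must contain and phrases that would signal an invented or wrong detail; substring matching is a stand-in for whatever comparison a team actually trusts, and the pass, partial and fail labels are one possible scheme.

```python
def grade(response: str, required: list[str], forbidden: list[str]) -> str:
    """Return "pass", "partial" or "fail" for one response.

    Tone plays no part: a confident answer that contains a forbidden
    claim fails, however polished it sounds.
    """
    text = response.lower()
    # Any forbidden claim is an immediate fail.
    if any(phrase.lower() in text for phrase in forbidden):
        return "fail"
    hits = sum(1 for phrase in required if phrase.lower() in text)
    if hits == len(required):
        return "pass"
    return "partial" if hits > 0 else "fail"
```

Substring checks miss paraphrases and can be fooled by negation, so uncertain results are better routed to a human reviewer than trusted outright.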

## A simple pass metric

One useful planning metric is:

Pass rate = supported responses / total test cases

That number is only useful if the test cases are stable and the pass criteria are written down. Otherwise the score becomes theatre, the same risk that makes LLM-as-a-judge grading unreliable without a written rubric.

Treat the metric as a trend, not a verdict.
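To read the metric as a trend, compare a run against the previous one and look at which cases flipped, not only at the headline number. A minimal sketch, assuming each run is stored as a mapping from a hypothetical case ID to whether the response was supported:

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Supported responses divided by total test cases."""
    if not results:
        raise ValueError("cannot compute a pass rate for an empty run")
    return sum(results.values()) / len(results)


def newly_failing(baseline: dict[str, bool], candidate: dict[str, bool]) -> list[str]:
    """Case IDs that passed in the baseline run and fail in the candidate run."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and candidate.get(case_id) is False
    )


baseline = {"a": True, "b": True, "c": False, "d": True}
candidate = {"a": True, "b": False, "c": False, "d": True}
```

Two runs with the same pass rate can still hide a regression if one case started failing while another started passing, which is why the list of flipped cases matters more than the ratio.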

## Practical decision check

Before you rely on a regression set, ask:

- Are these real tasks or toy examples?
- Do we have a written pass/fail rule?
- Are the hardest cases actually represented?
- Can the set catch the failure modes we care about?
- Will the team run it after changes?
- Is there a human review path for uncertain cases?

If the answer to any of those is no, the set is probably too weak.

Editor's Note: The question "will the team run it after changes?" is the one that determines whether the set is a safety net or a checkbox. If running the full set takes more than 5 minutes, keep a subset of the 10–15 highest-risk cases that can be run in under a minute. The fast subset gets used; the full set gets skipped.
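Picking the fast subset can be as simple as sorting by a risk score that the team assigns by hand. The sketch below assumes each case is a dictionary with a hypothetical numeric `risk` field; how that score is set is a team decision and not something this page prescribes.

```python
def fast_subset(cases: list[dict], size: int = 15) -> list[dict]:
    """Return the `size` highest-risk cases, highest first.

    Python's sort is stable, so cases with equal risk keep their
    original relative order.
    """
    ranked = sorted(cases, key=lambda case: case["risk"], reverse=True)
    return ranked[:size]
```

The subset is a convenience for quick checks after small changes; the full set still needs a scheduled run so that lower-risk cases are not silently dropped.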

## Caveats and scope boundaries

No regression set can cover every case. Scores depend on the quality of the expected answer.

This page provides operational guidance for building a small regression set. It does not cover comprehensive benchmark methodology or safety-critical domain requirements. Safety-critical domains (legal, medical, financial) need additional review beyond a small regression set.

This page cannot tell you the exact right size for your set, which prompts to include for your specific business, or what pass threshold to use. It can only help you build a small guardrail instead of hoping the model behaves.

## Methodology

- Data checked: 2026-05-28
- Sources consulted: OpenAI Evals repository, DeepEval documentation, Ragas documentation, lm-evaluation-harness repository
- Assumptions: The reader operates an LLM feature and needs a practical, maintainable regression-testing workflow
- Limitations: This article covers small-scale regression testing for hallucination detection, not comprehensive benchmark evaluation or safety assurance for regulated domains
- Jurisdiction: Global. No jurisdiction-specific regulatory content

## Source list

- OpenAI Evals repository: https://github.com/openai/evals (accessed 2026-05-28)
- DeepEval documentation: https://docs.confident-ai.com/ (accessed 2026-05-28)
- Ragas documentation: https://docs.ragas.io/ (accessed 2026-05-28)
- lm-evaluation-harness repository: https://github.com/EleutherAI/lm-evaluation-harness (accessed 2026-05-28)

## Related guides

- How LLM benchmarks work, and what they miss
- lm-eval-harness explained for non-researchers
- RAG evaluation: checking retrieval before blaming the model
- Structured outputs and JSON mode: reliability limits

## Trust Stack

- Last checked: 2026-05-28
- Corrections: Contact us to report errors

## Change log

- 2026-05-28: Full editorial review against 16-gate checklist. Added 3 Editor’s Note asides (fixed class attribute on existing, added 1). Added Methodology, Source list with access dates, Trust Stack, slugified heading IDs, and Caveats section. Fixed frontmatter writtenBy label. Folded Global applicability and What this page cannot tell you into Caveats.
- 2026-05-22: First draft built from editorial brief, with regression-set workflow, support-vs-guess framing, and simple pass metric.