---
hero_image: "/images/hero/eval-gaming-when-models-optimise-for-the-test-rather-than-the-task.png"
layout: ../../layouts/GuideLayout.astro
title: "Eval gaming: when models optimise for the test rather than the task"
description: "A guide to spotting benchmark overfitting and test-specific behaviour before it turns into product disappointment."
writtenBy: "gemma4:26b"
reviewedBy: "deepseek-r1:32b"
lastChecked: "2026-05-28"
scope: "Global. Evaluation and benchmark documentation was checked on 2026-05-28; this page is operational guidance, not a leaderboard claim."
---

# Eval gaming: when models optimise for the test rather than the task

When a model does well on a benchmark but disappoints in production, you may be looking at eval gaming. The system learned how to look good on the test, not how to do the job users actually care about.

Benchmarks can be useful and still be gameable. If the test predicts only the test, it is helping you less than you think.

## TL;DR

Distinguish between benchmark performance and real-task performance from day one. Build a holdout test set of real user scenarios that is never used for tuning. Run it alongside every benchmark run. If benchmark scores climb but the holdout set stays flat, or gets worse, you are looking at eval gaming.

---

The golden test set is the single most important defence against eval gaming, and the one most teams skip because "we don't have logged user data yet." Start with 50 prompts your team writes by hand based on real conversations with users. It is not perfect, but it is better than relying on public benchmarks that the model may have already memorised.

## What this means

Eval gaming is a measurement problem before it is a model problem. Goodhart’s Law (“When a measure becomes a target, it ceases to be a good measure”) applies directly. If a benchmark is the score that drives release decisions, the model will be optimised (by its training pipeline, by prompt tuning, by eval-set leakage) toward that benchmark, not toward the task the benchmark was meant to approximate [5].

The mechanism is usually one of three:

- **Benchmark contamination.** Training data includes examples that overlap with the benchmark’s test set. The model has effectively seen the answers before. Multiple studies have reported overlap between the training data of models such as GPT-3 and GPT-4 and widely used benchmarks, including MMLU, HumanEval, and GSM8K, raising the question of whether scores reflect capability or memorisation [4][2].
- **Overfitting during fine-tuning.** A team tunes on a narrow benchmark distribution until scores plateau. The model memorises surface patterns of that distribution without learning the underlying skill. Put the same model on slightly different phrasing or a different domain, and performance collapses.
- **Prompt engineering to the eval.** Prompts are iterated against known eval questions until the answer looks right, then considered “production ready.” The prompt looks great on the 50 eval questions but fails on real user requests that fall outside the eval distribution.
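
One common way to look for the first mechanism is to search for shared word n-grams between benchmark items and a sample of training text. The sketch below is a minimal illustration only: the function names, the whitespace tokenisation, and the window size `n=8` are assumptions made for this example, not a method taken from the cited sources.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of lowercase word n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(benchmark_items: list[str],
                       training_docs: list[str],
                       n: int = 8) -> list[int]:
    """Indices of benchmark items sharing at least one n-gram with training text.

    `n` is a hypothetical window size; shorter windows flag more items.
    """
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, item in enumerate(benchmark_items)
            if ngrams(item, n) & train_grams]
```

An item flagged this way is a candidate for manual review, not proof of memorisation; paraphrased leaks will not share exact n-grams and will slip through.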

---

Prompt engineering to the eval is the hardest form of gaming to detect because it looks like good engineering: iterating until scores improve. The tell is when scores improve on your eval set but user satisfaction metrics stay flat. If you are not tracking user satisfaction alongside eval scores, you will miss this entirely.

## Where teams get it wrong

### Mistake 1: Treating benchmark improvement as proof of product improvement

A team chooses a new model because it scores 5 points higher on MMLU than the current model. They deploy it. User satisfaction drops. Complaints about irrelevant answers increase.

What happened: the new model had been trained or fine-tuned on data that overlapped with MMLU’s test set, inflating its benchmark score. On real user queries, which the training data did not cover, it performed the same as or worse than the previous model. The team treated a 5-point benchmark gain as a signal of general improvement when it was actually a signal of benchmark familiarity [1][2].

Consequence: Deployed a worse model. Wasted engineering time. Lost user trust. The fix is to maintain a holdout test set of real user conversations and run it before any model switch.

### Mistake 2: Reusing the training set as the test set in disguise

A team splits their customer support dataset into 80% training and 20% test, tunes a model, and reports 94% accuracy on the test set. Great results. Then the same model, deployed against live customer queries in production, scores only 67% in a human review.

The hidden problem: the “test set” was drawn from the same distribution as the training data: same customers, same query types, same time period. The model learned to generalise within that distribution but not outside it. Real customers ask questions the team has not seen before, using phrasing not present in any historical dataset. The model had never been tested on genuinely out-of-distribution queries [3][1].

Consequence: Overconfident launch. Customer-facing failures that erode trust. The fix is a held-out test set collected from a different time period, different query types, or a different customer segment than anything in the training data.
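
A time-based split is one way to apply that fix. The sketch below assumes each record is a dict with a hypothetical `received` date field; the field name, the function name, and the cutoff are illustrative, not part of any dataset described here.

```python
from datetime import date


def split_by_period(records: list[dict], cutoff: date) -> tuple[list[dict], list[dict]]:
    """Train on records received before `cutoff`; test on records from `cutoff` onward."""
    train = [r for r in records if r["received"] < cutoff]
    test = [r for r in records if r["received"] >= cutoff]
    return train, test


# Hypothetical support queries, used only to show the call.
records = [
    {"received": date(2026, 1, 14), "query": "How do I reset my password?"},
    {"received": date(2026, 2, 3), "query": "Where is my invoice?"},
    {"received": date(2026, 4, 20), "query": "Why was I charged twice after the plan change?"},
]
train, test = split_by_period(records, cutoff=date(2026, 4, 1))
```

Unlike a random 80/20 split, the test records here come from a period the model never saw during tuning, which is closer to what production traffic will look like. Splitting by customer segment follows the same pattern with a different filter.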

### Mistake 3: Ignoring real-user examples that are never part of the benchmark

A team relies entirely on public benchmarks (MMLU, HumanEval, GSM8K) and never builds a task-specific evaluation. The model scores well enough on the leaderboard to justify a purchase. But the use case is niche: medical record summarisation, legal contract review, or internal code review for a proprietary codebase. The benchmarks are generic. The team discovers six weeks into deployment that the model hallucinates frequently on their specific data.

Consequence: A paid-for model that does not work for the actual task. Six weeks of sunk cost. The fix is a domain-specific golden test set built before the vendor evaluation starts, using real anonymised examples from the target workflow [3][4].

## Practical decision check

Before trusting a benchmark result, ask:

- Does the benchmark reflect real user tasks, or only academic NLP tasks? A model that tops MMLU may fail at your specific workflow.
- Do you have a holdout test set the model has never seen, and that comes from a different distribution than the training data?
- Do production failures cluster around cases the benchmark does not cover? Log every failed answer for the first two weeks of any new model deployment.
- Can you run the benchmark yourself with your own inputs, not just the standard questions?
- What is the benchmark’s contamination status? Has the provider published a contamination analysis? Check model cards and evaluation reports [4][2].

## How to build an eval-gaming check into your workflow

### 1. Build a golden test set

Collect 50–100 real user prompts with ground-truth answers. Include edge cases, unusual phrasing, and queries the model has historically got wrong. Store this separately from any training data. Never use it for fine-tuning, prompt iteration, or hyperparameter search. This is your truth set.
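
Keeping the golden set out of tuning data is easier to enforce with an automated check. The following is a minimal sketch, assuming prompts are plain strings; the function names and the normalisation rule (lowercase, collapsed whitespace) are assumptions for illustration.

```python
import hashlib


def fingerprint(prompt: str) -> str:
    """Hash a prompt after lowercasing and collapsing whitespace."""
    normalised = " ".join(prompt.lower().split())
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def leaked_prompts(golden_prompts: list[str], tuning_prompts: list[str]) -> list[str]:
    """Return tuning prompts that also appear in the golden set."""
    golden = {fingerprint(p) for p in golden_prompts}
    return [p for p in tuning_prompts if fingerprint(p) in golden]
```

Running `leaked_prompts` before each fine-tuning or prompt-iteration round catches exact and near-exact copies. It will not catch paraphrases, so it complements rather than replaces keeping the set in a separate store.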

### 2. Run the golden set before every release decision

Before promoting a new model version, a prompt change, or a fine-tuning run, evaluate on both the public benchmark and your golden set. Compare the deltas. A positive benchmark delta with a flat or negative golden-set delta is a strong signal of eval gaming.
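
The delta comparison can be sketched as follows. The function name, the score scale (fractions between 0 and 1), and the `tolerance` default are assumptions chosen for illustration; pick a tolerance that reflects the noise in your own golden set.

```python
def gaming_signal(bench_before: float, bench_after: float,
                  golden_before: float, golden_after: float,
                  tolerance: float = 0.01) -> dict:
    """Flag a benchmark gain that is not matched by a golden-set gain.

    `tolerance` is a hypothetical noise threshold: changes at or below it
    are treated as flat.
    """
    bench_delta = bench_after - bench_before
    golden_delta = golden_after - golden_before
    suspicious = bench_delta > tolerance and golden_delta <= tolerance
    return {
        "benchmark_delta": bench_delta,
        "golden_delta": golden_delta,
        "suspicious": suspicious,
    }


# Benchmark up 5 points, golden set down 1 point: flagged.
report = gaming_signal(0.70, 0.75, 0.62, 0.61)
```

With only 50 to 100 golden prompts, a single answer moves the score by 1 to 2 points, so a flag here is a prompt to inspect individual failures rather than a verdict.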

### 3. Check for benchmark contamination

Review model cards and technical reports for contamination analysis. If the provider does not publish a contamination analysis for the benchmarks you care about, treat high scores with caution. Run a simple test: take 10 benchmark questions, rephrase them in different words, and see if the score drops significantly [2][4].
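
The rephrasing test reduces to comparing accuracy on the original and reworded questions. This sketch assumes you have already collected the model's answers as strings and that exact match (ignoring case and surrounding whitespace) is an acceptable scoring rule; both are simplifications, and the function names are hypothetical.

```python
def accuracy(answers: list[str], expected: list[str]) -> float:
    """Share of answers that match the expected answer, ignoring case and padding."""
    if not expected or len(answers) != len(expected):
        raise ValueError("answers and expected must be non-empty and the same length")
    correct = sum(a.strip().lower() == e.strip().lower()
                  for a, e in zip(answers, expected))
    return correct / len(expected)


def paraphrase_drop(original_answers: list[str],
                    rephrased_answers: list[str],
                    expected: list[str]) -> float:
    """Accuracy on original wording minus accuracy on rephrased wording."""
    return accuracy(original_answers, expected) - accuracy(rephrased_answers, expected)
```

A large positive drop suggests the model is keying on the benchmark's exact wording. With only 10 questions the estimate is coarse, so treat it as a reason to look closer rather than as a measurement.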

### 4. Monitor production failures against benchmark clusters

Log every answer flagged as incorrect or irrelevant in the first two weeks of any new model deployment. Categorise them: do they fall into benchmark-covered areas or benchmark-blind spots? If failures are concentrated in blind spots, your benchmark suite needs expansion.
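
Once failures carry a topic label, the categorisation is a simple count. The sketch below assumes each logged failure is a dict with a hypothetical `topic` field and that you maintain a set of topics your benchmark suite covers; how the labels are assigned (by hand or by a classifier) is left open.

```python
from collections import Counter


def blind_spot_share(failures: list[dict], benchmark_topics: set[str]) -> float:
    """Fraction of logged failures whose topic is not covered by any benchmark."""
    counts = Counter(
        "covered" if f["topic"] in benchmark_topics else "blind_spot"
        for f in failures
    )
    total = sum(counts.values())
    return counts["blind_spot"] / total if total else 0.0


# Hypothetical log entries and coverage set, used only to show the call.
failures = [
    {"topic": "arithmetic"},
    {"topic": "contract clauses"},
    {"topic": "contract clauses"},
    {"topic": "internal API usage"},
]
share = blind_spot_share(failures, benchmark_topics={"arithmetic", "code generation"})
```

A high share means the benchmark suite is not measuring where the product actually fails, which is the cue to add those cases to the golden set.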

### 5. Read model cards for gaming risks

Every model card should tell you: what data was used for training, what benchmarks were used for evaluation, whether the evaluation data overlaps with training data, and what known failure modes exist. If any of these are missing, consider that a risk signal [4].
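
If you review many model cards, the four questions above can be tracked as a checklist. The sketch assumes you have transcribed each card into a dict by hand; the key names are hypothetical and do not correspond to any standard model-card schema.

```python
# Hypothetical keys, one per question a model card should answer.
REQUIRED_FIELDS = (
    "training_data",
    "evaluation_benchmarks",
    "train_eval_overlap",
    "known_failure_modes",
)


def missing_fields(card: dict) -> list[str]:
    """Return the required fields that are absent or empty in `card`."""
    return [field for field in REQUIRED_FIELDS if not card.get(field)]


card = {
    "training_data": "Public web text and licensed corpora",
    "evaluation_benchmarks": ["MMLU", "GSM8K"],
    "train_eval_overlap": "",
}
gaps = missing_fields(card)
```

Here `gaps` lists the overlap analysis and the failure modes, which are the two items to raise with the provider before relying on the reported scores.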

## What would change the advice on this page

The guidance above assumes that benchmark contamination is common, that model providers rarely publish thorough contamination checks, and that golden test sets remain the most reliable defence. Each of these assumptions could shift:

- **If contamination detection becomes standardised.** If the industry adopts a mandatory contamination-reporting framework with third-party verification, the “check yourself” guidance becomes less urgent. The advice would shift to: verify the report, then trust the verified scores.
- **If benchmarks adopt dynamic holdout sets.** If MMLU, HumanEval, and other major benchmarks begin rotating their test sets or using holdout questions that providers cannot train on, the contamination risk drops significantly. The advice would evolve from “assume contamination” to “spot-check for residual leakage.”
- **If model providers publish training-data overlap reports.** If every provider ships a reproducible overlap analysis for their training data against every major benchmark, the burden shifts from the evaluator to the vendor. The advice would become: read the overlap report; if none exists, assume contamination.

## Caveats and scope boundaries

- This page cannot tell you whether a specific leaderboard result is fraudulent. It can only help you ask whether the benchmark is teaching the model how to pass the test rather than how to do the work.
- Benchmarks will always be incomplete proxies for real tasks. Real traffic differs from curated test data; golden sets must be kept current and refreshed.
- Contamination analysis is not yet standardised across providers as of May 2026. Treat high benchmark scores with proportionally higher scrutiny.
- This is operational guidance, not a guarantee that a specific model is or is not gaming its eval.

## Methodology

- Data checked: 2026-05-28
- Sources consulted: LM Evaluation Harness, Stanford HELM, OpenAI Evals documentation, NIST AI RMF, Goodhart’s Law (concept reference)
- Assumptions: The reader evaluates or purchases models based on benchmark scores and needs to distinguish genuine capability improvement from benchmark overfitting
- Limitations: This article provides practical evaluation guidance, not an academic survey of contamination detection methods. It does not cover adversarial eval gaming or deliberate benchmark manipulation
- Jurisdiction: Global. NIST AI RMF (US) and Stanford HELM (US) referenced

## Source list

1. LM Evaluation Harness: https://github.com/EleutherAI/lm-evaluation-harness (accessed 2026-05-28)
2. Stanford HELM: https://crfm.stanford.edu/helm/latest/ (accessed 2026-05-28)
3. OpenAI Evals docs: https://platform.openai.com/docs/guides/evals (accessed 2026-05-28)
4. NIST AI RMF: https://www.nist.gov/itl/ai-risk-management-framework (accessed 2026-05-28)
5. Goodhart’s Law (concept reference)

## Related guides

## Trust Stack

- Last checked: 2026-05-28
- Corrections: Contact us to report errors

## Change log

- 2026-05-28: Full editorial review against 16-gate checklist. Added 3 Editor’s Note asides. Added Methodology, Source list with access dates, Trust Stack, slugified heading IDs, and Caveats section. Fixed frontmatter writtenBy label. Removed internal editorial review reference. Corrected related guide paths to relative format.
- 2026-05-24: First draft built from editorial brief. Revised with concrete examples, expanded failure-mode scenarios, inline citations, practical steps, and evidence-change scenarios.
