Eval gaming: when models optimise for the test rather than the task
When a model does well on a benchmark but disappoints in production, you may be looking at eval gaming. The system learned how to look good on the test, not how to do the job users actually care about.
Benchmarks can be useful and still be gameable. If the test predicts only the test, it is helping you less than you think.
Quick answer
Distinguish between benchmark performance and real-task performance from day one. Build a holdout test set of real user scenarios that is never used for tuning. Run it alongside every benchmark run. If benchmark scores climb but the holdout set stays flat — or gets worse — you are looking at eval gaming. Also watch for suspiciously clean results (near-perfect scores on complex tasks), models that were trained or tuned using examples close to the test set, and any evaluation where the test data overlaps with training data.
Concrete starting points: keep a human-written golden test set of 50–100 real user prompts with ground-truth answers; run that set before every prompt or model change; and compare benchmark deltas against golden-set deltas before any product decision.
What this means
Eval gaming is a measurement problem before it is a model problem. Goodhart’s Law — “When a measure becomes a target, it ceases to be a good measure” — applies directly. If a benchmark is the score that drives release decisions, the model will be optimised (by its training pipeline, by prompt tuning, by eval-set leakage) toward that benchmark, not toward the task the benchmark was meant to approximate [5].
The mechanism is usually one of three:
- Benchmark contamination. Training data includes examples that overlap with the benchmark’s test set. The model has effectively seen the answers before. Contamination analyses of models such as GPT-3 and GPT-4 have reported overlap between training data and the test sets of widely used benchmarks, including MMLU, HumanEval, and GSM8K, raising the question of whether scores reflect capability or memorisation [4][2].
- Overfitting during fine-tuning. A team tunes on a narrow benchmark distribution until scores plateau. The model memorises surface patterns of that distribution without learning the underlying skill. Put the same model on slightly different phrasing or a different domain, and performance collapses.
- Prompt engineering to the eval. Prompts are iterated against known eval questions until the answer looks right, then considered “production ready.” The prompt looks great on the 50 eval questions but fails on real user requests that fall outside the eval distribution.
The better the team understands the real task — the actual user problems the model must solve, not just the proxy questions in a benchmark — the easier it is to design evaluations that are hard to cheat by accident.
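When you control the training or fine-tuning corpus, the contamination mechanism above can be checked directly. The following is a minimal sketch, not a standard tool: it flags benchmark items that share any long word n-gram with the training text. The function names, the crude regex tokenisation, and the 8-word window are all assumptions made for illustration; published contamination analyses use a range of window sizes and matching rules.

```python
import re


def word_ngrams(text: str, n: int = 8) -> set:
    """Lowercase the text, keep alphanumeric words, and return its set of word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def contaminated_items(benchmark_items: list, training_docs: list, n: int = 8) -> list:
    """Return indices of benchmark items that share at least one n-gram with any training doc."""
    train_grams = set()
    for doc in training_docs:
        train_grams |= word_ngrams(doc, n)
    return [
        i for i, item in enumerate(benchmark_items)
        if word_ngrams(item, n) & train_grams
    ]
```

Items shorter than the window produce no n-grams and are never flagged, so a short-answer benchmark would need a smaller `n`. Exact matching also misses paraphrased leaks, so an empty result is weak evidence, not a clean bill of health.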
Where teams get it wrong
Mistake 1: Treating benchmark improvement as proof of product improvement
A team chooses a new model because it scores 5 points higher on MMLU than the current model. They deploy it. User satisfaction drops. Complaints about irrelevant answers increase.
What happened: the new model had been trained or fine-tuned on data that overlapped with MMLU’s test set, inflating its benchmark score. On real user queries, which the training data did not cover, it performed the same as or worse than the previous model. The team treated a 5-point benchmark gain as a signal of general improvement when it was actually a signal of benchmark familiarity [1][2].
Consequence: Deployed a worse model. Wasted engineering time. Lost user trust. The fix is to maintain a private holdout test set of real user conversations and to run every candidate model against it before any model switch.
Mistake 2: Reusing the training set as the test set in disguise
A team splits their customer support dataset into 80% training and 20% test, tunes a model, and reports 94% accuracy on the test set. Great results. Then the same model, deployed against live customer queries in production, scores only 67% in a human review.
The hidden problem: the “test set” was drawn from the same distribution as the training data — same customers, same query types, same time period. The model learned to generalise within that distribution but not outside it. Real customers ask questions the team has not seen before, using phrasing not present in any historical dataset. The model had never been tested on genuinely out-of-distribution queries [3][1].
Consequence: Overconfident launch. Customer-facing failures that erode trust. The fix is a held-out test set collected from a different time period, different query types, or a different customer segment than anything in the training data.
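One way to build the held-out set described in this fix is to split by time and customer instead of at random. The sketch below is illustrative only: the record fields `created` and `customer_id` are hypothetical names, and the rule (test on later records from customers absent from training) is one possible choice among several.

```python
from datetime import date


def out_of_distribution_split(records: list, cutoff: date) -> tuple:
    """Train on records before `cutoff`; test only on later records from unseen customers."""
    train = [r for r in records if r["created"] < cutoff]
    seen_customers = {r["customer_id"] for r in train}
    test = [
        r for r in records
        if r["created"] >= cutoff and r["customer_id"] not in seen_customers
    ]
    return train, test
```

Compared with a random 80/20 split, this leaves fewer test examples, but the ones that remain come from a later period and from customers the model never saw during training.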
Mistake 3: Ignoring real-user examples that are never part of the benchmark
A team relies entirely on public benchmarks (MMLU, HumanEval, GSM8K) and never builds a task-specific evaluation. The model scores well enough on the leaderboard to justify a purchase. But the use case is niche — medical record summarisation, legal contract review, or internal code review for a proprietary codebase. The benchmarks are generic. The team discovers six weeks into deployment that the model hallucinates frequently on their specific data.
Consequence: A paid-for model that does not work for the actual task. Six weeks of sunk cost. The fix is a domain-specific golden test set built before the vendor evaluation starts, using real anonymised examples from the target workflow [3][4].
Practical decision check
Before trusting a benchmark result, ask:
- Does the benchmark reflect real user tasks, or only academic NLP tasks? A model that tops MMLU may fail at your specific workflow.
- Do you have a holdout test set the model has never seen — and that comes from a different distribution than the training data?
- Do production failures cluster around cases the benchmark does not cover? Log every failed answer for the first two weeks of any new model deployment.
- Can you run the benchmark yourself with your own inputs, not just the standard questions?
- What is the benchmark’s contamination status? Has the provider published a contamination analysis? Check model cards and evaluation reports [4][2].
How to build an eval-gaming check into your workflow
The concept is only useful if you can act on it.
1. Build a golden test set (30 minutes, then ongoing)
Collect 50–100 real user prompts with ground-truth answers. Include edge cases, unusual phrasing, and queries the model has historically got wrong. Store this separately from any training data. Never use it for fine-tuning, prompt iteration, or hyperparameter search. This is your truth set.
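A golden set can be as simple as a list of prompt and answer pairs plus a scoring function. The sketch below is a minimal illustration under stated assumptions: `GoldenCase`, `normalise`, and `score_golden_set` are hypothetical names, `model_fn` stands in for whatever call produces your model's answer, and normalised exact match is only suitable for short factual answers. Free-form tasks usually need a rubric or human review instead.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    prompt: str
    expected: str
    tags: tuple = ()  # e.g. ("edge_case",) or ("historical_failure",)


def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are not errors."""
    return " ".join(text.lower().split())


def score_golden_set(cases: list, model_fn: Callable[[str], str]) -> float:
    """Return the fraction of golden cases where the model output matches the expected answer."""
    if not cases:
        raise ValueError("golden set is empty")
    hits = sum(normalise(model_fn(c.prompt)) == normalise(c.expected) for c in cases)
    return hits / len(cases)
```

Keeping the cases in a frozen dataclass is a small guard against the set being edited in place during prompt iteration; it does not replace storing the file separately from training data.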
2. Run the golden set before every release decision
Before promoting a new model version, a prompt change, or a fine-tuning run, evaluate on both the public benchmark and your golden set. Compare the deltas. A positive benchmark delta with a flat or negative golden-set delta is a strong sign of eval gaming.
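The delta comparison in this step can be reduced to a small gate in a release script. This is a sketch under assumptions: scores are fractions between 0 and 1, the function name and labels are hypothetical, and the `min_gain` threshold of 0.01 is an arbitrary placeholder that should reflect the noise in your own measurements.

```python
def compare_deltas(
    bench_before: float,
    bench_after: float,
    golden_before: float,
    golden_after: float,
    min_gain: float = 0.01,
) -> str:
    """Label a release candidate by whether benchmark and golden-set gains agree."""
    bench_delta = bench_after - bench_before
    golden_delta = golden_after - golden_before
    if bench_delta < min_gain:
        return "no_benchmark_gain"
    if golden_delta < min_gain:
        # Benchmark improved, golden set flat or worse: the pattern described above.
        return "possible_eval_gaming"
    return "gains_agree"
```

For example, `compare_deltas(0.70, 0.75, 0.62, 0.61)` returns `"possible_eval_gaming"`, because the benchmark rose by five points while the golden set fell by one.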
3. Check for benchmark contamination
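Review model cards and technical reports for contamination analysis. If the provider does not publish a contamination analysis for the benchmarks you care about, treat high scores with caution. Run a simple test: take 10 benchmark questions, rephrase them in different words, and compare the model’s score on the rephrased versions with its score on the originals. A marked drop suggests the model has memorised the original wording; with only 10 questions, treat it as a reason to investigate further, not as proof [2][4].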
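Once each question has been graded in its original and rephrased form, the rephrasing test comes down to simple arithmetic. The sketch below assumes you already have one pair of booleans per question (original correct, rephrased correct); the function name and the returned keys are hypothetical.

```python
def rephrase_drop(results: list) -> dict:
    """Summarise accuracy on original vs rephrased questions.

    `results` holds one (original_correct, rephrased_correct) bool pair per question.
    """
    if not results:
        raise ValueError("no results to compare")
    n = len(results)
    original_acc = sum(1 for o, _ in results if o) / n
    rephrased_acc = sum(1 for _, r in results if r) / n
    # Questions answered correctly only in their original wording.
    lost = sum(1 for o, r in results if o and not r)
    return {
        "original_acc": original_acc,
        "rephrased_acc": rephrased_acc,
        "drop": original_acc - rephrased_acc,
        "lost_on_rephrase": lost,
    }
```

The `lost_on_rephrase` count is often more informative than the raw drop on a small sample, because it points at the specific questions worth inspecting by hand.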
4. Monitor production failures against benchmark clusters
Log every answer flagged as incorrect or irrelevant in the first two weeks of any new model deployment. Categorise them: do they fall into benchmark-covered areas or benchmark-blind spots? If failures are concentrated in blind spots, your benchmark suite needs expansion.
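If each logged failure carries a topic label, the categorisation in this step is a short counting exercise. The sketch assumes a hypothetical log format (a list of dicts with a `topic` key) and a hand-maintained set of topics your benchmark suite covers; neither comes from a specific logging tool.

```python
from collections import Counter


def failure_breakdown(failures: list, benchmark_topics: set) -> dict:
    """Count logged failures inside and outside the topics the benchmark suite covers."""
    counts = Counter(
        "covered" if f["topic"] in benchmark_topics else "blind_spot"
        for f in failures
    )
    total = len(failures)
    share = counts["blind_spot"] / total if total else 0.0
    return {
        "covered": counts["covered"],
        "blind_spot": counts["blind_spot"],
        "blind_spot_share": share,
    }
```

A high `blind_spot_share` points at gaps in the benchmark suite; a low one suggests the model is failing on tasks the benchmarks claim to measure, which is a different problem.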
5. Read model cards for gaming risks
Every model card should tell you: what data was used for training, what benchmarks were used for evaluation, whether the evaluation data overlaps with training data, and what known failure modes exist. If any of these are missing, consider that a risk signal [4].
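The four items above can be turned into a checklist that runs over a hand-transcribed summary of the model card. This is a sketch only: model cards have no single machine-readable schema, so the field names below are hypothetical labels you would fill in yourself after reading the card.

```python
REQUIRED_FIELDS = (
    "training_data",
    "evaluation_benchmarks",
    "train_eval_overlap",
    "known_failure_modes",
)


def missing_model_card_fields(card: dict) -> list:
    """Return the required fields that are absent or empty in a model card summary."""
    return [name for name in REQUIRED_FIELDS if not card.get(name)]
```

An empty string or empty list counts as missing, so a card that mentions failure modes without listing any is flagged the same way as one that omits the section.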
What would change the advice on this page
The guidance above assumes that benchmark contamination is common, that model providers rarely publish thorough contamination checks, and that golden test sets remain the most reliable defence. Each of these assumptions could shift:
- If contamination detection becomes standardised. If the industry adopts a mandatory contamination-reporting framework with third-party verification, the “check yourself” guidance becomes less urgent. The advice would shift to: verify the report, then trust the verified scores.
- If benchmarks adopt dynamic holdout sets. If MMLU, HumanEval, and other major benchmarks begin rotating their test sets or using holdout questions that providers cannot train on, the contamination risk drops significantly. The advice would evolve from “assume contamination” to “spot-check for residual leakage.”
- If model providers publish training-data overlap reports. If every provider ships a reproducible overlap analysis for their training data against every major benchmark, the burden shifts from the evaluator to the vendor. The advice would become: read the overlap report; if none exists, assume contamination.
What this page cannot tell you
This page cannot tell you whether a specific leaderboard result is fraudulent. It can only help you ask whether the benchmark is teaching the model how to pass the test rather than how to do the work.
This is measurement guidance, not a guarantee against manipulation. A determined team can game almost any eval. The defence is to make gaming expensive enough that honest measurement is cheaper.
Methodology and sources
Check date: 2026-05-24
What was checked: benchmarking, evaluation, holdout-set documentation, contamination studies, and model-card reporting practices.
What the sources were used for:
- [1] LM Evaluation Harness — testing methodology and benchmark-overfitting patterns
- [2] Stanford HELM — holistic evaluation framework and contamination analysis
- [3] OpenAI Evals docs — evaluation design and holdout-set guidance
- [4] NIST AI RMF — risk-management framework covering measurement validity
- [5] Goodhart’s Law — theoretical foundation for measure-target degradation
Assumptions and limits:
- Benchmarks will always be incomplete proxies for real tasks.
- Real traffic differs from curated test data; golden sets must be refreshed.
- Contamination analysis is not yet standardised across providers.
- This is operational guidance, not a guarantee that a specific model is or is not gaming its eval.
Change log
- 2026-05-24: first draft built from the llm-editor-approved brief.
- 2026-05-24: revised after editorial review LLM-0037. Added concrete examples (Goodhart’s Law, benchmark contamination case, distribution-mismatch scenario, domain-specific niche failure), expanded each “Where teams get it wrong” into a worked scenario with consequences and fixes, removed Editor’s Notes into finished copy, added inline citations [1]–[5], replaced how-to section with practical steps, added “What would change this advice” with three evidence-change scenarios, and fixed related-guide links to production routes.
Source list
- [1] LM Evaluation Harness — https://github.com/EleutherAI/lm-evaluation-harness
- [2] Stanford HELM — https://crfm.stanford.edu/helm/latest/
- [3] OpenAI Evals docs — https://platform.openai.com/docs/guides/evals
- [4] NIST AI RMF — https://www.nist.gov/itl/ai-risk-management-framework
- [5] Goodhart’s Law — concept reference; see e.g., Marilyn Strathern, “Improving ratings: audit in the British University system” (1997)