Contamination and leakage: why benchmark scores can be too good

A model scores 95% on a coding benchmark. The press release celebrates a breakthrough. Six months later, researchers discover that 30% of the benchmark questions appeared verbatim in the model’s training data. The score was real — but it was memorisation, not reasoning.

This is contamination: evaluation data leaks into training data and inflates scores by an unknown amount. It is one of the most widespread and least discussed problems in LLM evaluation.

TL;DR

Contamination means you cannot trust any benchmark score without evidence that the evaluation data was excluded from training. Always check model cards for decontamination procedures, look for score saturation (when a benchmark shows minimal variance between models), and treat any benchmark score as an upper bound — the true capability score is probably lower.

What this means

How contamination happens: The internet is the training data for most large models. Benchmark questions, answers and discussions are on the internet. If a model trains on a dataset that includes a blog post analysing HumanEval question 42, the model has effectively seen the answer before being tested on it. Common sources of contamination include:

Benchmark questions published on GitHub or arXiv before the training data cutoff.
Benchmark solutions, discussions or analyses posted on forums, blogs and social media.
Previous model outputs that include benchmark questions, fed back into training data for the next generation.
Shared evaluation harness code that includes the test questions as part of the framework.

Why it is hard to detect: Models do not tell you they have seen a benchmark question before. There is no “I remember this” flag. Detection requires systematic decontamination — comparing training data against benchmark data, removing overlaps and re-running — which most model providers do incompletely or not at all. The few that publish decontamination results (Anthropic, some of the EleutherAI models) are the exception, not the rule.

Signs of contamination:

Score saturation: If every model scores above 85% on a benchmark, the benchmark may be too easy, or so contaminated that memorisation drives most of the score. MMLU is approaching this state.
Score compression: If the gap between the best and worst models on a benchmark is small (a few percentage points) despite obvious real-world performance differences, the benchmark is probably not measuring what it claims to measure — or contamination has narrowed the range artificially.
Performance on novel variants: A model that scores high on standard GSM8K but collapses on a structurally identical but numerically different variant likely memorised the original questions rather than learning the reasoning skill.
Disproportionate improvement: A model that improves dramatically on one benchmark without corresponding improvements on related benchmarks may have been contaminated on that specific dataset — especially if the benchmark questions were published before the training cutoff.
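
The first three signs can be checked mechanically once you have per-model scores and per-item results. The sketch below is illustrative only: the function names, the 5-point compression gap and the sample numbers are assumptions, and only the 85% saturation figure comes from the text above.

```python
from statistics import mean

def saturation_and_compression(scores, saturation_floor=0.85, compression_gap=0.05):
    """scores: mapping of model name -> accuracy in [0, 1] on one benchmark.

    saturation_floor mirrors the 85% figure above; compression_gap is an
    assumed stand-in for "a few percentage points".
    """
    values = list(scores.values())
    spread = max(values) - min(values)
    return {
        "saturated": min(values) > saturation_floor,  # every model above the floor
        "compressed": spread <= compression_gap,      # best-to-worst gap is small
        "spread": spread,
    }

def variant_drop(original_correct, variant_correct):
    """Accuracy lost when moving from original items to perturbed variants.

    Each argument is a list of 0/1 outcomes, one per item.
    """
    return mean(original_correct) - mean(variant_correct)

# Hypothetical numbers, not real benchmark results.
scores = {"model_a": 0.91, "model_b": 0.89, "model_c": 0.88}
report = saturation_and_compression(scores)
drop = variant_drop([1, 1, 1, 1, 0], [1, 0, 0, 1, 0])
print(report, drop)
```

A large drop on the perturbed variants is a reason to investigate rather than proof of contamination, because a variant can also be harder by accident.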

Where teams misuse it

Citing scores without checking decontamination. A team chooses a model based on its MMLU score of 90%+ without checking whether MMLU questions were excluded from training. The vendor’s model card either does not mention decontamination or describes a partial process that removes only exact matches. The true capability may be meaningfully lower.

Contaminating your own evaluation. A team uses a public benchmark dataset to evaluate their fine-tuned model, but the model was fine-tuned on a dataset that included — directly or through web-crawled data — examples from the same benchmark. The fine-tuning process inadvertently contaminated the eval. The score goes up, but the model has not actually improved: it has memorised.
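
A first-pass check for this failure mode is to fingerprint every benchmark item and look for the same fingerprints in the fine-tuning set. This is a minimal sketch, assuming both datasets are available as lists of strings. The helper names and the normalisation rules are hypothetical choices, not a standard procedure.

```python
import hashlib
import re

def normalise(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def fingerprint(text):
    """Stable hash of the normalised text."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()

def find_exact_leaks(finetune_examples, benchmark_items):
    """Return fine-tuning examples whose normalised text equals a benchmark item."""
    benchmark_hashes = {fingerprint(item) for item in benchmark_items}
    return [ex for ex in finetune_examples if fingerprint(ex) in benchmark_hashes]

# Made-up data for illustration.
benchmark = ["What is 17 multiplied by 3?"]
finetune = ["what is 17 multiplied by 3", "Explain how a hash table works."]
leaks = find_exact_leaks(finetune, benchmark)
print(leaks)
```

This only catches whole-example matches after normalisation. A benchmark question quoted inside a longer web page, or lightly reworded, passes straight through, so an empty result does not show the fine-tuning set is clean.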

Treating decontamination as binary. A vendor says “we decontaminated our eval data.” That could mean they removed exact matches only (common), exact and near-duplicate matches (less common), or anything flagged by a systematic n-gram overlap check with a strict threshold (rare). Without knowing the method, “decontaminated” means almost nothing.
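
To see why the method matters, here is a minimal sketch of an n-gram overlap check. The window size of 8 words, the 0.5 threshold and the function names are assumptions chosen for illustration. Real pipelines differ in tokenisation, window size and threshold.

```python
def word_ngrams(text, n):
    """Set of all n-word windows in the text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(benchmark_item, training_doc, n=8):
    """Fraction of the benchmark item's n-grams that also occur in the training doc."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & word_ngrams(training_doc, n)) / len(item_grams)

def is_contaminated(benchmark_item, training_doc, n=8, threshold=0.5):
    """Flag the pair when the overlap fraction reaches the threshold."""
    return ngram_overlap(benchmark_item, training_doc, n) >= threshold

# Made-up example: the forum post changes one number in the question.
item = "a train leaves the station at nine and travels north for three hours"
doc = "forum post: a train leaves the station at nine and travels north for two hours today"
overlap = ngram_overlap(item, doc)
flagged = is_contaminated(item, doc)
print(overlap, flagged)
```

An exact-match filter would keep this training document because the strings differ, while the overlap check flags it. Lowering the threshold or shrinking the window catches more paraphrases at the cost of more false positives, which is why the method and threshold need to be stated alongside any decontamination claim.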

Using leaderboards that reward self-reported scores. A leaderboard accepts self-reported scores without verifying decontamination, evaluation methodology or even that the reported number came from the claimed model. The leaderboard becomes a marketing tool, not an evaluation tool.

Practical decision check

Before trusting a benchmark score:

Did the model provider publish their decontamination procedure, or is it unstated?
What decontamination method was used — exact match only, or near-duplicate/n-gram overlap removal too?
Is the benchmark widely saturated (most models above 85%)? If so, the score tells you less than it used to.
Has the model been tested on a held-out evaluation set that was created after the training data cutoff?
Does the model’s performance correlate with real-world behaviour you can observe directly?
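
The held-out question in the checklist can be made concrete by filtering evaluation items on their creation date. This is a sketch under stated assumptions: the `created` field, the item layout and the cutoff date are hypothetical, and in practice the cutoff is whatever the provider reports.

```python
from datetime import date

def post_cutoff_items(items, training_cutoff):
    """Keep only items created strictly after the model's training data cutoff."""
    return [item for item in items if item["created"] > training_cutoff]

# Made-up items and cutoff for illustration.
items = [
    {"id": "q1", "created": date(2023, 3, 1)},
    {"id": "q2", "created": date(2025, 1, 15)},
]
held_out = post_cutoff_items(items, training_cutoff=date(2024, 6, 30))
print([item["id"] for item in held_out])
```

A late creation date lowers the risk but does not remove it: an item written after the cutoff can still be adapted from older material that the model saw.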

Methodology

Date checked: 2026-05-25
Sources consulted: Academic literature on benchmark contamination (Sainz et al. 2023, Carlini et al. 2022), model card decontamination procedures (Anthropic, GPT-3 paper), and published contamination detection frameworks.
Assumptions: Decontamination methodology varies significantly across providers. Exact-match removal is common; near-duplicate and n-gram overlap removal is rarer. Contamination is probabilistic, not binary.
Limitations: This guide covers contamination in public benchmarks. It does not cover contamination in proprietary evaluation datasets or in fine-tuning datasets derived from model outputs. Decontamination is never perfect.
Jurisdiction: Global. Contamination is a methodological concern, not a jurisdiction-specific issue, though regulatory frameworks (EU AI Act, NIST AI RMF) increasingly expect documentation of evaluation validity.

Source list

Brown et al., “Language Models are Few-Shot Learners,” NeurIPS 2020 — https://arxiv.org/abs/2005.14165 (accessed 2026-05-25)
Anthropic model cards — https://docs.anthropic.com/en/docs/about-claude/model-card (accessed 2026-05-25)
Sainz et al., “NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for Each Benchmark,” EMNLP 2023 — https://aclanthology.org/2023.emnlp-main.690/ (accessed 2026-05-25)
Carlini et al., “Quantifying Memorization Across Neural Language Models,” arXiv:2202.07646, 2022 — https://arxiv.org/abs/2202.07646 (accessed 2026-05-25)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-29: added missing slugs to TL;DR, Source list, Trust Stack, Change log, and Related guides H2 headings (G8 fix)
2026-05-28: full editorial review (added Editor’s Notes, Trust Stack, slugified heading IDs, formal Methodology and Source list sections; renamed “Related reading” to “Related guides”; fixed writtenBy label (editor))
2026-05-25: first published