theLLMs

Last checked: 2026-05-25

Scope: Global. Contamination literature and model-card decontamination practices checked 2026-05-25.

AI draft model: deepseek-v4-flash

AI review model: llm-editor (deepseek-v4-pro)

Contamination and leakage: why benchmark scores can be too good

A model scores 95% on a coding benchmark. The press release celebrates a breakthrough. Six months later, researchers discover that 30% of the benchmark questions appeared verbatim in the model’s training data. The score was real — but it was memorisation, not reasoning.

This is contamination: evaluation data leaks into training data and inflates scores by an unknown amount. It is one of the most widespread and least discussed problems in LLM evaluation.

Quick answer

Contamination means you cannot trust a benchmark score without evidence that the evaluation data was excluded from training. Check model cards for decontamination procedures, look for score saturation (most models scoring near the ceiling) and score compression (very little spread between models), and treat any reported score as an upper bound: the true capability is probably lower.

What this means

How contamination happens: The internet is the training data for most large models. Benchmark questions, answers and discussions are on the internet. If a model trains on a dataset that includes a blog post analysing HumanEval question 42, the model has effectively seen the answer before being tested on it. Common sources of contamination include:

  • Benchmark questions published on GitHub or arXiv before the training data cutoff.
  • Benchmark solutions, discussions or analyses posted on forums, blogs and social media.
  • Previous model outputs that include benchmark questions, fed back into training data for the next generation.
  • Shared evaluation harness code that includes the test questions as part of the framework.

Why it is hard to detect: Models do not tell you they have seen a benchmark question before. There is no “I remember this” flag. Detection requires systematic decontamination — comparing training data against benchmark data, removing overlaps and re-running — which most model providers do incompletely or not at all. The few that publish decontamination results (Anthropic, some of the EleutherAI models) are the exception, not the rule.
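The comparison step described above can be sketched as a word n-gram overlap check. This is a minimal illustration, not any provider's actual pipeline: the function names, the default n of 8 and the example texts are all assumptions made for this sketch, and real pipelines run against indexed corpora rather than single strings.

```python
import re


def word_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase the text, drop punctuation, and return its set of word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in the training document.

    n = 8 is an illustrative default, not a recommended threshold.
    """
    bench = word_ngrams(benchmark_item, n)
    if not bench:
        return 0.0  # item is shorter than n words, so there is nothing to compare
    train = word_ngrams(training_doc, n)
    return len(bench & train) / len(bench)


# Hypothetical example data (not taken from any real benchmark).
question = ("A farmer has 12 crates with 30 apples in each crate. "
            "How many apples does the farmer have in total?")
blog_post = ("Walkthrough: A farmer has 12 crates with 30 apples in each crate. "
             "How many apples does the farmer have in total? The answer is 360.")
unrelated = "A short note about cooking pasta for a large group of friends on a weekday."

leaked_score = overlap_fraction(question, blog_post)    # question appears verbatim
clean_score = overlap_fraction(question, unrelated)     # no shared 8-grams
```

A high fraction flags the item for removal or review. A low fraction does not prove the item is clean, because paraphrases and translations share few n-grams with the original.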

Signs of contamination:

  • Score saturation: If every model scores above 85% on a benchmark, the benchmark may be too easy or contaminated to the point where memorisation drives most of the score. MMLU is approaching this state.
  • Score compression: If the gap between the best and worst models on a benchmark is small (a few percentage points) despite obvious real-world performance differences, the benchmark is probably not measuring what it claims to measure — or contamination has narrowed the range artificially.
  • Performance on novel variants: A model that scores high on standard GSM8K but collapses on a structurally identical but numerically different variant likely memorised the original questions rather than learning the reasoning skill.
  • Disproportionate improvement: A model that improves dramatically on one benchmark without corresponding improvements on related benchmarks may have been contaminated on that specific dataset — especially if the benchmark questions were published before the training cutoff.
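The first three signs can be turned into quick numeric checks. The sketch below is illustrative only: the 85% floor comes from the text above, while the 5-point compression gap, the function names and the example scores are assumptions chosen for the example.

```python
def score_spread(scores: dict[str, float],
                 saturation_floor: float = 85.0,
                 compression_gap: float = 5.0) -> dict[str, bool]:
    """Flag a benchmark as saturated and/or compressed from per-model scores (in %).

    saturated:  every model scores above the floor.
    compressed: best-minus-worst gap is at most compression_gap points
                (5.0 is an assumed stand-in for "a few percentage points").
    """
    if not scores:
        raise ValueError("scores must contain at least one model")
    values = list(scores.values())
    return {
        "saturated": min(values) > saturation_floor,
        "compressed": max(values) - min(values) <= compression_gap,
    }


def variant_drop(original_accuracy: float, variant_accuracy: float) -> float:
    """Percentage points lost when moving from the original items to perturbed variants."""
    return original_accuracy - variant_accuracy


# Hypothetical scores for three models on one benchmark.
flags = score_spread({"model_a": 91.2, "model_b": 89.5, "model_c": 88.0})

# Hypothetical accuracy on original items versus renumbered variants.
drop = variant_drop(original_accuracy=92.0, variant_accuracy=61.0)
```

Neither flag proves contamination on its own. A saturated benchmark may simply be too easy, and a large variant drop should be checked against how other models behave on the same variants.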

Where teams misuse it

Citing scores without checking decontamination. A team chooses a model based on its MMLU score of 90%+ without checking whether MMLU questions were excluded from training. The vendor’s model card either does not mention decontamination or describes a partial process that removes only exact matches. The true capability may be meaningfully lower.

Contaminating your own evaluation. A team uses a public benchmark dataset to evaluate their fine-tuned model, but the model was fine-tuned on a dataset that included — directly or through web-crawled data — examples from the same benchmark. The fine-tuning process inadvertently contaminated the eval. The score goes up, but the model has not actually improved: it has memorised.
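One low-cost safeguard is to screen the fine-tuning set against the benchmark before training. The sketch below assumes both are available as plain lists of strings and uses hypothetical names throughout. It only removes whole-example matches after normalisation, so an example that embeds a benchmark question inside longer text still gets through.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and stripping punctuation and extra whitespace."""
    normalised = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def drop_benchmark_examples(finetune_examples: list[str],
                            benchmark_items: list[str]) -> list[str]:
    """Return the fine-tuning examples whose fingerprint matches no benchmark item."""
    blocked = {fingerprint(item) for item in benchmark_items}
    return [ex for ex in finetune_examples if fingerprint(ex) not in blocked]


# Hypothetical data for illustration.
benchmark = ["What is 17 times 6?"]
finetune = [
    "what is 17 times 6",              # same question, different casing and punctuation
    "Translate 'hello' into French.",  # unrelated
    "Explain: What is 17 times 6?",    # embeds the question; NOT caught by this check
]

cleaned = drop_benchmark_examples(finetune, benchmark)
```

The third example shows the limit of this approach: catching embedded or paraphrased questions needs an overlap-based or near-duplicate check on top.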

Treating decontamination as binary. A vendor says “we decontaminated our eval data.” That could mean they removed exact matches only (common), exact and near-duplicate matches (less common), or everything flagged by a systematic n-gram overlap check with a strict threshold (rare). Without knowing the method, “decontaminated” means almost nothing.
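To see why the method matters, compare two checks on the same pair of strings. This sketch uses Python's standard difflib for the near-duplicate test. The 0.9 threshold, the function names and the example strings are assumptions for illustration, and character-level similarity is only one of several ways to define a near-duplicate.

```python
from difflib import SequenceMatcher


def is_exact_match(a: str, b: str) -> bool:
    """Exact-match check: identical strings after trimming surrounding whitespace."""
    return a.strip() == b.strip()


def is_near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Near-duplicate check: case-insensitive character similarity at or above threshold.

    0.9 is an assumed threshold for this example, not a standard value.
    """
    ratio = SequenceMatcher(None, a.lower(), b.lower(), autojunk=False).ratio()
    return ratio >= threshold


# Hypothetical benchmark item and a lightly altered copy found in training data.
benchmark_item = "Write a function that returns the sum of all even numbers in a list."
training_copy = "write a function that returns the sum of all even numbers in a list"
unrelated_text = "Describe the history of the printing press."

exact = is_exact_match(benchmark_item, training_copy)
near = is_near_duplicate(benchmark_item, training_copy)
```

Here an exact-match-only process would leave the altered copy in the training data, while the near-duplicate check would remove it. Both vendors could truthfully say they "decontaminated".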

Using leaderboards that reward self-reported scores. A leaderboard accepts self-reported scores without verifying decontamination, evaluation methodology or even that the reported number came from the claimed model. The leaderboard becomes a marketing tool, not an evaluation tool.

Practical decision check

Before trusting a benchmark score:

  1. Did the model provider publish their decontamination procedure, or is it unstated?
  2. What decontamination method was used — exact match only, or near-duplicate/n-gram overlap removal too?
  3. Is the benchmark widely saturated (most models above 85%)? If so, the score tells you less than it used to.
  4. Has the model been tested on a held-out evaluation set that was created after the training data cutoff?
  5. Does the model’s performance correlate with real-world behaviour you can observe directly?
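Question 4 can be applied mechanically if each evaluation item carries a creation date. The sketch below assumes a hypothetical EvalItem record and an assumed cutoff date. A late creation date does not guarantee novelty when a question is adapted from older material, so treat this as a first filter only.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class EvalItem:
    """Hypothetical record: one evaluation question and the date it was written."""
    question: str
    created: date


def held_out_after(items: list[EvalItem], training_cutoff: date) -> list[EvalItem]:
    """Keep only items created strictly after the model's training data cutoff."""
    return [item for item in items if item.created > training_cutoff]


# Hypothetical items and cutoff for illustration.
items = [
    EvalItem("Older question", date(2024, 11, 2)),
    EvalItem("Question written on the cutoff day", date(2025, 6, 30)),
    EvalItem("Newer question", date(2025, 9, 14)),
]

held_out = held_out_after(items, training_cutoff=date(2025, 6, 30))
```

Items dated on the cutoff day itself are excluded here, a conservative choice given that published cutoff dates are often approximate.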

Evidence and caveats

Sources:

Caveats:

  • Decontamination is never perfect. Near-duplicate detection is computationally expensive and has blind spots.
  • Brand-new benchmarks (created after training cutoff) avoid contamination but have less established validity.
  • Contamination is probabilistic, not binary — a benchmark may be partially contaminated, inflating some question categories more than others.

Change Log

  • 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.