Coding benchmarks explained: HumanEval, MBPP, SWE-bench and real developer work

When a model launch claims “state-of-the-art coding performance,” the supporting evidence is usually a score on HumanEval, MBPP or SWE-bench. These benchmarks measure different things, and none of them measures whether the model is useful for day-to-day development work.

TL;DR

HumanEval tests whether a model can generate a single function from a docstring — 164 problems, pass/fail based on unit tests.
MBPP is similar but larger (974 problems) and slightly easier.
SWE-bench tests whether a model can edit a real GitHub repository to fix an issue. It is much closer to real developer work and measures different skills from the other two.

A model that passes HumanEval at 90% can still write bad code in a real repository, fail to understand project structure, or produce insecure code that passes tests but is broken in production. Coding benchmarks are useful for comparing model generations. They are not a measure of developer productivity.

What this means

HumanEval (OpenAI, 2021): 164 hand-written programming problems. Each problem is a docstring describing what the function should do, plus a function signature. The model generates the function body. Scoring is pass/fail based on unit tests. HumanEval tests: can the model write a correct function given a clear specification and a familiar API pattern? It does not test: multi-file refactoring, understanding unfamiliar codebases, security awareness, or debugging existing code.
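To make the task format concrete, here is a minimal sketch of a HumanEval-style check. The problem, the completion and the tests are hypothetical and written for illustration, not taken from the benchmark. The passes helper is also an assumption; a real harness would run untrusted model output in a sandbox with timeouts rather than calling exec directly.

```python
# Hypothetical HumanEval-style problem: a signature plus a docstring.
PROMPT = '''def running_max(values):
    """Return a list where element i is the largest of values[0..i]."""
'''

# What a model might return: only the function body.
COMPLETION = '''    result, best = [], None
    for v in values:
        best = v if best is None or v > best else best
        result.append(best)
    return result
'''

# Unit tests the model never sees, wrapped in a check function.
TESTS = '''def check(candidate):
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([-2, -7]) == [-2, -2]
'''


def passes(prompt: str, completion: str, tests: str, entry_point: str) -> bool:
    """Run prompt + completion against the tests; any exception counts as a fail."""
    namespace: dict = {}
    try:
        # Unsandboxed exec: acceptable for a sketch, unsafe for real model output.
        exec(prompt + completion + "\n" + tests, namespace)
        namespace["check"](namespace[entry_point])
        return True
    except Exception:
        return False


print(passes(PROMPT, COMPLETION, TESTS, "running_max"))
print(passes(PROMPT, "    return values\n", TESTS, "running_max"))
```

The first call reports a pass and the second a fail. Note what the check cannot see: style, efficiency, or behaviour on inputs the three assertions do not cover.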

MBPP (Google, 2021): 974 problems similar to HumanEval but with more straightforward specifications and simpler required logic. Scores tend to be higher than HumanEval for the same model. MBPP tests: can the model handle a broader range of programming tasks? It does not test: anything beyond single-function generation.

SWE-bench (Princeton, 2023): 2,294 real GitHub issues from 12 popular Python repositories. The model must edit the repository code to resolve the issue. The evaluation applies the model’s patch and runs the tests associated with the pull request that originally resolved the issue: tests that failed before the fix must now pass, and tests that passed before must keep passing. SWE-bench tests: can the model understand an unfamiliar codebase, locate the relevant code, make a correct fix and pass those tests? It does not test: whether the fix is the best or most maintainable approach, whether it introduces security vulnerabilities, or whether the model can collaborate with a human developer.
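The pass/fail decision for a single SWE-bench instance can be sketched as follows. SWE-bench task instances list two groups of tests, FAIL_TO_PASS and PASS_TO_PASS. The function name, the shape of test_results and the test names below are assumptions made for illustration, not the official harness.

```python
def is_resolved(test_results: dict[str, bool],
                fail_to_pass: list[str],
                pass_to_pass: list[str]) -> bool:
    """Return True only if every listed test passes after the patch is applied.

    test_results maps a test name to True (passed) or False (failed).
    A test that is missing from the results is treated as failed.
    """
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(test_results.get(name, False) for name in required)


# Hypothetical test names for one task instance.
fail_to_pass = ["test_parse_empty_header"]
pass_to_pass = ["test_parse_basic", "test_parse_unicode"]

# The patch fixes the target test but breaks an existing one: not resolved.
after_patch = {
    "test_parse_empty_header": True,
    "test_parse_basic": True,
    "test_parse_unicode": False,
}
print(is_resolved(after_patch, fail_to_pass, pass_to_pass))
```

The decision is binary per instance, so a patch that fixes the issue but breaks one previously passing test scores the same as no patch at all.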

The gap between HumanEval and SWE-bench is roughly the gap between writing a single function from a specification and working on a real team codebase. A model that scores well on HumanEval is not necessarily useful for real development work; a model that scores well on SWE-bench is closer but still not a replacement for a developer.

Where teams misuse them

Claiming “state-of-the-art coding” based on HumanEval alone. A model scores 92% on HumanEval, and the press release says it is the best coding model. HumanEval is well past saturation (many models score above 85%) and it tests a narrow skill. A model cannot be “the best coder” based on 164 single-function problems. SWE-bench results, agentic benchmark results and hands-on testing matter more.

Using benchmark pass rates to estimate developer productivity. A vendor claims “model X completes coding tasks 50% faster.” The evidence is a HumanEval score. HumanEval pass rates have no demonstrated correlation with real-world developer velocity, task completion or code quality. The benchmark was designed for model comparison, not productivity measurement.

Comparing scores across different evaluation setups. One group runs SWE-bench with the model as a coding agent using file-editing tools and self-healing retries. Another runs it with direct code generation and no iteration. Both publish “SWE-bench score: 43%” but the results are not comparable — the scaffolding and agent framework dramatically affect outcomes. Always check the evaluation configuration, not just the number.
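A rough way to see why retries alone change the number: if a setup can tell when an attempt has succeeded (for example, through test feedback), extra attempts raise the chance of solving each task. The sketch below assumes every attempt succeeds independently with the same probability, which real retries do not satisfy. The 25% figure is illustrative, not a measured result.

```python
def success_with_retries(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    return 1.0 - (1.0 - p_single) ** attempts


# Same hypothetical model, same tasks, different number of attempts.
for attempts in (1, 3, 5):
    rate = success_with_retries(0.25, attempts)
    print(f"{attempts} attempt(s): {rate:.1%}")
```

Under these assumptions a 25% single-attempt rate becomes roughly 58% with three attempts. Both figures could be reported as a score for the same model, so the attempt budget needs to be stated alongside it.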

Assuming passing all unit tests means the code is correct. A patch that passes the existing test suite for a repository may still introduce subtle regressions, miss edge cases, or rely on insecure patterns that the tests do not cover. Unit test pass rate is a necessary condition for code quality, not a sufficient one.
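A small, hypothetical example of code that passes every test it is given and is still not correct. The function and its tests are invented for illustration.

```python
def percent_change(old: float, new: float) -> float:
    """Return the percentage change from old to new."""
    return (new - old) / old * 100


def test_percent_change() -> None:
    # The only tests the (hypothetical) repository has.
    assert percent_change(100, 150) == 50
    assert percent_change(200, 100) == -50


test_percent_change()  # completes without error: the suite is green

# Inputs the tests never exercise:
#   percent_change(0, 10)    raises ZeroDivisionError
#   percent_change(-50, -25) returns -50.0 even though the value rose
```

A benchmark that scores only on the test suite would count this function as a pass. Whether the uncovered cases matter depends on the codebase, which is exactly what the score cannot tell you.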

Practical decision check

When evaluating a coding model claim:

Which benchmark is cited? HumanEval/MBPP test function generation; SWE-bench tests repository editing.
What evaluation setup was used — direct generation, agent-based with iteration, or human-assisted?
Is the score difference from the next-best model large (more than 5 points) or within the noise range?
Has the benchmark been independently verified, or is the score self-reported?
Does the model’s actual output on your codebase match the benchmark promise?
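For the noise question in the checklist above, a back-of-the-envelope estimate helps. The sketch below treats each benchmark problem as an independent pass/fail trial and uses the normal approximation to the binomial distribution. Both are simplifying assumptions, and the function name is hypothetical. It ignores other sources of variation such as sampling temperature or prompt format.

```python
import math


def score_margin(pass_rate: float, n_problems: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error, in percentage points, for a pass rate."""
    std_err = math.sqrt(pass_rate * (1.0 - pass_rate) / n_problems)
    return 100.0 * z * std_err


# 90% on HumanEval's 164 problems: margin of roughly 4.6 points.
print(round(score_margin(0.90, 164), 1))

# 43% on SWE-bench's 2,294 instances: margin of roughly 2.0 points.
print(round(score_margin(0.43, 2294), 1))
```

On a 164-problem benchmark, two models a few points apart are hard to distinguish from sampling noise alone, which is one reason small HumanEval gaps deserve scepticism.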

Methodology

Data checked: 2026-05-25
Sources consulted: Original benchmark papers (HumanEval, MBPP, SWE-bench), benchmark repositories and leaderboards, and published model evaluation documentation.
Assumptions: Benchmark scores are point-in-time and may change with new model releases or evaluation methodology updates. Agent scaffolding and evaluation configuration significantly affect SWE-bench results.
Limitations: This guide covers the three most widely cited coding benchmarks but does not cover newer or less established benchmarks such as BigCodeBench and LiveCodeBench. None of the three benchmarks covered here directly measures code security, maintainability or readability.
Jurisdiction: Global. Coding benchmarks are jurisdiction-agnostic; specific regulatory requirements for AI-assisted code may apply in regulated industries.

Source list

HumanEval: Chen et al., “Evaluating Large Language Models Trained on Code,” arXiv:2107.03374, 2021 — https://github.com/openai/human-eval (accessed 2026-05-25)
MBPP: Austin et al., “Program Synthesis with Large Language Models,” arXiv:2108.07732, 2021 — https://github.com/google-research/google-research/tree/master/mbpp (accessed 2026-05-25)
SWE-bench: Jimenez et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” arXiv:2310.06770, 2023 — https://www.swebench.com/ (accessed 2026-05-25)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: full editorial review — added Editor’s Notes, Trust Stack, slugified heading IDs, formal Methodology and Source list sections, renamed “Related reading” to “Related guides”, fixed writtenBy label (editor)
2026-05-25: first published