Coding benchmarks explained: HumanEval, MBPP, SWE-bench and real developer work
When a model launch claims “state-of-the-art coding performance,” the supporting evidence is usually a score on HumanEval, MBPP or SWE-bench. These benchmarks measure different things, and none of them measures whether the model is useful for day-to-day development work.
Quick answer
- HumanEval tests whether a model can generate a single function from a docstring — 164 problems, pass/fail based on unit tests.
- MBPP is similar but larger (974 problems) and slightly easier.
- SWE-bench tests whether a model can edit a real GitHub repository to fix an issue. This is much closer to real developer work and exercises different skills: navigating a codebase, locating the fault and patching it.
A model that passes HumanEval at 90% can still write bad code in a real repository, fail to understand project structure, or produce code that passes the tests but is insecure or breaks in production. Coding benchmarks are useful for comparing model generations. They are not a measure of developer productivity.
What this means
HumanEval (OpenAI, 2021): 164 hand-written programming problems. Each problem is a docstring describing what the function should do, plus a function signature. The model generates the function body. Scoring is pass/fail based on unit tests. HumanEval tests: can the model write a correct function given a clear specification and a familiar API pattern? It does not test: multi-file refactoring, understanding unfamiliar codebases, security awareness, or debugging existing code.
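The pass/fail mechanics can be sketched in a few lines. This is a minimal illustration, not the official harness: the problem below (`running_max`) is a made-up example in the HumanEval style, and the helper `passes` is a hypothetical name. The `check(candidate)` test shape and the `entry_point` field follow the published dataset format. A real harness also runs each completion in a sandbox with a timeout, which this sketch omits.

```python
# Hypothetical HumanEval-style problem (not taken from the real dataset):
# a prompt (signature + docstring), a model completion (the function body),
# and unit tests the model never sees.
PROMPT = '''def running_max(values):
    """Return a list where element i is the maximum of values[:i+1]."""
'''

COMPLETION = '''    result = []
    best = None
    for v in values:
        best = v if best is None or v > best else best
        result.append(best)
    return result
'''

TEST = '''
def check(candidate):
    assert candidate([1, 3, 2, 5, 4]) == [1, 3, 3, 5, 5]
    assert candidate([]) == []
    assert candidate([-2, -5]) == [-2, -2]
'''


def passes(prompt: str, completion: str, test: str, entry_point: str) -> bool:
    """Return True if prompt + completion satisfies every assertion in test."""
    namespace: dict = {}
    try:
        exec(prompt + completion, namespace)        # define the candidate function
        exec(test, namespace)                       # define check()
        namespace["check"](namespace[entry_point])  # run the unit tests
    except Exception:
        # Syntax errors, runtime errors and failed assertions all count as a fail.
        return False
    return True


solved = passes(PROMPT, COMPLETION, TEST, "running_max")
```

The score for a model is then simply the fraction of problems for which a check like this succeeds, which is why the benchmark says nothing about style, security or maintainability.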
MBPP (Google, 2021): Mostly Basic Programming Problems, 974 crowd-sourced tasks designed to be solvable by entry-level programmers. The format is similar to HumanEval, but the specifications are more straightforward and the required logic is simpler, so scores tend to be higher than HumanEval for the same model. MBPP tests: can the model handle a broader range of basic programming tasks? It does not test: anything beyond single-function generation.
SWE-bench (Princeton, 2023): 2,294 real GitHub issues from 12 popular Python repositories. The model must edit the repository code to resolve the issue. The evaluation applies the model’s patch and runs the tests tied to the pull request that originally fixed the issue: tests that failed before the fix must now pass, and tests that passed before must still pass. SWE-bench tests: can the model understand an unfamiliar codebase, locate the relevant code and make a fix that passes those tests? It does not test: whether the fix is the best or most maintainable approach, whether it introduces security vulnerabilities, or whether the model can collaborate with a human developer.
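The SWE-bench pass criterion can be sketched as follows. Each task in the dataset carries two lists of test names, FAIL_TO_PASS (tests the reference fix turns green) and PASS_TO_PASS (tests that must stay green). The function name `is_resolved`, the shape of the `results` dictionary and the test names in the example are assumptions made for illustration, not the official evaluation code.

```python
def is_resolved(
    results: dict[str, bool],
    fail_to_pass: list[str],
    pass_to_pass: list[str],
) -> bool:
    """Decide whether a patched repository counts as resolving the issue.

    results maps a test name to True (passed) or False (failed) after the
    model's patch has been applied. A test missing from results is treated
    as failed, which is an assumption of this sketch.
    """
    required = fail_to_pass + pass_to_pass
    return all(results.get(name, False) for name in required)


# Hypothetical test names for one task.
fail_to_pass = ["tests/test_parser.py::test_empty_header"]
pass_to_pass = ["tests/test_parser.py::test_basic", "tests/test_io.py::test_roundtrip"]

# The patch fixes the issue but breaks an unrelated test, so it is not resolved.
after_patch = {
    "tests/test_parser.py::test_empty_header": True,
    "tests/test_parser.py::test_basic": True,
    "tests/test_io.py::test_roundtrip": False,
}
resolved = is_resolved(after_patch, fail_to_pass, pass_to_pass)
```

Note that the criterion is all-or-nothing per task and only as strict as the listed tests: a patch that satisfies them by an unmaintainable route still counts as resolved.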
The gap between HumanEval and SWE-bench is roughly the gap between writing a single function from a specification and working on a real team codebase. A model that scores well on HumanEval is not necessarily useful for real development work; a model that scores well on SWE-bench is closer but still not a replacement for a developer.
Where teams misuse them
Claiming “state-of-the-art coding” based on HumanEval alone. A model scores 92% on HumanEval, and the press release says it is the best coding model. HumanEval is effectively saturated (many models score above 85%), and it tests a narrow skill. A model cannot be “the best coder” based on 164 single-function problems. SWE-bench results, results on agentic benchmarks and hands-on testing matter more.
Using benchmark pass rates to estimate developer productivity. A vendor claims “model X completes coding tasks 50% faster.” The evidence is a HumanEval score. HumanEval pass rates have no demonstrated correlation with real-world developer velocity, task completion or code quality. The benchmark was designed for model comparison, not productivity measurement.
Comparing scores across different evaluation setups. One group runs SWE-bench with the model as a coding agent using file-editing tools and automatic retries. Another runs it with direct code generation and no iteration. Both publish “SWE-bench score: 43%” but the results are not comparable, because the scaffolding and agent framework dramatically affect outcomes. Always check the evaluation configuration, not just the number.
Assuming passing all unit tests means the code is correct. A patch that passes the existing test suite for a repository may still introduce subtle regressions, miss edge cases, or rely on insecure patterns that the tests do not cover. Passing the tests is a necessary condition for correct code, not a sufficient one.
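The same comparability problem exists on function-level benchmarks through the sampling budget. The HumanEval paper reports pass@k, the probability that at least one of k sampled completions passes, and estimates it from n samples of which c are correct as 1 - C(n-c, k) / C(n, k). The sketch below implements that estimator. The function name `pass_at_k` and the example numbers are illustrative assumptions, not results for any real model.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: completions sampled, c: completions that passed the tests,
    k: the sampling budget being reported.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative only: 10 samples on one problem, 3 of them correct.
single_attempt = pass_at_k(n=10, c=3, k=1)   # 0.30
five_attempts = pass_at_k(n=10, c=3, k=5)    # about 0.92
```

The same model output yields 30% or roughly 92% on this problem depending on whether the report uses k=1 or k=5, so a headline number without its k is not comparable to another.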
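A small, made-up example of code that passes its tests while remaining unsafe. The function names, the directory `/srv/data` and the tests are all hypothetical. `posixpath` is used so the behaviour is the same on any platform, and the base directory is assumed to be normalized with no trailing slash.

```python
import posixpath


def is_inside_naive(base_dir: str, user_path: str) -> bool:
    """Naive containment check: only compares string prefixes."""
    return posixpath.join(base_dir, user_path).startswith(base_dir)


def is_inside_safe(base_dir: str, user_path: str) -> bool:
    """Collapse '..' segments before comparing.

    normpath does not resolve symlinks, so this is still only a sketch.
    """
    resolved = posixpath.normpath(posixpath.join(base_dir, user_path))
    return posixpath.commonpath([base_dir, resolved]) == base_dir


# The "existing test suite": both implementations pass it.
for check in (is_inside_naive, is_inside_safe):
    assert check("/srv/data", "report.csv")
    assert check("/srv/data", "2024/report.csv")

# The case the tests never covered: a path traversal attempt.
traversal = "../../etc/passwd"
naive_allows = is_inside_naive("/srv/data", traversal)  # True: accepts /etc/passwd
safe_allows = is_inside_safe("/srv/data", traversal)    # False
```

A benchmark that only ran the two tests in the loop would score both implementations identically.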
Practical decision check
When evaluating a coding model claim:
- Which benchmark is cited? HumanEval/MBPP test function generation; SWE-bench tests repository editing.
- What evaluation setup was used — direct generation, agent-based with iteration, or human-assisted?
- Is the score difference from the next best model large (>5 points) or within noise range?
- Has the benchmark been independently verified, or is the score self-reported?
- Does the model’s actual output on your codebase match the benchmark promise?
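For the noise-range question, a rough way to size the uncertainty is a binomial confidence interval on the pass rate. The sketch below uses the normal approximation and assumes problems are independent and scored once each. It ignores run-to-run variance from sampling and is inaccurate for scores very close to 0% or 100%. The function name `score_half_width` is an illustrative assumption.

```python
from math import sqrt


def score_half_width(pass_rate: float, n_problems: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a benchmark pass rate."""
    if not 0.0 <= pass_rate <= 1.0 or n_problems <= 0:
        raise ValueError("pass_rate must be in [0, 1] and n_problems positive")
    standard_error = sqrt(pass_rate * (1.0 - pass_rate) / n_problems)
    return z * standard_error


# 90% on HumanEval's 164 problems: roughly +/- 4.6 points.
humaneval_noise = score_half_width(0.90, 164)

# 43% on the full 2,294-task SWE-bench set: roughly +/- 2.0 points.
swebench_noise = score_half_width(0.43, 2294)
```

Under these assumptions, a gap of two or three points between two models on a 164-problem benchmark sits inside the uncertainty of a single score, while the same gap on a benchmark with thousands of tasks is more likely to be real.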
Evidence and caveats
Sources:
- HumanEval: Chen et al., “Evaluating Large Language Models Trained on Code,” arXiv:2107.03374, 2021. https://github.com/openai/human-eval
- MBPP: Austin et al., “Program Synthesis with Large Language Models,” arXiv:2108.07732, 2021.
- SWE-bench: Jimenez et al., “SWE-bench: Can Language Models Resolve Real-World GitHub Issues?” arXiv:2310.06770, 2023. https://www.swebench.com/
- SWE-bench leaderboard: https://www.swebench.com/
Caveats:
- Benchmark contamination is a known issue: some coding problems are present in training data, inflating scores.
- SWE-bench scores are highly dependent on the agent framework and tooling used — same model, different scaffolding, very different results.
- No coding benchmark currently measures code security, maintainability or readability.
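One common way to probe for contamination, when a sample of the training corpus is available, is to look for long token n-grams shared between a benchmark item and corpus documents. The sketch below is a crude heuristic under that assumption: it splits on whitespace, the helper names are hypothetical, and the window size of 8 tokens is an arbitrary choice rather than an established threshold. Most users of a closed model do not have access to the training data at all, so this check is often not possible.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """All whitespace-token n-grams in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in corpus_doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Made-up strings standing in for a benchmark problem and two corpus documents.
item = "def add(a, b): return the sum of a and b as an integer"
leaked_doc = "# scraped snippet\n" + item + "\n# end"
clean_doc = "A tutorial on sorting algorithms and their running time in practice"

leaked_score = overlap_fraction(item, leaked_doc)
clean_score = overlap_fraction(item, clean_doc)
```

A high overlap flags an item for manual review. A low overlap does not prove the item is clean, since paraphrased or reformatted copies evade exact matching.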
Last checked: 2026-05-25.
Related reading
- How LLM benchmarks work, and what they miss
- Function-calling benchmarks: why tool-use scores do not guarantee agents work
- AI coding agents: what to measure before trusting them
- Contamination and leakage: why benchmark scores can be too good
Change Log
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.