How LLM benchmarks work, and what they miss
A cautious guide to benchmark scores, leaderboard claims, and the gap between test performance and real-world usefulness.
Evals
Evaluation is the part most teams skip, and the part that determines whether an AI product actually works. This section covers benchmarks, human review rubrics, contamination detection, synthetic data risks, and regression testing for prompts. No leaderboard worship, no vibes-based confidence.
Published now
When benchmark questions appear in training data, scores inflate. What contamination looks like, how to spot the signs, and how to evaluate models without being fooled by leaked scores.
Using LLMs to generate evaluation data is fast and cheap. When synthetic datasets save time, when they mislead, and how to review them before trusting the results.
Using one LLM to grade another is fast — but position bias, verbosity favouritism, and self-reinforcement mean the scores can mislead. When to trust automated grading and when it needs human backup.
A practical guide to designing human review processes for LLM outputs — simple rubrics for factuality, usefulness, tone, safety and citation quality that non-specialist reviewers can actually apply.
Why accuracy alone is not enough — the HELM framework shows how calibration, robustness, fairness and efficiency should sit together in any model evaluation.
What the major coding benchmarks actually measure, how they differ from real development work, and how to interpret coding model claims without overselling them.