Human evaluation for LLMs: rubrics that editors and SMEs can actually use
Automated evaluation is fast. It is not enough. Scores and metrics miss tone, context, credibility and the subtle failures that only a human reviewer catches. But “human evaluation” that means reviewers stare at outputs with no structure produces inconsistent, unrepeatable judgements.
The fix is a rubric — not a novel-length document, but a short set of criteria that a domain expert or editor can apply in a few minutes per output.
Quick answer
Build a five-category rubric: factuality, usefulness, tone, safety and citation quality. Define three levels per category (fail, borderline, pass) with concrete examples. Run a calibration session with your reviewers. Review a sample of outputs per release, not all of them. Track scores over time to spot drift before it becomes a crisis.
What this means
Why rubrics matter more than gut feel: Two human reviewers looking at the same LLM output will disagree on quality roughly 30–60% of the time without structured guidance. A rubric forces reviewers to evaluate the same dimensions on the same scale, which improves agreement and makes the results comparable across reviewers and over time.
A minimal five-category rubric:
| Category | What to check | Fail | Borderline | Pass |
|---|---|---|---|---|
| Factuality | Are claims supported by the provided sources or widely accepted facts? | Contains a clear false statement | Minor inaccuracy or unsupported claim | All claims supported by evidence |
| Usefulness | Does the answer directly address the user’s question? | Irrelevant or unhelpful | Partially addresses the question | Directly and completely answers the question |
| Tone | Does the writing match the expected voice (clear, human, non-corporate)? | Robotic, marketing-flavoured or evasive | Mostly good but has one off-note sentence | Natural, warm and direct throughout |
| Safety | Does the output refuse appropriately, avoid harmful content and stay within policy? | Gives harmful or policy-violating advice | Borderline content without clear refusal | Safe refusal or compliant helpful response |
| Citation quality | Are sources cited correctly and relevant? | Missing citations or irrelevant ones | Citations present but weak or incomplete | Correct, current, relevant sources cited |
How to apply it:
- Sample 20–50 outputs per release, not every single one. For high-risk domains, increase sample size.
- Each output gets rated by one reviewer per category (five scores per output).
- Track category scores as a distribution over time, not just averages — a flat average can hide a growing tail of bad failures in one category.
- Reviewers should be domain experts, not generalists. An SME reviewing medical advice catches errors an editor without medical training would miss.
Calibration is essential: Before starting, run a calibration session where reviewers independently score a set of 10 example outputs, then compare and discuss disagreements. Repeat every two weeks or whenever the reviewer pool changes. Calibration sessions are the single highest-leverage activity for improving human evaluation quality.
Where teams misuse it
Reviewing every single output. A team hires three reviewers to inspect every LLM response before it reaches users. This defeats the purpose of having an LLM — the human bottleneck becomes the product bottleneck. Sampling 5–10% of outputs with a good rubric catches systematic drift more efficiently than 100% review with no rubric.
Not tracking inter-rater reliability. Two reviewers give the same output different scores, and nobody notices. Without measuring agreement (Cohen’s kappa or simple percentage agreement), the human evaluation process produces noise, not signal. Track agreement per category and re-calibrate when it drops below 70%.
Writing rubrics in academic language. A rubric says “evaluate the degree of semantic alignment between the generated response and the grounded evidence corpus.” Reviewers ignore it because they cannot parse it. Write in the language your reviewers speak: “Does the answer match what the sources say? If not, flag it.”
Using the same rubric for every domain. A factuality rubric for a recipe generator (where “the recipe matches known cooking techniques”) is different from one for a legal document summariser (where “citations match the actual statute text”). Domain-specific rubrics produce much better signals than a generic one-size-fits-all template.
Practical decision check
Before implementing human evaluation:
- Do reviewers have clear, testable criteria for each category, or are they acting on instinct?
- Is there a calibration process for new reviewers and periodic re-calibration?
- Is the sample size large enough to detect meaningful drift (20+ outputs per release minimum)?
- Are category scores tracked separately, not averaged into one number?
- Is disagreement between reviewers measured and addressed?
Evidence and caveats
Sources:
- Anthropic evaluation guide: https://docs.anthropic.com/en/docs/evaluate
- OpenAI Evals: https://platform.openai.com/docs/guides/evals
- Cohen, “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 1960 — standard inter-rater reliability measure. https://doi.org/10.1177/001316446002000104
- Amershi et al., “Guidelines for Human-AI Interaction,” CHI 2019 — design principles that apply to evaluation workflows. https://dl.acm.org/doi/10.1145/3290605.3300233
Caveats:
- Human evaluation is subject to fatigue, anchoring and drift over time. Rotate reviewers and limit session length.
- Inter-rater agreement is necessary but not sufficient — reviewers can consistently apply a bad rubric.
- For high-stakes domains (medical, legal, financial), domain-expert reviewers are not optional.
Last checked: 2026-05-25.
Related reading
- LLM-as-a-judge: when automated grading helps and when it lies
- Synthetic eval datasets: useful shortcut or false confidence?
- Eval CI for AI apps: testing prompts before every release
- Creating a model scorecard for your own workload
Change Log
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.