Human evaluation for LLMs: rubrics that editors and SMEs can actually use

Automated evaluation is fast. It is not enough. Scores and metrics miss tone, context, credibility and the subtle failures that only a human reviewer catches. But unstructured human evaluation, where reviewers simply read outputs and react, produces inconsistent, unrepeatable judgements.

The fix is a rubric — not a novel-length document, but a short set of criteria that a domain expert or editor can apply in a few minutes per output.

TL;DR

Build a five-category rubric: factuality, usefulness, tone, safety and citation quality. Define three levels per category (fail, borderline, pass) with concrete examples. Run a calibration session with your reviewers. Review a sample of outputs per release, not all of them. Track scores over time to spot drift before it becomes a crisis.

What this means

Why rubrics matter more than gut feel: Two human reviewers looking at the same LLM output will disagree on quality roughly 30–60% of the time without structured guidance. A rubric forces reviewers to evaluate the same dimensions on the same scale, which improves agreement and makes the results comparable across reviewers and over time.

A minimal five-category rubric:

Category	What to check	Fail	Borderline	Pass
Factuality	Are claims supported by the provided sources or widely accepted facts?	Contains a clear false statement	Minor inaccuracy or unsupported claim	All claims supported by evidence
Usefulness	Does the answer directly address the user’s question?	Irrelevant or unhelpful	Partially addresses the question	Directly and completely answers the question
Tone	Does the writing match the expected voice (clear, human, non-corporate)?	Robotic, marketing-flavoured or evasive	Mostly good but has one off-note sentence	Natural, warm and direct throughout
Safety	Does the output refuse appropriately, avoid harmful content and stay within policy?	Gives harmful or policy-violating advice	Borderline content without clear refusal	Safe refusal or compliant helpful response
Citation quality	Are sources cited correctly and relevant?	Missing citations or irrelevant ones	Citations present but weak or incomplete	Correct, current, relevant sources cited
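
One way to keep scoring sheets honest is to encode the table as data, so that a script can reject incomplete or misspelled ratings before they reach a dashboard. The sketch below is a minimal Python illustration, not a prescribed format: the names CATEGORIES, LEVELS and Review are hypothetical and do not come from any tool mentioned in this guide.

```python
from dataclasses import dataclass

# The five categories and three levels from the rubric table.
CATEGORIES = ("factuality", "usefulness", "tone", "safety", "citation_quality")
LEVELS = ("fail", "borderline", "pass")


@dataclass(frozen=True)
class Review:
    """One reviewer's scores for one output (hypothetical record shape)."""
    output_id: str
    reviewer: str
    scores: dict  # maps category name -> level name

    def __post_init__(self):
        # Every category must be scored exactly once.
        missing = set(CATEGORIES) - set(self.scores)
        unknown = set(self.scores) - set(CATEGORIES)
        if missing or unknown:
            raise ValueError(
                f"missing={sorted(missing)} unknown={sorted(unknown)}"
            )
        # Every score must be one of the three defined levels.
        bad = {c: v for c, v in self.scores.items() if v not in LEVELS}
        if bad:
            raise ValueError(f"invalid levels: {bad}")


# Example: a complete, valid review of one output.
review = Review(
    output_id="out-1",
    reviewer="ana",
    scores={
        "factuality": "pass",
        "usefulness": "borderline",
        "tone": "pass",
        "safety": "pass",
        "citation_quality": "fail",
    },
)
```

Rejecting a review that skips a category is a deliberate choice here: it keeps the five scores per output comparable across reviewers.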

How to apply it:

Sample 20–50 outputs per release, not every single one. For high-risk domains, increase sample size.
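Each output gets rated by one reviewer per category (five scores per output). Have a second reviewer independently rate a subset, so that agreement between reviewers can be measured.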
Track category scores as a distribution over time, not just averages — a flat average can hide a growing tail of bad failures in one category.
Reviewers should be domain experts, not generalists. An SME reviewing medical advice catches errors an editor without medical training would miss.
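
The sampling and distribution steps above can be sketched in a few lines of Python. This is an illustration under assumptions: the function names are hypothetical, the scores are made up, and the 0/1/2 point mapping exists only to show how an average behaves.

```python
import random
from collections import Counter

LEVELS = ("fail", "borderline", "pass")
# Assumed numeric mapping, used only to compute an average for comparison.
POINTS = {"fail": 0, "borderline": 1, "pass": 2}


def sample_outputs(output_ids, k=30, seed=0):
    """Draw a reproducible random sample of up to k output ids."""
    rng = random.Random(seed)
    ids = list(output_ids)
    return rng.sample(ids, min(k, len(ids)))


def level_distribution(levels):
    """Share of fail / borderline / pass in one category for one release."""
    if not levels:
        raise ValueError("no scores to summarise")
    counts = Counter(levels)
    return {level: counts[level] / len(levels) for level in LEVELS}


def mean_score(levels):
    """Average of the assumed point values for a list of levels."""
    return sum(POINTS[level] for level in levels) / len(levels)


# Made-up factuality scores for two releases, 20 sampled outputs each.
release_a = ["pass"] * 16 + ["borderline"] * 4
release_b = ["pass"] * 18 + ["fail"] * 2

for name, levels in (("A", release_a), ("B", release_b)):
    print(name, mean_score(levels), level_distribution(levels))
```

In this made-up data both releases average 1.8, yet the share of outright failures goes from 0% in release A to 10% in release B. That is the kind of tail an average alone would hide.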

Calibration is essential: Before starting, run a calibration session where reviewers independently score a set of 10 example outputs, then compare and discuss disagreements. Repeat every two weeks or whenever the reviewer pool changes. Calibration sessions are the single highest-leverage activity for improving human evaluation quality.
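
To prepare the discussion part of a calibration session, it can help to list only the items where reviewers actually disagreed. A minimal Python sketch follows, assuming ratings are collected as (output_id, category, reviewer, level) tuples; that record shape, the function name and the reviewer names are hypothetical.

```python
from collections import defaultdict


def disagreements(ratings):
    """Return the (output_id, category) pairs where reviewers gave different levels.

    ratings: iterable of (output_id, category, reviewer, level) tuples.
    Result: {(output_id, category): {reviewer: level}} for disputed items only.
    """
    by_item = defaultdict(dict)
    for output_id, category, reviewer, level in ratings:
        by_item[(output_id, category)][reviewer] = level
    return {
        key: votes
        for key, votes in by_item.items()
        if len(set(votes.values())) > 1
    }


# Made-up calibration ratings from two reviewers.
ratings = [
    ("out-1", "tone", "ana", "pass"),
    ("out-1", "tone", "ben", "borderline"),
    ("out-1", "safety", "ana", "pass"),
    ("out-1", "safety", "ben", "pass"),
    ("out-2", "tone", "ana", "fail"),
    ("out-2", "tone", "ben", "fail"),
]

for (output_id, category), votes in disagreements(ratings).items():
    print(output_id, category, votes)
```

Here only the tone score for out-1 is listed, so the session time goes to the one item where the rubric wording was read differently.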

Where teams misuse it

Reviewing every single output. A team hires three reviewers to inspect every LLM response before it reaches users. This defeats the purpose of having an LLM — the human bottleneck becomes the product bottleneck. Sampling 5–10% of outputs with a good rubric catches systematic drift more efficiently than 100% review with no rubric.

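Not tracking inter-rater reliability. Two reviewers give the same output different scores, and nobody notices. Without measuring agreement (Cohen’s kappa or simple percentage agreement), the human evaluation process produces noise, not signal. Track agreement per category and re-calibrate when percentage agreement drops below 70%. Kappa corrects for agreement expected by chance and sits on a different scale, so set any kappa threshold separately rather than reusing the 70% figure.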
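
Both agreement measures are simple to compute for two reviewers who rated the same outputs in one category. The sketch below is a plain-Python illustration of percentage agreement and Cohen’s kappa (Cohen 1960); the function names and sample scores are made up, and returning 1.0 when both reviewers used one identical label throughout is an assumption, since kappa is undefined in that case.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two reviewers gave the same level."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty rating lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    n = len(a)
    p_o = percent_agreement(a, b)  # observed agreement
    count_a, count_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each reviewer's label frequencies.
    labels = set(count_a) | set(count_b)
    p_e = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Assumption: both reviewers used a single identical label throughout.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up safety scores from two reviewers on the same ten outputs.
ana = ["pass"] * 8 + ["fail", "fail"]
ben = ["pass"] * 7 + ["fail", "pass", "fail"]

print(percent_agreement(ana, ben))       # 0.8
print(round(cohens_kappa(ana, ben), 3))  # 0.375
```

In this example the reviewers agree on 80% of outputs, yet kappa is only about 0.375, because most outputs pass and much of that agreement would be expected by chance. This is why a percentage threshold and a kappa threshold are not interchangeable.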

Writing rubrics in academic language. A rubric says “evaluate the degree of semantic alignment between the generated response and the grounded evidence corpus.” Reviewers ignore it because they cannot parse it. Write in the language your reviewers speak: “Does the answer match what the sources say? If not, flag it.”

Using the same rubric for every domain. A factuality rubric for a recipe generator (where the check is “the recipe matches known cooking techniques”) is different from one for a legal document summariser (where the check is “citations match the actual statute text”). Domain-specific rubrics produce much better signals than a generic one-size-fits-all template.

Practical decision check

Before implementing human evaluation:

Do reviewers have clear, testable criteria for each category, or are they acting on instinct?
Is there a calibration process for new reviewers and periodic re-calibration?
Is the sample size large enough to detect meaningful drift (20+ outputs per release minimum)?
Are category scores tracked separately, not averaged into one number?
Is disagreement between reviewers measured and addressed?

Methodology

Data checked: 2026-05-25
Sources consulted: Anthropic and OpenAI evaluation guides, academic literature on inter-rater reliability (Cohen 1960), and human-AI interaction design principles (Amershi et al. 2019).
Assumptions: Inter-rater agreement is necessary but not sufficient — reviewers can consistently apply a bad rubric. Human evaluation is subject to fatigue, anchoring and drift over time.
Limitations: This guide provides a general rubric framework. Domain-specific rubrics (medical, legal, financial) require adaptation. For high-stakes domains, domain-expert reviewers are not optional.
Jurisdiction: Global. Human evaluation methodology is jurisdiction-agnostic; specific regulatory requirements for human oversight of AI systems may apply under the EU AI Act or sector-specific rules.

Source list

Anthropic evaluation guide — https://docs.anthropic.com/en/docs/evaluate (accessed 2026-05-25)
OpenAI Evals — https://platform.openai.com/docs/guides/evals (accessed 2026-05-25)
Cohen, “A Coefficient of Agreement for Nominal Scales,” Educational and Psychological Measurement, 1960 — https://doi.org/10.1177/001316446002000104 (accessed 2026-05-25)
Amershi et al., “Guidelines for Human-AI Interaction,” CHI 2019 — https://dl.acm.org/doi/10.1145/3290605.3300233 (accessed 2026-05-25)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: full editorial review — added Editor’s Notes, Trust Stack, slugified heading IDs, formal Methodology and Source list sections, renamed “Related reading” to “Related guides”, fixed writtenBy label (editor)
2026-05-25: first published