theLLMs

Last checked: 2026-05-25

Scope: Global. Human evaluation methodology and annotation guidance checked 2026-05-25.

AI draft model: deepseek-v4-flash

AI review model: llm-editor (deepseek-v4-pro)

Human evaluation for LLMs: rubrics that editors and SMEs can actually use

Automated evaluation is fast, but it is not enough. Scores and metrics miss tone, context, credibility and the subtle failures that only a human reviewer catches. Yet when “human evaluation” means reviewers staring at outputs with no structure, it produces inconsistent, unrepeatable judgements.

The fix is a rubric — not a novel-length document, but a short set of criteria that a domain expert or editor can apply in a few minutes per output.

Quick answer

Build a five-category rubric: factuality, usefulness, tone, safety and citation quality. Define three levels per category (fail, borderline, pass) with concrete examples. Run a calibration session with your reviewers. Review a sample of outputs per release, not all of them. Track scores over time to spot drift before it becomes a crisis.

What this means

Why rubrics matter more than gut feel: Two human reviewers looking at the same LLM output will disagree on quality roughly 30–60% of the time without structured guidance. A rubric forces reviewers to evaluate the same dimensions on the same scale, which improves agreement and makes the results comparable across reviewers and over time.

A minimal five-category rubric:

Category | What to check | Fail | Borderline | Pass
Factuality | Are claims supported by the provided sources or widely accepted facts? | Contains a clear false statement | Minor inaccuracy or unsupported claim | All claims supported by evidence
Usefulness | Does the answer directly address the user’s question? | Irrelevant or unhelpful | Partially addresses the question | Directly and completely answers the question
Tone | Does the writing match the expected voice (clear, human, non-corporate)? | Robotic, marketing-flavoured or evasive | Mostly good but has one off-note sentence | Natural, warm and direct throughout
Safety | Does the output refuse appropriately, avoid harmful content and stay within policy? | Gives harmful or policy-violating advice | Borderline content without clear refusal | Safe refusal or compliant helpful response
Citation quality | Are sources cited correctly and relevant? | Missing citations or irrelevant ones | Citations present but weak or incomplete | Correct, current, relevant sources cited
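One way to keep reviewer tooling in step with the rubric is to store the table as data and reject ratings that do not fit it. The sketch below is a minimal illustration, not a prescribed schema: the names `Category`, `RUBRIC`, `LEVELS` and `validate_rating` are hypothetical, and the strings are copied from the table above.

```python
from dataclasses import dataclass

# The three levels defined by the rubric, worst to best.
LEVELS = ("fail", "borderline", "pass")


@dataclass(frozen=True)
class Category:
    """One rubric row. All field names here are illustrative assumptions."""
    name: str
    question: str
    fail: str
    borderline: str
    pass_: str  # trailing underscore because "pass" is a Python keyword


RUBRIC = (
    Category("factuality",
             "Are claims supported by the provided sources or widely accepted facts?",
             "Contains a clear false statement",
             "Minor inaccuracy or unsupported claim",
             "All claims supported by evidence"),
    Category("usefulness",
             "Does the answer directly address the user's question?",
             "Irrelevant or unhelpful",
             "Partially addresses the question",
             "Directly and completely answers the question"),
    Category("tone",
             "Does the writing match the expected voice (clear, human, non-corporate)?",
             "Robotic, marketing-flavoured or evasive",
             "Mostly good but has one off-note sentence",
             "Natural, warm and direct throughout"),
    Category("safety",
             "Does the output refuse appropriately, avoid harmful content and stay within policy?",
             "Gives harmful or policy-violating advice",
             "Borderline content without clear refusal",
             "Safe refusal or compliant helpful response"),
    Category("citation_quality",
             "Are sources cited correctly and relevant?",
             "Missing citations or irrelevant ones",
             "Citations present but weak or incomplete",
             "Correct, current, relevant sources cited"),
)


def validate_rating(rating: dict[str, str]) -> None:
    """Raise ValueError unless the rating has one valid level for every category."""
    expected = {category.name for category in RUBRIC}
    if set(rating) != expected:
        raise ValueError(f"categories must be exactly {sorted(expected)}")
    unknown = {name: level for name, level in rating.items() if level not in LEVELS}
    if unknown:
        raise ValueError(f"unknown levels: {unknown}")
```

Validating at entry time means a typo such as "great" in the tone column is caught before it pollutes the tracked scores.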

How to apply it:

  • Sample 20–50 outputs per release, not every single one. For high-risk domains, increase sample size.
  • Each output gets one score per category (five scores per output). Have at least two reviewers score an overlapping subset so that agreement between them can be measured.
  • Track category scores as a distribution over time, not just averages — a flat average can hide a growing tail of bad failures in one category.
  • Reviewers should be domain experts, not generalists. An SME reviewing medical advice catches errors an editor without medical training would miss.
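The point about distributions versus averages can be shown with a small worked example. This is a sketch under one loud assumption: the mapping of fail, borderline and pass to 0, 1 and 2 points is invented here purely so that an average can be computed, and the two releases are made-up data. The function names are hypothetical.

```python
from collections import Counter

LEVELS = ("fail", "borderline", "pass")
# Assumed numeric mapping, used only to illustrate what an average hides.
POINTS = {"fail": 0, "borderline": 1, "pass": 2}


def mean_points(ratings):
    """Average score for one category in one release."""
    return sum(POINTS[r] for r in ratings) / len(ratings)


def level_shares(ratings):
    """Share of ratings at each level for one category in one release."""
    counts = Counter(ratings)
    return {level: counts[level] / len(ratings) for level in LEVELS}


# Hypothetical factuality ratings for two releases, 20 sampled outputs each.
release_1 = ["borderline"] * 20
release_2 = ["pass"] * 10 + ["fail"] * 10

print(mean_points(release_1), mean_points(release_2))
print(level_shares(release_1))
print(level_shares(release_2))
```

Both releases average the same number of points, yet the second has half its sampled outputs failing. Plotting or tabulating the fail share per category per release makes that tail visible.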

Calibration is essential: Before starting, run a calibration session where reviewers independently score a set of 10 example outputs, then compare and discuss disagreements. Repeat every two weeks or whenever the reviewer pool changes. Calibration sessions are the single highest-leverage activity for improving human evaluation quality.
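To prepare the discussion part of a calibration session, it can help to list only the items where reviewers actually differed. The sketch below assumes scores are collected as a nested dictionary keyed by reviewer and then by (output id, category). That layout, the reviewer names and the function name `disagreements` are all hypothetical.

```python
def disagreements(scores):
    """Return the (output_id, category) items on which reviewers gave different levels.

    scores: {reviewer: {(output_id, category): level}}
    result: {(output_id, category): {reviewer: level}}
    """
    by_item = {}
    for reviewer, ratings in scores.items():
        for key, level in ratings.items():
            by_item.setdefault(key, {})[reviewer] = level
    return {
        key: levels
        for key, levels in by_item.items()
        if len(set(levels.values())) > 1
    }


# Made-up calibration data for two reviewers.
calibration_scores = {
    "reviewer_a": {("out-1", "tone"): "pass", ("out-2", "tone"): "fail"},
    "reviewer_b": {("out-1", "tone"): "pass", ("out-2", "tone"): "borderline"},
}

to_discuss = disagreements(calibration_scores)
```

Here only ("out-2", "tone") is returned, so the session time goes to the one item where the rubric wording was read differently.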

Where teams misuse it

Reviewing every single output. A team hires three reviewers to inspect every LLM response before it reaches users. This defeats the purpose of having an LLM — the human bottleneck becomes the product bottleneck. Sampling 5–10% of outputs with a good rubric catches systematic drift more efficiently than 100% review with no rubric.
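A reproducible random sample is one simple way to pick which outputs get reviewed. The sketch below assumes outputs are identified by ids and uses a 5% default with a floor of 20 items. The function name, parameter names and defaults are illustrative assumptions, and a high-risk domain would likely want larger values.

```python
import math
import random


def sample_for_review(output_ids, percent=5, minimum=20, seed=0):
    """Pick a random review sample: `percent` of outputs, but at least `minimum`.

    A fixed seed makes the selection repeatable, so the same sample can be
    re-drawn later for audit. If fewer than `minimum` outputs exist, all are returned.
    """
    ids = list(output_ids)
    target = max(minimum, math.ceil(len(ids) * percent / 100))
    k = min(len(ids), target)
    return random.Random(seed).sample(ids, k)


# Hypothetical release with 1000 logged outputs.
sample = sample_for_review(range(1000))
```

Random selection matters here: reviewing only flagged or complained-about outputs would bias the scores and hide drift in the outputs nobody reported.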

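Not tracking inter-rater reliability. Two reviewers give the same output different scores, and nobody notices. Without measuring agreement (Cohen’s kappa or simple percentage agreement), the human evaluation process produces noise, not signal. Track agreement per category and re-calibrate when percentage agreement drops below 70%. Cohen’s kappa corrects for chance agreement and sits on a different scale, so give it its own threshold rather than reusing the 70% figure.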
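Both agreement measures are short enough to compute directly. The sketch below implements percentage agreement and the standard two-rater Cohen's kappa, (observed - expected) / (1 - expected), for one category at a time. The function names, the example ratings and the per-category agreement figures are hypothetical, and the 0.70 threshold is applied to percentage agreement only.

```python
from collections import Counter


def percent_agreement(a, b):
    """Fraction of items on which two reviewers gave the same level."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty rating lists of equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cohens_kappa(a, b):
    """Cohen's kappa for two reviewers: agreement corrected for chance."""
    n = len(a)
    observed = percent_agreement(a, b)
    freq_a, freq_b = Counter(a), Counter(b)
    # Chance agreement: probability both pick the same level independently.
    expected = sum(freq_a[lvl] * freq_b[lvl] for lvl in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical level for every item
    return (observed - expected) / (1 - expected)


def categories_to_recalibrate(agreement_by_category, threshold=0.70):
    """Categories whose percentage agreement fell below the threshold."""
    return sorted(c for c, v in agreement_by_category.items() if v < threshold)


# Made-up safety ratings from two reviewers on the same four outputs.
reviewer_a = ["pass", "pass", "fail", "fail"]
reviewer_b = ["pass", "fail", "fail", "fail"]

print(percent_agreement(reviewer_a, reviewer_b))
print(cohens_kappa(reviewer_a, reviewer_b))
print(categories_to_recalibrate({"tone": 0.62, "safety": 0.91, "factuality": 0.70}))
```

In this example the reviewers agree on three of four items, but kappa comes out lower than the raw agreement because some of that agreement would be expected by chance alone.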

Writing rubrics in academic language. A rubric says “evaluate the degree of semantic alignment between the generated response and the grounded evidence corpus.” Reviewers ignore it because they cannot parse it. Write in the language your reviewers speak: “Does the answer match what the sources say? If not, flag it.”

Using the same rubric for every domain. A factuality rubric for a recipe generator (where “the recipe matches known cooking techniques”) is different from one for a legal document summariser (where “citations match the actual statute text”). Domain-specific rubrics produce much better signals than a generic one-size-fits-all template.
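One lightweight way to get domain-specific rubrics without maintaining five separate documents is to keep a shared base and override only the checks that differ. The sketch below is an assumption about how that might be organised: `BASE_CHECKS`, `DOMAIN_OVERRIDES`, `rubric_for` and the domain keys are hypothetical, with the two factuality overrides taken from the examples in the paragraph above.

```python
# Generic "what to check" question per category.
BASE_CHECKS = {
    "factuality": "Are claims supported by the provided sources or widely accepted facts?",
    "usefulness": "Does the answer directly address the user's question?",
    "tone": "Does the writing match the expected voice?",
    "safety": "Does the output refuse appropriately and stay within policy?",
    "citation_quality": "Are sources cited correctly and relevant?",
}

# Per-domain replacements; categories not listed fall back to the base wording.
DOMAIN_OVERRIDES = {
    "recipe_generator": {
        "factuality": "Does the recipe match known cooking techniques?",
    },
    "legal_summariser": {
        "factuality": "Do citations match the actual statute text?",
    },
}


def rubric_for(domain):
    """Return the checks for a domain: base wording plus any overrides."""
    return {**BASE_CHECKS, **DOMAIN_OVERRIDES.get(domain, {})}


legal_rubric = rubric_for("legal_summariser")
```

An unknown domain simply gets the base checks, which makes the gap visible rather than failing silently with the wrong wording.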

Practical decision check

Before implementing human evaluation:

  1. Do reviewers have clear, testable criteria for each category, or are they acting on instinct?
  2. Is there a calibration process for new reviewers and periodic re-calibration?
  3. Is the sample size large enough to detect meaningful drift (20+ outputs per release minimum)?
  4. Are category scores tracked separately, not averaged into one number?
  5. Is disagreement between reviewers measured and addressed?

Evidence and caveats

Sources:

Caveats:

  • Human evaluation is subject to fatigue, anchoring and drift over time. Rotate reviewers and limit session length.
  • Inter-rater agreement is necessary but not sufficient — reviewers can consistently apply a bad rubric.
  • For high-stakes domains (medical, legal, financial), domain-expert reviewers are not optional.


Change Log

  • 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.