LLM-as-a-judge: when automated grading helps and when it lies
Using a strong model to evaluate the outputs of another model is the most popular quality-check shortcut in the LLM product world. Write a grading prompt, feed it the input and the candidate output, and get a score. Fast, cheap, scalable.
The output looks authoritative. It is not.
LLM-as-a-judge suffers from position bias, verbosity bias, self-enhancement bias and a tendency to reward style over substance. Used well, it catches coarse regressions. Used as a universal quality gate, it creates confident but wrong decisions.
Quick answer
Use LLM-as-a-judge for coarse regression detection and relative comparisons between prompt or model versions — “did the new prompt tank the pass rate?” — not for absolute quality measurement. Always validate the judge’s decisions against human review on a sample before trusting the scores. If the judge consistently prefers longer, more confident or more verbose outputs regardless of correctness, switch to a narrower assertion-based test.
What this means
How LLM-as-a-judge works: You give a grading model a scoring prompt — a rubric, the original user input, the model’s output, sometimes a reference answer — and ask it to return a score (1–5, pass/fail, etc.) with or without a justification. The most common setups use GPT-4, Claude or Gemini as the judge, evaluating outputs from smaller or specialised models.
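As a minimal sketch of the setup described above, the following Python assembles a scoring prompt and parses the judge’s reply. The function names, the prompt wording and the `SCORE:` line format are illustrative assumptions, not a standard. The call to the judge model itself is left out because it depends on your provider.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    score: int
    justification: str


def build_judge_prompt(rubric: str, user_input: str, candidate: str,
                       reference: Optional[str] = None) -> str:
    """Assemble rubric, input, output and optional reference into one prompt."""
    sections = [
        "You are grading a model's answer against a rubric.",
        f"Rubric:\n{rubric}",
        f"User input:\n{user_input}",
        f"Model output:\n{candidate}",
    ]
    if reference is not None:
        sections.append(f"Reference answer:\n{reference}")
    sections.append(
        'Give a brief justification, then end with a line "SCORE: <integer 1-5>".'
    )
    return "\n\n".join(sections)


# Hypothetical reply format: a line consisting only of "SCORE: n", n in 1..5.
SCORE_RE = re.compile(r"^SCORE:\s*([1-5])\s*$", re.MULTILINE)


def parse_verdict(reply: str) -> Optional[Verdict]:
    """Return the last SCORE line as a Verdict, or None if the reply has none."""
    matches = SCORE_RE.findall(reply)
    if not matches:
        return None  # treat as a judge failure; do not silently assign a score
    justification = SCORE_RE.sub("", reply).strip()
    return Verdict(score=int(matches[-1]), justification=justification)
```

Returning `None` for an unparseable reply, rather than a default score, keeps judge failures visible in the pass-rate statistics.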
Known biases:
- Position bias: Given two outputs to compare (A vs B), the judge picks whichever appears first more often than quality alone would explain: roughly 50–65% of the time between equally good answers, depending on the judge model and prompt template. This is well documented in the evaluation literature and means that A/B comparisons that always present options in the same order are structurally biased.
- Verbosity bias: The judge assigns higher scores to longer, more detailed outputs — even when the extra detail is wrong or irrelevant. An incorrect but elaborate answer often scores higher than a correct but concise one.
- Self-enhancement bias: The judge favours outputs it generated itself, or outputs that resemble its own style. If GPT-4 is the judge and the candidate model produces GPT-4-like prose (bulleted lists, cautious hedging, structured format), that output tends to score higher than an equally correct but differently structured answer.
- Rubric neglect: The judge ignores or misapplies the rubric when the scoring prompt is long or complex, falling back to a general “does this sound good?” heuristic. The more detailed the rubric, the more likely the judge is to ignore parts of it.
- Calibration blindness: The judge’s stated confidence is a poor predictor of whether its score is right. For illustration, a judge that reports 95% confidence in its verdicts may agree with human judgement in only 70% of cases.
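One rough way to look for verbosity bias in your own data is to check whether judge scores track output length more closely than they track correctness. The sketch below assumes a hypothetical record format with `output`, `judge_score` and `human_correct` fields, where the last comes from human review. A high correlation is only a warning sign, since longer answers can also be genuinely better.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # no variation on one axis, so no measurable relationship
    return cov / sqrt(var_x * var_y)


def mean_score(records):
    """Mean judge score of a list of records, or None if the list is empty."""
    if not records:
        return None
    return sum(r["judge_score"] for r in records) / len(records)


def verbosity_report(records):
    """Summarise how judge scores relate to length and to human-checked correctness."""
    lengths = [len(r["output"].split()) for r in records]  # length in words
    scores = [r["judge_score"] for r in records]
    return {
        "length_score_corr": pearson(lengths, scores),
        "mean_score_correct": mean_score([r for r in records if r["human_correct"]]),
        "mean_score_incorrect": mean_score([r for r in records if not r["human_correct"]]),
    }
```

If `mean_score_incorrect` is close to or above `mean_score_correct` while `length_score_corr` is strongly positive, the judge is probably rewarding length rather than correctness.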
What mitigates them:
- Randomise the order of options being compared: run each comparison twice with A and B swapped, count a win only when both runs agree, and treat disagreements as ties.
- Include a “tie” option so the judge does not have to force a preference.
- Use direct scoring (score individual outputs against a rubric) instead of pairwise comparison to reduce position effects.
- Validate judge scores against human review on a held-out sample of 50–100 cases.
- Narrow the judge’s role: instead of “rate quality 1–5,” ask “does the output contain any factual error? Answer yes or no.”
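The order-swapping and tie-handling mitigations above can be sketched as follows. The `judge` argument is a hypothetical callable that wraps your judge model and returns `"first"`, `"second"` or `"tie"` for the two answers in the order shown to it. That signature is an assumption for illustration, not any library’s API.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer) -> verdict
Judge = Callable[[str, str, str], str]

# Map a positional verdict back to the underlying option for each presentation order.
ORIGINAL_ORDER = {"first": "A", "second": "B", "tie": "tie"}
SWAPPED_ORDER = {"first": "B", "second": "A", "tie": "tie"}


def compare_with_swap(judge: Judge, question: str, a: str, b: str) -> str:
    """Run the comparison in both orders and return 'A', 'B', 'tie' or 'inconsistent'.

    A win is reported only if the judge picks the same option in both orders.
    A KeyError is raised if the judge returns an unexpected verdict string.
    """
    verdict_ab = ORIGINAL_ORDER[judge(question, a, b)]  # A shown first
    verdict_ba = SWAPPED_ORDER[judge(question, b, a)]   # B shown first
    if verdict_ab == verdict_ba:
        return verdict_ab
    return "inconsistent"  # the verdict changed with position, so do not count it
```

The share of `"inconsistent"` results over a batch is a direct, if coarse, measure of how much position bias your particular judge and prompt template have.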
Where teams misuse it
Using LLM-as-a-judge as the only evaluation method. A team deploys a GPT-4-as-judge pipeline, sees a 92% pass rate, and considers quality verified. There is no human review, no cross-validation and no check for systematic biases. Months later, user complaints reveal that the judge was rewarding verbose, confident-sounding wrong answers while penalising short, cautious-but-correct ones. The 92% pass rate was not a quality signal; it was a style-matching score.
Asking the judge to score something it cannot evaluate. A team asks an LLM judge to rate “regulatory compliance” of financial advice outputs without giving the judge the regulation text. The judge guesses based on general knowledge, assigns high scores to outputs that sound compliant but are not, and misses violations that a human reviewer with the regulation text would catch immediately. The judge can only evaluate against criteria that are in the scoring prompt — if the criteria are not fully specified, the judge fills the gaps with its own training data.
Not detecting when the judge starts failing. A prompt change or model update shifts the judge’s own behaviour, making it more lenient or stricter. The evaluation pipeline keeps running, producing scores that drift away from human judgement — but nobody is checking because the pipeline is “automated.” Regular cross-validation catches this. Without it, the evaluation silently decays.
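A recurring cross-validation check could look something like the sketch below. It compares judge labels with human labels on a fresh sample using Cohen’s kappa, which corrects raw agreement for chance, and flags drift when agreement falls well below a stored baseline. The function names and the 0.1 tolerance are assumptions chosen for illustration, so pick a threshold that suits your own sample sizes.

```python
def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two equal-length label sequences."""
    n = len(judge_labels)
    if n == 0 or n != len(human_labels):
        raise ValueError("need two non-empty label sequences of equal length")
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    labels = set(judge_labels) | set(human_labels)
    # Agreement expected if both raters labelled at random with their own label rates.
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def judge_has_drifted(baseline_kappa, current_kappa, tolerance=0.1):
    """True if agreement with humans dropped more than `tolerance` below baseline."""
    return current_kappa < baseline_kappa - tolerance
```

For example, after a judge model update you might score a fresh human-reviewed sample, compute `cohens_kappa` on it, and compare the result with the value recorded when the pipeline was first validated.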
Practical decision check
Before relying on an LLM judge score:
- Did you run a calibration check recently — comparing judge scores against human review on a fresh sample?
- Is the scoring prompt short enough that rubric neglect is unlikely (as a rough guide, under 500 words)?
- Are you measuring relative change (pass rate went from 85% to 82%) rather than absolute quality?
- Have you tested for position bias by reversing order and checking consistency?
- Does the judge only score things it can actually assess from the scoring prompt and the output, with any regulation or reference text it needs included in the prompt rather than left to its general knowledge?
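When measuring relative change, it also helps to check whether a shift such as 85% to 82% is larger than sampling noise. The sketch below applies a standard two-proportion z-test using only the Python standard library. It assumes the two runs are independent samples; if the same test cases are scored before and after, a paired test such as McNemar’s would be more appropriate. The function name is hypothetical.

```python
from math import erfc, sqrt


def pass_rate_change(pass_before, n_before, pass_after, n_after):
    """Return (change in pass rate, two-sided p-value) for two independent runs."""
    rate_before = pass_before / n_before
    rate_after = pass_after / n_after
    # Pooled pass rate under the null hypothesis that nothing changed.
    pooled = (pass_before + pass_after) / (n_before + n_after)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if std_err == 0:
        return rate_after - rate_before, 1.0  # all pass or all fail in both runs
    z = (rate_after - rate_before) / std_err
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return rate_after - rate_before, p_value


# The same 85% -> 82% drop at two sample sizes.
small_change, small_p = pass_rate_change(170, 200, 164, 200)
large_change, large_p = pass_rate_change(4250, 5000, 4100, 5000)
```

With 200 cases per run the drop is well within noise, while with 5,000 cases per run the same three-point drop is very unlikely to be chance. The judge’s own biases are not captured by this test; it only addresses sampling variation.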
Evidence and caveats
Sources:
- Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” NeurIPS 2023. https://arxiv.org/abs/2306.05685 (foundational paper on position, verbosity and self-enhancement bias).
- Wang et al., “Large Language Models are Not Fair Evaluators,” arXiv:2305.17926, 2023. https://arxiv.org/abs/2305.17926 (position bias in pairwise evaluation and calibration methods that reduce it).
- Promptfoo LLM-as-judge documentation: https://www.promptfoo.dev/docs/guides/llm-as-judge/
- DeepEval LLM-as-judge: https://docs.confident-ai.com/docs/metrics-llm-eval
Caveats:
- Position bias varies by model and prompt template. Test your specific setup.
- LLM-as-judge is useful for screening (flagging likely failures) but not for certification (confirming quality).
- If you cannot validate against human judgement, you cannot trust the absolute scores.
Last checked: 2026-05-25.
Related reading
- Human evaluation for LLMs: rubrics that editors and SMEs can actually use
- Synthetic eval datasets: useful shortcut or false confidence?
- Eval CI for AI apps: testing prompts before every release
- Contamination and leakage: why benchmark scores can be too good
Change Log
- 2026-05-27: Added direct source URLs to all named providers and services; added Change Log section. Content unchanged.