LLM-as-a-judge: when automated grading helps and when it lies

Using a strong model to evaluate the outputs of another model is the most popular quality-check shortcut in the LLM product world. Write a grading prompt, feed it the input and the candidate output, and get a score. Fast, cheap, scalable.

The output looks authoritative. It is not.

LLM-as-a-judge suffers from position bias, verbosity bias, self-enhancement bias and a tendency to reward style over substance. Used well, it catches coarse regressions. Used as a universal quality gate, it creates confident but wrong decisions.

TL;DR

Use LLM-as-a-judge for coarse regression detection and relative comparisons between prompt or model versions (“did the new prompt tank the pass rate?”), not for absolute quality measurement. Always validate the judge’s decisions against human review on a sample before trusting the scores. If the judge consistently prefers longer or more confident-sounding outputs regardless of correctness, switch to a narrower assertion-based test.

What this means

How LLM-as-a-judge works: You give a grading model a scoring prompt — a rubric, the original user input, the model’s output, sometimes a reference answer — and ask it to return a score (1–5, pass/fail, etc.) with or without a justification. The most common setups use GPT-4, Claude or Gemini as the judge, evaluating outputs from smaller or specialised models.
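A minimal sketch of that setup may make the moving parts easier to see. Everything here is illustrative: `build_scoring_prompt`, `parse_score`, `judge_output` and the `SCORE: <1-5>` reply format are hypothetical names and conventions, and `call_judge` stands in for whatever client function sends a prompt to your judge model and returns its text reply.

```python
import re
from typing import Callable, Optional


def build_scoring_prompt(rubric: str, user_input: str, output: str,
                         reference: Optional[str] = None) -> str:
    """Assemble the grading prompt: rubric, input, output, optional reference."""
    parts = [
        "You are grading a model response against a rubric.",
        f"Rubric:\n{rubric}",
        f"User input:\n{user_input}",
        f"Model output:\n{output}",
    ]
    if reference is not None:
        parts.append(f"Reference answer:\n{reference}")
    # Assumed reply format; any strict, easily parsed format would do.
    parts.append("Reply with 'SCORE: <1-5>' on the first line, "
                 "then a one-sentence justification.")
    return "\n\n".join(parts)


def parse_score(reply: str) -> Optional[int]:
    """Return the 1-5 score, or None when the reply does not follow the format."""
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def judge_output(call_judge: Callable[[str], str], rubric: str, user_input: str,
                 output: str, reference: Optional[str] = None) -> Optional[int]:
    """Send one grading prompt through call_judge (a hypothetical client wrapper)."""
    prompt = build_scoring_prompt(rubric, user_input, output, reference)
    return parse_score(call_judge(prompt))
```

Returning `None` for a malformed reply, rather than guessing a score, keeps parsing failures visible so they can be counted separately from low scores.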

Known biases:

Position bias: Given two outputs to compare (A vs B), the judge picks whichever appears first roughly 50–65% of the time depending on model and prompt template, where 50% would mean no bias, independent of quality. This is well documented in the evaluation literature (Zheng et al. 2023, Wang et al. 2023) and means that A/B comparisons where you always present options in the same order are structurally biased.
Verbosity bias: The judge assigns higher scores to longer, more detailed outputs — even when the extra detail is wrong or irrelevant. An incorrect but elaborate answer often scores higher than a correct but concise one.
Self-enhancement bias: The judge prefers outputs that match its own style. If GPT-4 is the judge and the candidate model produces GPT-4-like prose (bulleted lists, cautious hedging, structured format), it scores higher than an equally correct but differently structured answer.
Rubric neglect: The judge ignores or misapplies the rubric when the scoring prompt is long or complex, falling back to a general “does this sound good?” heuristic. The more detailed the rubric, the more likely the judge is to ignore parts of it.
Calibration blindness: The judge’s confidence in its own score is not reliably related to the accuracy of that score. As an illustration, a judge that rates itself 95% confident may agree with human judgement only 70% of the time.
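One rough way to look for verbosity bias in your own scores is to check whether judge scores track output length. The sketch below computes a plain Pearson correlation between word count and score; the function names are hypothetical, and a high correlation is only a prompt to investigate, since longer answers can also be genuinely better.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either series has no variance."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def length_score_correlation(outputs: Sequence[str],
                             scores: Sequence[float]) -> float:
    """Correlate each output's word count with the score the judge gave it."""
    lengths = [float(len(text.split())) for text in outputs]
    return pearson(lengths, scores)
```

The check is more telling when run only on outputs that human reviewers have already marked incorrect: if the judge still scores the longer wrong answers higher, length is doing work that correctness should be doing.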

What mitigates them:

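Randomise the order of options being compared, or run each comparison twice with A and B swapped and count a win only when the same output is preferred in both orders; treat order-dependent verdicts as ties.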
Include a “tie” option so the judge does not have to force a preference.
Use direct scoring (score individual outputs against a rubric) instead of pairwise comparison to reduce position effects.
Validate judge scores against human review on a held-out sample of 50–100 cases.
Narrow the judge’s role: instead of “rate quality 1–5,” ask “does the output contain any factual error? Answer yes or no.”
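The order-swapping and tie mitigations can be sketched in a few lines. This assumes a hypothetical `compare(first, second)` function that wraps your judge call and returns `"first"`, `"second"` or `"tie"`; the name and return convention are illustrative, not any library's API.

```python
from typing import Callable

# Hypothetical judge wrapper: compare(first, second) -> "first" | "second" | "tie"
Compare = Callable[[str, str], str]


def debiased_preference(compare: Compare, a: str, b: str) -> str:
    """Judge the pair in both presentation orders.

    Returns "A" or "B" only when the same output wins in both orders;
    any order-dependent or tied verdict is reported as "tie".
    """
    forward = compare(a, b)    # A is shown first
    backward = compare(b, a)   # B is shown first
    if forward == "first" and backward == "second":
        return "A"
    if forward == "second" and backward == "first":
        return "B"
    return "tie"
```

A judge that always picks the first slot produces `"tie"` for every pair under this scheme, so pure position bias cannot manufacture a winner. The cost is two judge calls per comparison.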

For guidance on setting up a human review panel with clear rubrics that non-technical editors and SMEs can apply consistently, see our guide on human evaluation for LLMs.

Where teams misuse it

Using LLM-as-a-judge as the only evaluation method. A team deploys a GPT-4-as-judge pipeline, sees a 92% pass rate, and considers quality verified. There is no human review, no cross-validation and no check for systematic biases. Months later, user complaints reveal that the judge was rewarding verbose, confident-sounding wrong answers while penalising short, cautious-but-correct ones. The 92% pass rate was not a quality signal; it was a style-matching score.

Asking the judge to score something it cannot evaluate. A team asks an LLM judge to rate “regulatory compliance” of financial advice outputs without giving the judge the regulation text. The judge guesses based on general knowledge, assigns high scores to outputs that sound compliant but are not, and misses violations that a human reviewer with the regulation text would catch immediately. The judge can only evaluate against criteria that are in the scoring prompt — if the criteria are not fully specified, the judge fills the gaps with its own training data.

Not detecting when the judge starts failing. A prompt change or model update shifts the judge’s own behaviour, making it more lenient or stricter. The evaluation pipeline keeps running, producing scores that drift away from human judgement — but nobody is checking because the pipeline is “automated.” Regular cross-validation catches this. Without it, the evaluation silently decays.
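A recurring cross-validation check can be as simple as comparing judge labels with human labels on a fresh sample and flagging the judge when agreement drops. The sketch below uses Cohen's kappa as one possible agreement statistic; the function names and the 0.6 threshold are assumptions for illustration, not values taken from the sources in this guide.

```python
from typing import Sequence


def cohen_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Agreement between judge and human labels, corrected for chance."""
    n = len(judge)
    if n == 0 or n != len(human):
        raise ValueError("need two non-empty label lists of equal length")
    observed = sum(j == h for j, h in zip(judge, human)) / n
    labels = set(judge) | set(human)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum((list(judge).count(label) / n) * (list(human).count(label) / n)
                   for label in labels)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def judge_has_drifted(judge: Sequence[str], human: Sequence[str],
                      min_kappa: float = 0.6) -> bool:
    """True when agreement falls below min_kappa (an assumed threshold)."""
    return cohen_kappa(judge, human) < min_kappa
```

Kappa is used here rather than raw agreement because a judge that passes almost everything can match humans most of the time by chance alone when most outputs are in fact acceptable. Whatever statistic you choose, the point is to run it on a schedule and after every judge prompt or model change.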

Practical decision check

Before relying on an LLM judge score:

Did you run a calibration check recently — comparing judge scores against human review on a fresh sample?
Is the scoring prompt short enough that rubric neglect is unlikely (under 500 words)?
Are you measuring relative change (pass rate went from 85% to 82%) rather than absolute quality?
Have you tested for position bias by reversing order and checking consistency?
Does the judge only evaluate things it can actually assess from the scoring prompt and the output, with any regulation text or other reference material it needs included in the prompt?

Methodology

Data checked: 2026-05-25
Sources consulted: Academic literature on LLM-as-judge methodology (Zheng et al. 2023, Wang et al. 2023), evaluation tool documentation (Promptfoo, DeepEval), and known bias studies.
Assumptions: Position bias varies by model and prompt template. LLM-as-a-judge is useful for screening (flagging likely failures) but not for certification (confirming quality). If you cannot validate against human judgement, you cannot trust the absolute scores.
Limitations: This guide covers known biases and mitigation strategies. It does not provide a specific judge-model recommendation or a production-ready evaluation pipeline. Judge behaviour may change with model updates.
Jurisdiction: Global. Evaluation methodology is jurisdiction-agnostic; specific regulatory requirements for automated decision-making oversight may apply under the EU AI Act or sector-specific rules.

Source list

Zheng et al., “Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena,” NeurIPS 2023 — https://arxiv.org/abs/2306.05685 (accessed 2026-05-25)
Wang et al., “Large Language Models are Not Fair Evaluators,” arXiv:2305.17926, 2023 — https://arxiv.org/abs/2305.17926 (accessed 2026-05-25)
Promptfoo LLM-as-judge documentation — https://www.promptfoo.dev/docs/guides/llm-as-judge/ (accessed 2026-05-25)
DeepEval LLM-as-judge — https://docs.confident-ai.com/docs/metrics-llm-eval (accessed 2026-05-25)

Trust Stack

Last checked: 2026-05-28
Corrections: Contact us to report errors

Change log

2026-05-28: full editorial review — added Editor’s Notes, Trust Stack, slugified heading IDs, formal Methodology and Source list sections, renamed “Related reading” to “Related guides”, fixed writtenBy label (editor)
2026-05-25: first published