When is LLM-as-a-judge the right tool?

When the output is free-form (summaries, explanations, multi-turn dialogue) and there's no exact-match metric. For classification, JSON extraction, or anything with a labelled gold answer, use the deterministic metric instead: it is cheaper and carries none of a judge's biases.

Why is pairwise A/B more reliable than absolute scoring?

Asking 'which is better, A or B?' gives a discriminative signal that compounds well; asking 'rate this 1-5' invites style/length bias and drifts in calibration over time. Pairwise still has position bias — always randomise the order — but it removes the absolute-scale instability.

What's the honest beat about LLM judges?

LLM judges often agree with each other more than they agree with humans: two judge models can share the same training-data biases, so reporting only their mutual agreement is a vanity metric. Calibrate against a small human-rated set before trusting any judge score.

Track 2 · Hands-on · Lesson 16

LLM-as-a-judge: scoring free-form outputs

After this lesson you can decide when LLM-as-a-judge is the right tool (free-form outputs, no exact-match metric), pick pairwise over absolute when reliability matters, write a JSON-schema rubric that constrains the judge to a checkable shape, anticipate the three classic biases (position, length, style), pick a judge model that isn't the model you're testing, and calibrate against a small human-rated set before you trust any aggregate.

Level: intermediate · Read time: ~11 min · Prerequisites: Real metrics with sklearn & HF evaluate

Classification has accuracy. Extraction has F1. JSON has a parse rate. But the moment your model emits a paragraph of explanation, a multi-turn dialogue, or a long-form summary, those metrics give you nothing useful, and grading by hand stops scaling somewhere around the third candidate model. LLM-as-a-judge fills that gap: a strong model reads a rubric and scores the output. It is the standard practice for free-form eval today. It is also the easiest place in the pipeline to lie to yourself, because every step of it can drift and the failure modes are quiet.

When to reach for it (and when not)

LLM-as-a-judge is the right tool when:

The output is free-form: summaries, explanations, customer replies, multi-turn assistant turns, creative completions.
There's no exact-match metric you trust. Human variance on phrasing makes string-match useless.
You have more candidates than you can grade by hand — multiple training runs, multiple decoding configurations, an A/B over data variants.

It is the wrong tool for anything that has a labelled gold answer. Classification, NER, JSON extraction, tool-call generation, retrieval relevance with a ground-truth label — use the deterministic metric. Judges cost more and bring biases you don't need.

Two judging modes

Pairwise A/B (preferred)

Show the judge a prompt and two candidate outputs. Ask which is better, given the rubric. Aggregate win-rates across many prompts. This is the more reliable mode — discriminating between two outputs is a task LLMs do well; the comparative signal compounds, and you don't need a stable absolute scale.

The cost is position bias: judges have a measurable preference for "A" or "B" regardless of content. The mitigation is mandatory: for every pair, ask the judge twice with the order swapped, and count a win only when the judge picks the same candidate in both orders. Pairs where the two orders disagree count as ties; the consistent verdicts are your signal.
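The order-swap rule can be sketched as a small tally. This is a minimal illustration, not a reference implementation: the names resolve_pair and tally, and the convention that the forward order shows candidate X as "A" while the reverse order shows candidate Y as "A", are assumptions made for the example.

```python
from collections import Counter

# Map the judge's positional label back to the underlying candidate.
# Assumed convention: "forward" shows X as A and Y as B; "reverse" swaps them.
LABEL_TO_CANDIDATE = {
    "forward": {"A": "X", "B": "Y", "tie": "tie"},
    "reverse": {"A": "Y", "B": "X", "tie": "tie"},
}


def resolve_pair(forward_label: str, reverse_label: str) -> str:
    """Return "X", "Y", or "tie" for one prompt judged in both orders."""
    fwd = LABEL_TO_CANDIDATE["forward"][forward_label]
    rev = LABEL_TO_CANDIDATE["reverse"][reverse_label]
    # A win counts only when both orders name the same candidate.
    return fwd if fwd == rev and fwd != "tie" else "tie"


def tally(verdicts):
    """verdicts: iterable of (forward_label, reverse_label), one per prompt."""
    counts = Counter(resolve_pair(f, r) for f, r in verdicts)
    total = sum(counts.values())
    return {
        "x_wins": counts["X"],
        "y_wins": counts["Y"],
        "ties": counts["tie"],
        "x_win_rate": counts["X"] / total if total else 0.0,
    }


# Four prompts: one consistent win each for X and Y, two inconsistent verdicts.
verdicts = [("A", "B"), ("A", "A"), ("B", "A"), ("tie", "B")]
summary = tally(verdicts)
```

Note that a judge answering "A" in both orders picked a different candidate each time, so that pair resolves to a tie. A high tie share is itself a useful diagnostic of how strong the position bias is.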

Absolute single-output scoring

Show the judge one output and ask for a numeric score (1–5, 0–10, etc.) against the rubric. Easier to aggregate — averages, distributions, gates — but the scale is unstable: drift in the rubric, drift in the judge model, drift in your own intuition over time. Use absolute scoring when you need a one-shot gate ("does the output clear the bar?") or to track a single model over time, not to compare models against each other.

The JSON-schema rubric pattern

The pattern that makes judging machine-checkable: write the rubric as a JSON schema the judge must return. Free-text justifications are fine and worth keeping (you read them during calibration), but the numeric/categorical fields are what aggregates over the run.

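# pairwise judge prompt and order-swapped call (sketch)
import json

RUBRIC = """
You are evaluating which of two customer-support replies is better.
Rate each on three dimensions, then pick a winner.

Dimensions (each 1-5):
  - correctness: does it answer the customer's actual question?
  - tone: is it polite without being syrupy or evasive?
  - concision: as short as possible without losing necessary detail.

Return ONLY a JSON object matching this schema. No prose outside it.

{
  "winner": "A" | "B" | "tie",
  "scores": {
    "A": {"correctness": int, "tone": int, "concision": int},
    "B": {"correctness": int, "tone": int, "concision": int}
  },
  "reasoning": "one sentence on why the winner won"
}
"""

def judge_pair(call_judge, render_pair, prompt, out_x, out_y):
    """Judge X vs Y in both orders; return "X", "Y", or "tie".

    call_judge(text) -> str and render_pair(prompt, a, b) -> str are
    placeholders for your own judge client and prompt template.
    """
    picks = {}
    # forward shows X as "A"; reverse shows Y as "A"
    for order, a, b, names in [
        ("forward", out_x, out_y, {"A": "X", "B": "Y"}),
        ("reverse", out_y, out_x, {"A": "Y", "B": "X"}),
    ]:
        judgement = call_judge(RUBRIC + render_pair(prompt, a, b))
        parsed = json.loads(judgement)  # raises if the judge added prose
        # translate the positional label back to the underlying candidate
        picks[order] = names.get(parsed["winner"], "tie")

    # count a win only when both orders pick the same candidate
    if picks["forward"] == picks["reverse"]:
        return picks["forward"]
    return "tie"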

The constraints that matter in this pattern: JSON only, no prose, fixed schema, swap-order check, structured numeric fields for aggregation. Skip any of these and you're hand-cleaning judge output for the rest of the project.
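One way to enforce "JSON only, fixed schema" is to validate every judge reply before recording it. The sketch below uses only the standard library and checks the shape described in the rubric above (winner label, three 1-5 integer scores per side, a reasoning string). The function name parse_judgement and the error messages are illustrative assumptions.

```python
import json

DIMENSIONS = ("correctness", "tone", "concision")
WINNERS = ("A", "B", "tie")


def parse_judgement(raw: str) -> dict:
    """Parse a judge reply; raise ValueError if it breaks the rubric schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge did not return valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top-level value must be a JSON object")
    if data.get("winner") not in WINNERS:
        raise ValueError("winner must be 'A', 'B', or 'tie'")
    scores = data.get("scores")
    if not isinstance(scores, dict) or set(scores) != {"A", "B"}:
        raise ValueError("scores must contain exactly the keys 'A' and 'B'")
    for side in ("A", "B"):
        if not isinstance(scores[side], dict):
            raise ValueError(f"scores.{side} must be an object")
        for dim in DIMENSIONS:
            value = scores[side].get(dim)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise ValueError(f"scores.{side}.{dim} must be an integer")
            if not 1 <= value <= 5:
                raise ValueError(f"scores.{side}.{dim} must be between 1 and 5")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    return data


example = json.dumps({
    "winner": "A",
    "scores": {
        "A": {"correctness": 5, "tone": 4, "concision": 3},
        "B": {"correctness": 2, "tone": 4, "concision": 4},
    },
    "reasoning": "A answers the actual question.",
})
parsed = parse_judgement(example)
```

Counting how many replies fail validation gives you a parse rate for the judge itself, which is worth tracking alongside the win-rate.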

The three classic biases

Position bias — judges favour the first (or sometimes second) option presented. Mitigate with mandatory order-swap and "both must agree" win counting.
Length bias: judges tend to prefer the longer answer when in doubt, treating verbosity as effort. Mitigate by including concision as an explicit rubric dimension and by checking how win-rate correlates with output length on a held-out slice. If output length alone systematically predicts the winner, you have length bias.
Style / family bias — judges trained by the same lab as one candidate tend to favour that candidate's style. The mitigation is judge selection: never use the same model family as either candidate, and ideally use two judges from different labs and only treat agreed-on wins as wins.
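The length-bias check described above can be approximated with a very simple statistic: among decided pairs of unequal length, how often did the longer output win? The sketch below is one possible version; the function name longer_win_rate, the record layout, and the sample data are assumptions for illustration.

```python
def longer_win_rate(records):
    """Share of decided, unequal-length pairs won by the longer output.

    records: iterable of (len_x, len_y, winner), winner in {"X", "Y", "tie"}.
    Returns None when no pair qualifies.
    """
    decided = [
        (len_x, len_y, winner)
        for len_x, len_y, winner in records
        if winner in ("X", "Y") and len_x != len_y
    ]
    if not decided:
        return None
    # The longer output won when "X won" and "X is longer" are both true
    # or both false.
    longer_wins = sum(
        1 for len_x, len_y, winner in decided
        if (winner == "X") == (len_x > len_y)
    )
    return longer_wins / len(decided)


# Illustrative data only: (tokens in X, tokens in Y, resolved winner).
records = [
    (120, 40, "X"),    # longer wins
    (30, 90, "Y"),     # longer wins
    (200, 50, "Y"),    # shorter wins
    (80, 80, "X"),     # equal length, skipped
    (60, 150, "tie"),  # tie, skipped
]
rate = longer_win_rate(records)
```

A rate near 0.5 suggests length is not driving verdicts. A rate well above 0.5 is only suspicious, not conclusive, because longer answers can be genuinely better; comparing the same statistic on human-rated pairs helps separate the two.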

Judge-model selection

The rules of thumb that hold up:

The judge must be strictly stronger than every candidate. A weaker model judges by surface markers (length, hedging, formality) more than by content.
The judge must not be in the same family as any candidate — Gemini judges Gemini favourably, GPT judges GPT favourably, Claude judges Claude favourably. The pattern is robust enough that "what family is the judge in?" is the first question to ask of any LLM-as-judge result.
If your candidates are SLMs (135M–7B), a 70B-class judge is enough; you don't need a frontier model just to outclass them. The marginal precision past that point doesn't pay back the cost.
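Pin the judge model and decoding config. A judge sampled at temperature=0.7 can flip its verdict on the same prompt; use temperature=0 and pin the exact model version. Hosted models are not always fully deterministic even at temperature=0, so spot-check that repeated runs agree.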
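Pinning is easier to audit when the judge settings live in one immutable object whose fingerprint is stored next to every result. The sketch below shows one way to do that; the class JudgeConfig, its fields, and the model identifier are hypothetical and not tied to any particular provider.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)  # frozen: settings cannot be mutated after creation
class JudgeConfig:
    model: str                  # exact, versioned model identifier
    temperature: float = 0.0    # deterministic-as-possible decoding
    max_tokens: int = 512
    rubric_version: str = "v1"  # bump whenever the rubric text changes

    def fingerprint(self) -> str:
        """Short stable hash to store alongside every judge verdict."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


# Hypothetical identifier; substitute the pinned version you actually use.
config = JudgeConfig(model="example-judge-70b-2024-01-01")
run_tag = config.fingerprint()
```

Two result files with different fingerprints were produced by different judge setups and should not be compared without re-running calibration.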

Calibration against a human-rated set

This is the step everyone skips and then everyone regrets. The judge's score is not the ground truth — it's a proxy. The calibration cadence:

Take 50–200 pairs from your eval set.
Have 2–3 humans (you, a colleague, a contractor — anyone with the domain context) independently rate each pair against the same rubric the judge is using. Resolve disagreements; this becomes your human-rated gold-of-the-gold.
Run the judge over the same pairs.
Compute judge-vs-human agreement: Cohen's κ for categorical verdicts, Spearman ρ for ordinal scores. Anything below κ ≈ 0.4 / ρ ≈ 0.5 means the judge is not measuring what you think it is measuring; revise the rubric, swap the judge, or accept that this dimension needs human evaluation.
Pin the human set. Re-run it whenever you change the rubric, the judge model, or the candidate population.
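For the agreement step, Cohen's κ is the observed agreement corrected for the agreement two raters would reach by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it with the standard library so the arithmetic is visible; the label lists are made-up illustration data. In practice sklearn.metrics.cohen_kappa_score and scipy.stats.spearmanr cover the same ground.

```python
import math
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # p_o: share of items where the two raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected by chance from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return math.nan  # both raters used one identical label; undefined
    return (observed - expected) / (1 - expected)


# Illustrative verdicts on eight pairs (resolved human label vs judge label).
human = ["A", "A", "B", "B", "tie", "A", "B", "A"]
judge = ["A", "B", "B", "B", "tie", "A", "A", "A"]
kappa = cohens_kappa(human, judge)  # p_o = 0.75, p_e = 26/64, kappa = 11/19
```

Here the raw agreement of 0.75 shrinks to roughly 0.58 once chance agreement is removed, which is why κ rather than plain percent agreement is the number to report.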

Honest beat — LLM judges agree with each other more than they agree with humans

Two judge models will often correlate 0.8+ with each other while correlating only 0.4–0.6 with human raters. Reporting "two judges agreed on the same winner" sounds rigorous and is often a vanity statistic; both judges are sharing the same length / style / position biases. The numbers that matter are judge-vs-human agreement on your calibration set. If you don't have that, you don't have an eval — you have an aesthetic.

Aggregation and reporting

Once judging is calibrated, the honest report has three numbers:

Win-rate over the eval set (pairwise) or mean rubric score (absolute), with a confidence interval — bootstrap, not the standard error of a mean over correlated examples.
Per-dimension breakdown. The aggregate hides patterns. If your model wins on correctness but loses on concision, the user wants to know.
The calibration κ. Reporting a judge win-rate without the calibration agreement is reporting a number without its units.
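A percentile bootstrap is one straightforward way to put an interval on the win-rate. The sketch below resamples prompts with replacement and assumes one resolved outcome per prompt (1 for a candidate win, 0 for a loss or tie); if several outcomes share a prompt, resample whole prompts instead. The function name, defaults, and sample data are illustrative assumptions.

```python
import random


def bootstrap_win_rate_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Return (point_estimate, lower, upper) for the win-rate.

    outcomes: list of 1 (candidate won) or 0 (loss or tie), one per prompt.
    """
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    rng = random.Random(seed)  # fixed seed so the report is reproducible
    n = len(outcomes)
    point = sum(outcomes) / n
    # Resample n prompts with replacement and recompute the win-rate each time.
    rates = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


# Illustrative data: 60 wins out of 100 prompts.
outcomes = [1] * 60 + [0] * 40
point, lower, upper = bootstrap_win_rate_ci(outcomes)
```

If the interval straddles 0.5, the eval set is too small to call a winner, whatever the point estimate says.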

Key idea

LLM-as-a-judge is the practical answer to "how do I score free-form outputs at scale?" — but it's a measurement device, not the truth. Prefer pairwise A/B with order-swapping, write the rubric as JSON-schema-constrained, never use a judge from the same family as a candidate, and calibrate against humans before you trust the aggregate. A judge win-rate without a calibration agreement is half a number.

The next lesson covers public benchmarks: a different (and complementary) eval layer that catches catastrophic regressions a task-specific judge would never see.

Key terms

LLM-as-a-judge: The practice of using a strong language model to score the outputs of another model against a written rubric, used when exact-match metrics don't apply.
Rubric: The written criteria the judge scores against — dimensions (correctness, tone, concision, etc.) and a scale. The clearer the rubric, the more consistent the judge.
Pairwise judging (A/B): Showing the judge two outputs and asking which is better. More reliable than absolute scoring because it needs no stable absolute scale and its verdicts aggregate into win-rates; introduces position bias, which order-swapping mitigates.
Position bias: The judge's preference for the first (or sometimes second) option presented, independent of content. Mitigated by running every pair in both orders and counting only agreed-on wins.
Length bias: The judge's tendency to prefer the longer answer, treating verbosity as effort. Mitigated by including concision in the rubric and checking length / win-rate correlation.
Style / family bias: The judge's tendency to prefer outputs from its own model family. Mitigated by judge selection: never reuse a candidate's family.
JSON-schema rubric: The pattern of requiring the judge to return a fixed-schema JSON object so per-dimension scores are machine-aggregatable. No free-text outside the schema.
Judge calibration: Comparing judge scores against a small human-rated set (50–200 pairs) and reporting the agreement (Cohen's κ, Spearman ρ). Required to interpret any aggregate judge metric.
Inter-judge agreement: Correlation between two judge models. Often higher than judge-vs-human agreement; reporting it instead of human calibration is a vanity metric.
