Track 2 · Hands-on · Lesson 16

LLM-as-a-judge: scoring free-form outputs

After this lesson you can decide when LLM-as-a-judge is the right tool (free-form outputs, no exact-match metric), pick pairwise over absolute when reliability matters, write a JSON-schema rubric that constrains the judge to a checkable shape, anticipate the three classic biases (position, length, style), pick a judge model that isn't the model you're testing, and calibrate against a small human-rated set before you trust any aggregate.

Level: intermediate · Read time: ~11 min · Prerequisites: Real metrics with sklearn & HF evaluate

Classification has accuracy. Extraction has F1. JSON has a parse rate. But the moment your model emits a paragraph of explanation, a multi-turn dialogue, or a long-form summary, those metrics give you nothing useful, and grading by hand stops scaling somewhere around the third candidate model. LLM-as-a-judge fills that gap: a strong model reads a rubric and scores the output. It is the standard practice for free-form eval today. It is also the easiest place in the pipeline to lie to yourself, because every step of it can drift and the failure modes are quiet.

When to reach for it (and when not)

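LLM-as-a-judge is the right tool when the output is free-form (a paragraph of explanation, a multi-turn dialogue, a long-form summary) and no exact-match metric applies.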

It is the wrong tool for anything that has a labelled gold answer. Classification, NER, JSON extraction, tool-call generation, retrieval relevance with a ground-truth label — use the deterministic metric. Judges cost more and bring biases you don't need.

Two judging modes

Pairwise A/B (preferred)

Show the judge a prompt and two candidate outputs. Ask which is better, given the rubric. Aggregate win-rates across many prompts. This is the more reliable mode — discriminating between two outputs is a task LLMs do well; the comparative signal compounds, and you don't need a stable absolute scale.

The cost is position bias: judges have a measurable preference for "A" or "B" regardless of content. The mitigation is mandatory: for every pair, ask the judge twice with the order swapped, and count a win only when the judge picks the same candidate (not the same label) in both orders. Treat any pair where the two orders disagree as a tie; the pairs that survive the swap are your signal.
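The aggregation step can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes each pair's two verdicts have already been mapped from positional labels back to candidate names, and the function name and the `"x"` / `"y"` / `"tie"` verdict format are hypothetical.

```python
from collections import Counter

def aggregate_win_rates(pair_verdicts):
    """Aggregate swapped-order verdicts into win and tie rates.

    pair_verdicts: list of (forward, reverse) tuples. Each entry is
    "x", "y", or "tie" and names the candidate (not the position) the
    judge preferred in that order. This input format is an assumption.
    """
    counts = Counter()
    for forward, reverse in pair_verdicts:
        if forward == reverse and forward != "tie":
            counts[forward] += 1   # both orders agree on one candidate
        else:
            counts["tie"] += 1     # split verdict or explicit tie
    total = len(pair_verdicts)
    if total == 0:
        return {"x": 0.0, "y": 0.0, "tie": 0.0}
    return {k: counts[k] / total for k in ("x", "y", "tie")}

# Five prompts, each judged in both orders (made-up verdicts).
verdicts = [("x", "x"), ("x", "y"), ("y", "y"), ("tie", "x"), ("x", "x")]
rates = aggregate_win_rates(verdicts)
print(rates)
```

Reporting the tie rate next to the win rates keeps the cost of position bias visible: a high tie rate means many verdicts flipped when the order changed.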

Absolute single-output scoring

Show the judge one output and ask for a numeric score (1–5, 0–10, etc.) against the rubric. Easier to aggregate — averages, distributions, gates — but the scale is unstable: drift in the rubric, drift in the judge model, drift in your own intuition over time. Use absolute scoring when you need a one-shot gate ("does the output clear the bar?") or to track a single model over time, not to compare models against each other.
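A one-shot gate over absolute scores might look like the sketch below. The function name and the threshold of 4 on a 1-5 scale are assumptions chosen for illustration; pick the bar that matches your own rubric.

```python
from statistics import mean

def gate_pass_rate(scores, threshold=4):
    """Return (mean score, fraction of outputs at or above the threshold)."""
    if not scores:
        raise ValueError("no scores to aggregate")
    passed = sum(1 for s in scores if s >= threshold)
    return mean(scores), passed / len(scores)

# Made-up judge scores for eight outputs from a single model.
scores = [5, 4, 3, 4, 2, 5, 4, 4]
avg, pass_rate = gate_pass_rate(scores, threshold=4)
print(avg, pass_rate)
```

The pass rate answers the gate question directly, while the mean is better suited to tracking the same model over time with the rubric and judge held fixed.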

The JSON-schema rubric pattern

The pattern that makes judging machine-checkable: write the rubric as a JSON schema the judge must return. Free-text justifications are fine and worth keeping (you read them during calibration), but the numeric/categorical fields are what aggregates over the run.

# pairwise judge prompt (sketch)
RUBRIC = """
You are evaluating which of two customer-support replies is better.
Rate each on three dimensions, then pick a winner.

Dimensions (each 1-5):
  - correctness: does it answer the customer's actual question?
  - tone: is it polite without being syrupy or evasive?
  - concision: as short as possible without losing necessary detail.

Return ONLY a JSON object matching this schema. No prose outside it.

{
  "winner": "A" | "B" | "tie",
  "scores": {
    "A": {"correctness": int, "tone": int, "concision": int},
    "B": {"correctness": int, "tone": int, "concision": int}
  },
  "reasoning": "one sentence on why the winner won"
}
"""

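import json

# Map the judge's positional label back to the underlying candidate.
# Forward shows out_x as "A"; reverse shows out_y as "A".
LABEL_TO_CANDIDATE = {
    "forward": {"A": "x", "B": "y", "tie": "tie"},
    "reverse": {"A": "y", "B": "x", "tie": "tie"},
}

# render and call twice with order swapped
# (judge_model, render_pair, record, prompt, out_x, out_y are placeholders)
verdicts = {}
for order, a, b in [("forward", out_x, out_y), ("reverse", out_y, out_x)]:
    judgement = judge_model.chat(RUBRIC + render_pair(prompt, a, b))
    parsed = json.loads(judgement)
    record(order, parsed)
    verdicts[order] = LABEL_TO_CANDIDATE[order][parsed["winner"]]

# count a win only when both orders pick the same candidate
# (comparing raw "A"/"B" labels would reward position bias instead)
if verdicts["forward"] == verdicts["reverse"] and verdicts["forward"] != "tie":
    winner = verdicts["forward"]
else:
    winner = "tie"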

The constraints that matter in this pattern: JSON only, no prose, fixed schema, swap-order check, structured numeric fields for aggregation. Skip any of these and you're hand-cleaning judge output for the rest of the project.
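To enforce the fixed schema rather than trust it, the judge's reply can be validated before it is recorded. The sketch below checks the shape used in the rubric above with plain Python; the function name `parse_judgement` and the decision to reject out-of-range scores are assumptions, and a JSON Schema validation library would be a reasonable alternative.

```python
import json

DIMENSIONS = ("correctness", "tone", "concision")
WINNERS = {"A", "B", "tie"}

def parse_judgement(raw):
    """Parse judge output and check it against the rubric's shape.

    Returns the parsed dict, or raises ValueError describing the problem.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("top level must be a JSON object")
    winner = data.get("winner")
    if not isinstance(winner, str) or winner not in WINNERS:
        raise ValueError(f"bad winner: {winner!r}")
    scores = data.get("scores")
    if not isinstance(scores, dict):
        raise ValueError("missing scores object")
    for label in ("A", "B"):
        entry = scores.get(label)
        if not isinstance(entry, dict):
            raise ValueError(f"missing scores for {label}")
        for dim in DIMENSIONS:
            value = entry.get(dim)
            # bool is a subclass of int in Python, so exclude it explicitly
            if isinstance(value, bool) or not isinstance(value, int):
                raise ValueError(f"{label}.{dim} must be an int, got {value!r}")
            if not 1 <= value <= 5:
                raise ValueError(f"{label}.{dim} out of range 1-5: {value}")
    if not isinstance(data.get("reasoning"), str):
        raise ValueError("reasoning must be a string")
    return data

good = json.dumps({
    "winner": "A",
    "scores": {
        "A": {"correctness": 5, "tone": 4, "concision": 4},
        "B": {"correctness": 3, "tone": 4, "concision": 2},
    },
    "reasoning": "A answers the question directly.",
})
parsed = parse_judgement(good)
print(parsed["winner"], parsed["scores"]["A"]["correctness"])
```

Counting how often validation fails is itself worth tracking: a rising failure rate usually means the judge prompt or the judge model has changed.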

The three classic biases
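Of the three biases (position, length, style), length bias is the one you can probe directly from logged results by checking whether the longer output wins more often than the shorter one. The sketch below is one rough way to do that, assuming each result row records both output lengths and a winner already mapped to a candidate name; the field names are hypothetical.

```python
def longer_wins_rate(results):
    """Fraction of decided pairs in which the longer output won.

    results: list of dicts with keys "len_x", "len_y" (lengths in any
    consistent unit, e.g. characters) and "winner" ("x", "y", or "tie").
    Ties and equal-length pairs are skipped. Returns None if nothing is left.
    """
    longer_won = 0
    decided = 0
    for r in results:
        if r["winner"] == "tie" or r["len_x"] == r["len_y"]:
            continue
        longer = "x" if r["len_x"] > r["len_y"] else "y"
        decided += 1
        if r["winner"] == longer:
            longer_won += 1
    return longer_won / decided if decided else None

# Made-up results for five pairs.
results = [
    {"len_x": 420, "len_y": 180, "winner": "x"},
    {"len_x": 150, "len_y": 390, "winner": "y"},
    {"len_x": 300, "len_y": 310, "winner": "x"},
    {"len_x": 200, "len_y": 200, "winner": "y"},
    {"len_x": 500, "len_y": 120, "winner": "tie"},
]
rate = longer_wins_rate(results)
print(rate)
```

A rate well above 0.5 is a prompt to look closer, not proof of bias: longer answers are sometimes genuinely better, so compare the same rate on your human-rated pairs before drawing a conclusion.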

Judge-model selection

The rules of thumb that hold up:

Calibration against a human-rated set

This is the step everyone skips and then everyone regrets. The judge's score is not the ground truth — it's a proxy. The calibration cadence:

  1. Take 50–200 pairs from your eval set.
  2. Have 2–3 humans (you, a colleague, a contractor — anyone with the domain context) independently rate each pair against the same rubric the judge is using. Resolve disagreements; this becomes your human-rated gold-of-the-gold.
  3. Run the judge over the same pairs.
  4. Compute judge-vs-human agreement: Cohen's κ for categorical fields, Spearman ρ for ordinal ones. Anything below κ ≈ 0.4 / ρ ≈ 0.5 means the judge is not measuring what you think it's measuring; revise the rubric, swap the judge, or accept that this dimension needs human evaluation.
  5. Pin the human set. Re-run it whenever you change the rubric, the judge model, or the candidate population.
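The agreement computation in step 4 can be sketched in plain Python as below. The helper names are hypothetical and the sample labels are made up; in practice `sklearn.metrics.cohen_kappa_score` and `scipy.stats.spearmanr` compute the same statistics, and the hand-rolled versions here only keep the example dependency-free.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # chance agreement from each rater's marginal label frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

def _average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the tie-averaged ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = _average_ranks(xs), _average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Made-up calibration data: pairwise winners and one 1-5 dimension score.
human_winner = ["x", "x", "y", "tie", "y", "x", "y", "x", "tie", "y"]
judge_winner = ["x", "x", "y", "y", "y", "x", "x", "x", "tie", "y"]
kappa = cohen_kappa(human_winner, judge_winner)

human_score = [1, 2, 3, 4, 5]
judge_score = [2, 2, 3, 5, 5]
rho = spearman_rho(human_score, judge_score)

print(round(kappa, 3), round(rho, 3))
if kappa < 0.4 or rho < 0.5:
    print("judge disagrees with humans: revisit the rubric or the judge")
```

Real calibration sets should use the 50-200 pairs from step 1; the tiny lists here only show the mechanics.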

Honest beat — LLM judges agree with each other more than they agree with humans

Two judge models will often correlate 0.8+ with each other while correlating only 0.4–0.6 with human raters. Reporting "two judges agreed on the same winner" sounds rigorous and is often a vanity statistic, because both judges may share the same length, style, and position biases. The number that matters is judge-vs-human agreement on your calibration set. If you don't have that, you don't have an eval; you have an aesthetic.

Aggregation and reporting

Once judging is calibrated, the honest report has three numbers:

Key idea

LLM-as-a-judge is the practical answer to "how do I score free-form outputs at scale?", but it's a measurement device, not the truth. Prefer pairwise A/B with order-swapping, constrain the judge's output with a JSON-schema rubric, never use a judge from the same family as a candidate, and calibrate against humans before you trust the aggregate. A judge win-rate without a calibration agreement is half a number.

The next lesson covers public benchmarks: a different (and complementary) eval layer that catches catastrophic regressions a task-specific judge would never see.

Key terms

LLM-as-a-judge
The practice of using a strong language model to score the outputs of another model against a written rubric, used when exact-match metrics don't apply.
Rubric
The written criteria the judge scores against — dimensions (correctness, tone, concision, etc.) and a scale. The clearer the rubric, the more consistent the judge.
Pairwise judging (A/B)
Showing the judge two outputs and asking which is better. More reliable than absolute scoring because discriminative judgments compound; it introduces position bias, which order-swapping mitigates.
Position bias
The judge's preference for the first (or sometimes second) option presented, independent of content. Mitigated by running every pair in both orders and counting only agreed-on wins.
Length bias
The judge's tendency to prefer the longer answer, treating verbosity as effort. Mitigated by including concision in the rubric and checking length / win-rate correlation.
Style / family bias
The judge's tendency to prefer outputs from its own model family. Mitigated by judge selection: never reuse a candidate's family.
JSON-schema rubric
The pattern of requiring the judge to return a fixed-schema JSON object so per-dimension scores are machine-aggregatable. No free-text outside the schema.
Judge calibration
Comparing judge scores against a small human-rated set (50–200 pairs) and reporting the agreement (Cohen's κ, Spearman ρ). Required to interpret any aggregate judge metric.
Inter-judge agreement
Correlation between two judge models. Often higher than judge-vs-human agreement; reporting it instead of human calibration is a vanity metric.
