Evaluation that matches the task
After this lesson you can design an evaluation for your task: pick a shape-appropriate metric, run it on the gold set against a baseline, set a pass gate, and report results honestly.
Training produces a model; evaluation tells you whether it's any good. This capstone of Track 1 ties the whole loop together: a fine-tune isn't finished when the loss flattens — it's finished when a metric on the gold set clears a bar you set in advance.
Pick a metric that fits the shape
From the task-shapes lesson: the metric must match the shape, or the number is meaningless. Quick map:
- Classification → accuracy, and per-class precision / recall / F1 (plus a confusion matrix to see which classes get mistaken for one another).
- Extraction / NER → span-set precision/recall/F1 on exact (type, span) matches.
- QA → exact-match / F1 against references, or an LLM judge for open answers.
- Summarization → ROUGE for overlap plus a faithfulness check.
- Structured output → valid-format rate + per-field correctness.
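As one illustration of the map above, here is a minimal sketch of span-set scoring for extraction. It assumes each span is represented as a `(type, start, end)` tuple; that representation, the function name `span_set_f1`, and the toy data are hypothetical choices for this example, not something the lesson prescribes.

```python
def span_set_f1(gold_spans, pred_spans):
    """Precision, recall and F1 over exact (type, start, end) matches."""
    gold, pred = set(gold_spans), set(pred_spans)
    true_pos = len(gold & pred)  # a span counts only if type AND boundaries match
    precision = true_pos / len(pred) if pred else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data: the ORG span is off by one character, so it does not count.
gold = {("PERSON", 0, 5), ("ORG", 10, 14), ("DATE", 20, 30)}
pred = {("PERSON", 0, 5), ("ORG", 10, 15), ("DATE", 20, 30)}

span_p, span_r, span_f1 = span_set_f1(gold, pred)
print(f"precision={span_p:.3f} recall={span_r:.3f} f1={span_f1:.3f}")
```

Exact matching is deliberately strict: a nearly right boundary scores the same as a miss, which is why it pays to read the failures alongside the number.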
A word on precision vs recall, since they recur everywhere: precision is "of what the model flagged, how much was right"; recall is "of what it should have found, how much it caught." F1 is their harmonic mean — one number when you care about both. Which matters more is a product decision (a PII detector usually prizes recall; a spam filter, precision).
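The definitions above can be sketched in a few lines for a binary task. The helper name `precision_recall_f1` and the toy labels are assumptions made for illustration only.

```python
def precision_recall_f1(gold, pred, positive=1):
    """Compute precision, recall and F1 for one positive class."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of what was flagged, how much was right
    recall = tp / (tp + fn) if tp + fn else 0.0     # of what should be found, how much was caught
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean
    return precision, recall, f1


gold = [1, 1, 1, 1, 0, 0, 0, 0]
pred = [1, 1, 0, 0, 1, 0, 0, 0]  # 2 true positives, 1 false positive, 2 false negatives

precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Here precision is 2/3 and recall is 1/2, and F1 comes out at 4/7, slightly below their arithmetic mean: the harmonic mean is pulled toward the weaker of the two.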
Always compare to a baseline
A number alone is not evidence; the question is always "better than what?" Evaluate the base model (before fine-tuning) on the same gold set, and report the fine-tuned model's score next to it. The lift over baseline is the thing your training actually bought. Other useful baselines: a simple prompt on a larger model, or the previous deployed version.
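A side-by-side comparison might look like the sketch below. The predictions are hard-coded toy lists standing in for real model outputs, and accuracy is used only because the toy task is classification; substitute whatever metric fits your shape.

```python
def accuracy(gold, pred):
    """Fraction of gold examples predicted exactly right."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Toy gold set and hypothetical predictions from two models on the SAME examples.
gold = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg", "pos", "neg"]
base_pred = ["pos"] * 10  # base model that always answers "pos"
tuned_pred = ["pos", "neg", "pos", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]

base_score = accuracy(gold, base_pred)
tuned_score = accuracy(gold, tuned_pred)
lift = tuned_score - base_score

print(f"base={base_score:.2f}  fine-tuned={tuned_score:.2f}  lift={lift:+.2f}")
```

Reporting all three numbers together keeps the lift honest: a fine-tuned score of 0.80 means something quite different next to a baseline of 0.50 than next to a baseline of 0.78.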
Gates: decide the bar in advance
A quality gate is a threshold you commit to before seeing the result ("ship only if gold-set F1 ≥ 0.85 and no safety regressions"). Setting it in advance stops you from rationalizing a mediocre model after the fact. Gates turn evaluation from a vibe into a decision rule.
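One way to make the gate hard to renegotiate is to write it down as code before the evaluation runs. The sketch below encodes the example threshold from the text; the `Gate` class, its field names, and the idea of counting safety regressions as an integer are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: the bar cannot be edited after it is set
class Gate:
    min_f1: float = 0.85
    max_safety_regressions: int = 0


def passes(gate, f1, safety_regressions):
    """Return True only if every condition of the gate is met."""
    return f1 >= gate.min_f1 and safety_regressions <= gate.max_safety_regressions


GATE = Gate()  # committed before any results are seen

print(passes(GATE, f1=0.88, safety_regressions=0))  # clears the bar
print(passes(GATE, f1=0.84, safety_regressions=0))  # misses on F1
print(passes(GATE, f1=0.91, safety_regressions=1))  # misses on safety
```

A high F1 does not buy back a safety regression: the conditions are joined with `and`, so each one has to hold on its own.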
Honest reporting (the rule that matters most)
Report where the model loses, not just where it wins. Show the failing class, the slice with low recall, the examples it gets wrong. An evaluation that only surfaces good news is worse than none — it builds false confidence. The point of measuring is to find what to fix next.
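A simple habit that supports this rule is to sort the report worst-first, so the losing class is the first thing a reader sees. The sketch below assumes a classification task; the class names and predictions are made-up toy data.

```python
from collections import Counter


def per_class_recall(gold, pred):
    """Recall for each gold class: correct predictions / gold examples of that class."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, pred) if g == p)
    return {label: correct[label] / n for label, n in total.items()}


gold = ["billing"] * 4 + ["refund"] * 4 + ["other"] * 2
pred = ["billing", "billing", "billing", "billing",
        "refund", "billing", "billing", "other",
        "other", "other"]

# Sort ascending so the weakest class leads the report.
report = sorted(per_class_recall(gold, pred).items(), key=lambda kv: kv[1])
for label, class_recall in report:
    print(f"{label:<8} recall={class_recall:.2f}")
```

In this toy run overall accuracy is 0.70, which hides the fact that the "refund" class is caught only a quarter of the time; the per-class view makes that visible.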
Evaluate on the gold set, then iterate
Run the metric on the held-out gold set (never the training data), read the failures, and feed them back into the data-centric loop (Track 0, Lesson 9): the classes and slices where the model loses tell you exactly which examples to add or fix before the next run.
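To turn failures into a to-do list for the next data pass, it can help to group them by what the model confused with what. This is a minimal sketch; the example records, their `text` and `label` keys, and the function name are hypothetical.

```python
from collections import defaultdict


def failures_by_confusion(examples, predictions):
    """Group misclassified gold examples by (gold label, predicted label)."""
    buckets = defaultdict(list)
    for example, predicted in zip(examples, predictions):
        if predicted != example["label"]:
            buckets[(example["label"], predicted)].append(example["text"])
    return dict(buckets)


gold_set = [
    {"text": "I was charged twice", "label": "billing"},
    {"text": "Please send my money back", "label": "refund"},
    {"text": "Return the payment to my card", "label": "refund"},
    {"text": "How do I change my password?", "label": "other"},
]
predictions = ["billing", "billing", "billing", "other"]

failures = failures_by_confusion(gold_set, predictions)
# Largest bucket first: that confusion is the best candidate for new or fixed data.
for (gold_label, predicted), texts in sorted(failures.items(), key=lambda kv: -len(kv[1])):
    print(f"{gold_label} -> {predicted}: {len(texts)} example(s)")
    for text in texts:
        print(f"  - {text}")
```

The gold examples themselves stay held out; what goes back into training is new or corrected data of the same kind as the failures, not the gold items.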
That completes Track 1. You now understand the why and the what of every part of supervised fine-tuning: objectives, data, the loss mask and chat templates, the training loop and its loss, learning rate and batch, LoRA and memory, and how to read and evaluate the result. In Track 2 you'll do all of it by hand in PyTorch and Transformers, turning this understanding into a working fine-tune.
Key terms
- Evaluation metric: a shape-appropriate measure of quality (accuracy/F1, span-set F1, ROUGE + faithfulness, etc.).
- Precision / recall / F1: precision is the share of what the model flagged that was right; recall is the share of what it should have found that it caught; F1 is their harmonic mean.
- Baseline: a reference score (base model, prompt, or previous version) the fine-tune is compared against.
- Quality gate: a pass threshold set in advance that a model must clear to ship.
- Honest reporting: surfacing where the model loses, not only where it wins, to guide the next iteration.