Track 4 · Advanced · Lesson 4

Knowledge distillation III: did it work? Quality retained

After this lesson you can interpret the quality-retained ratio, weigh it against the student's size and speed advantage, and decide whether a distilled model is worth shipping over the teacher or a plain SFT student.

Level: advanced · Read time: ~9 min · Prerequisites: Knowledge distillation II: the KD loss

Distillation only matters if the student actually inherits the teacher's skill. So the evaluation isn't "is the student good?" in isolation — it's "how much of the teacher did the student keep, and is that enough given how much smaller it is?"

The quality-retained ratio

BrewSLM's student–teacher comparison runs the same evaluation on both models and reports:

quality_retained = student_score / teacher_score

# example, on the same gold set:
#   teacher (7B)   accuracy = 0.94
#   student (135M) accuracy = 0.91
#   quality_retained = 0.91 / 0.94 = 0.97   →  97% of the teacher, at ~2% of the size

A ratio near 1.0 means the student kept almost all of the teacher's quality. The decision is then simple economics: 97% of the quality at a fraction of the size, memory, and latency is usually an easy yes.
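The arithmetic above can be sketched in a few lines. This is a minimal illustration, assuming accuracy is the metric and both models were scored on the same gold set. The function and variable names are hypothetical and are not BrewSLM's API.

```python
def quality_retained(student_score: float, teacher_score: float) -> float:
    """Fraction of the teacher's score that the student kept."""
    if teacher_score <= 0:
        raise ValueError("teacher_score must be positive")
    return student_score / teacher_score


# Numbers from the example above (same gold set for both models).
teacher_acc, student_acc = 0.94, 0.91
teacher_params, student_params = 7_000_000_000, 135_000_000

retained = quality_retained(student_acc, teacher_acc)
size_fraction = student_params / teacher_params

print(f"quality retained: {retained:.3f}")       # 0.968
print(f"size fraction:    {size_fraction:.3f}")  # 0.019
```

Keeping the two numbers side by side makes the tradeoff visible: roughly 97% of the quality at roughly 2% of the parameters.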

From Track 3

This is the delta-from-baseline idea again, but the baseline is the teacher. Where Track 2/3 asked "did fine-tuning beat the base model?", distillation asks "did the student keep up with the teacher?" Same discipline: always measure against the thing you're trying to match.

What counts as 'enough'?

There's no universal threshold. It depends on the constraint that sent you to distillation in the first place, such as a memory budget, a latency target, or a limit on model size.

Distill, or just SFT the small model?

The honest comparison is three-way: the teacher, a plain-SFT student, and a distilled student. Distillation earns its complexity only if the distilled student beats the plain-SFT student of the same size. If SFT alone gets you the same quality, you don't need the teacher. The comparison surface makes that explicit so you don't add machinery for nothing.
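One way to make the three-way comparison concrete is sketched below. It assumes all three scores come from the same eval and that higher is better. The function name, the `min_gain` margin, and the example scores are hypothetical, chosen for illustration rather than taken from BrewSLM or from measured results.

```python
def distillation_verdict(teacher: float, sft_student: float,
                         distilled_student: float,
                         min_gain: float = 0.0) -> dict:
    """Compare a distilled student with a plain-SFT student of the same size.

    min_gain is an assumed margin (in score units) that the distilled
    student must win by before the extra machinery counts as justified.
    """
    if teacher <= 0:
        raise ValueError("teacher score must be positive")
    gain = distilled_student - sft_student
    return {
        "sft_retained": sft_student / teacher,
        "distilled_retained": distilled_student / teacher,
        "gain_over_sft": gain,
        "distillation_worth_it": gain > min_gain,
    }


# Illustrative scores only, not measured results.
verdict = distillation_verdict(teacher=0.94, sft_student=0.88,
                               distilled_student=0.91)
print(verdict["distillation_worth_it"])  # True: distilled beats plain SFT

tie = distillation_verdict(teacher=0.94, sft_student=0.91,
                           distilled_student=0.91)
print(tie["distillation_worth_it"])      # False: SFT alone matched it
```

A strict inequality is used on purpose: when plain SFT ties the distilled student, the simpler pipeline wins.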

Honest metrics

Report the ratio that the eval actually produced, including where the student falls short of the teacher. A distilled model that retains 0.85 on the cases you care about is a 0.85 model — round it up and you'll be surprised in production.
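A per-slice breakdown is one way to surface where the student falls short. The sketch below assumes you already have per-slice scores for both models from the same eval. The slice names and numbers are invented for illustration and do not come from a real run.

```python
def retained_by_slice(student: dict, teacher: dict) -> dict:
    """Quality retained for every slice that was scored on both models."""
    shared = student.keys() & teacher.keys()
    return {
        name: student[name] / teacher[name]
        for name in sorted(shared)
        if teacher[name] > 0  # skip slices where the ratio is undefined
    }


# Illustrative per-slice accuracies, not measured results.
teacher_scores = {"overall": 0.94, "short_inputs": 0.96, "long_inputs": 0.90}
student_scores = {"overall": 0.91, "short_inputs": 0.95, "long_inputs": 0.765}

report = retained_by_slice(student_scores, teacher_scores)
for name, ratio in report.items():
    # Print the ratio as measured; do not round it up.
    print(f"{name:<14} {ratio:.3f}")

weakest = min(report, key=report.get)
print(f"weakest slice: {weakest}")  # long_inputs
```

In this made-up example the overall ratio looks healthy while `long_inputs` retains only 0.85, which is the number to report if long inputs are the cases you care about.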

That completes distillation: capture the teacher (4.2), train with the KD loss (4.3), and measure quality retained (4.4). Next we change the objective entirely — aligning a model to human preferences.

Key idea

quality_retained = student / teacher on the same eval is the verdict on a distillation. Near-1.0 at a fraction of the size is the win. Judge 'enough' by your constraint, and only prefer distillation over plain SFT when the distilled student actually beats the same-size SFT student.

Key terms

quality retained: student_score / teacher_score on the same evaluation; the fraction of the teacher's quality the student kept.
student–teacher comparison: Running the same eval on both models to compute quality retained.
size-quality tradeoff: Weighing the student's smaller size / lower latency against any quality lost.
distill-vs-SFT check: Confirming the distilled student beats a plain-SFT student of the same size before accepting the extra complexity.
