Knowledge distillation III: did it work? Quality retained
After this lesson you can interpret the quality-retained ratio, weigh it against the student's size and speed advantage, and decide whether a distilled model is worth shipping over the teacher or a plain SFT student.
Distillation only matters if the student actually inherits the teacher's skill. So the evaluation isn't "is the student good?" in isolation — it's "how much of the teacher did the student keep, and is that enough given how much smaller it is?"
The quality-retained ratio
BrewSLM's student–teacher comparison runs the same evaluation on both models and reports:
quality_retained = student_score / teacher_score
# example, on the same gold set:
# teacher (7B) accuracy = 0.94
# student (135M) accuracy = 0.91
# quality_retained = 0.91 / 0.94 = 0.97 → 97% of the teacher, at ~2% of the size
A ratio near 1.0 means the student kept almost all of the teacher's quality. The decision is then simple economics: 97% of the quality at a fraction of the size, memory, and latency is usually an easy yes.
From Track 3
This is the delta-from-baseline idea again, but the baseline is the teacher. Where Track 2/3 asked "did fine-tuning beat the base model?", distillation asks "did the student keep up with the teacher?" Same discipline: always measure against the thing you're trying to match.
What counts as 'enough'?
There's no universal threshold — it depends on the constraint that sent you to distillation:
- Tight latency / edge device: you might accept 0.90 retained for a 50× smaller model that runs on a phone.
- Quality-critical: you might need 0.98+ before replacing the teacher, and otherwise keep the teacher for the hard cases.
- Cost at scale: even 0.95 retained can save enormous serving cost across millions of calls.
Distill, or just SFT the small model?
The honest comparison is three-way: the teacher, a plain-SFT student, and a distilled student. Distillation earns its complexity only if the distilled student beats the plain-SFT student of the same size. If SFT alone gets you the same quality, you don't need the teacher. The comparison surface makes that explicit so you don't add machinery for nothing.
Honest metrics
Report the ratio that the eval actually produced, including where the student falls short of the teacher. A distilled model that retains 0.85 on the cases you care about is a 0.85 model — round it up and you'll be surprised in production.
That completes distillation: capture the teacher (4.2), train with the KD loss (4.3), and measure quality retained (4.4). Next we change the objective entirely — aligning a model to human preferences.
Key idea
quality_retained = student / teacher on the same eval is the verdict on a distillation. Near-1.0 at a fraction of the size is the win. Judge 'enough' by your constraint, and only prefer distillation over plain SFT when the distilled student actually beats the same-size SFT student.
Key terms
- quality retained
- student_score / teacher_score on the same evaluation — the fraction of the teacher's quality the student kept.
- student–teacher comparison
- Running the same eval on both models to compute quality retained.
- size-quality tradeoff
- Weighing the student's smaller size / lower latency against any quality lost.
- distill-vs-SFT check
- Confirming the distilled student beats a plain-SFT student of the same size before accepting the extra complexity.
Check yourself
Answers are saved to this browser.