Knowledge distillation I: the teacher and capturing its logits
After this lesson you can explain why a teacher's soft probabilities carry more signal than hard labels, why BrewSLM captures distillation offline as top-k logprobs, and the same-tokenizer constraint that makes it work.
You can train a 135M model to do a task. But what if a 7B model does it noticeably better, and you still need the small one's speed and cost? Knowledge distillation narrows that gap: the small student is trained to imitate the big teacher's outputs, which transfers more than a label file ever could.
Why mimic, instead of just using more labels?
A hard label says "this review is positive": a single choice, with nothing about the alternatives. The teacher's full output is a probability distribution, here over three classes (for a language model, over the whole vocabulary at each position): "92% positive, 7% neutral, 1% negative." That distribution encodes the teacher's uncertainty and its sense of how classes relate, the "dark knowledge" a one-hot label throws away. Training the student to match the distribution (the soft targets) transfers more signal per example than the label alone.
From Track 1
Recall that the model's final layer produces logits, and softmax turns them into probabilities. Distillation supervises the student on the teacher's softmaxed logits, not just the gold token — the same machinery, a richer target.
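To make the contrast concrete, here is a minimal sketch of a soft target next to a hard label. The three logits are hypothetical values chosen to land near the 92% / 7% / 1% example above; they are not output from any real teacher.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities (subtracting the max for stability)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical teacher logits for: positive, neutral, negative.
teacher_logits = [4.0, 1.4, -0.5]

soft_target = softmax(teacher_logits)  # roughly [0.92, 0.07, 0.01]
hard_target = [1.0, 0.0, 0.0]          # one-hot gold label

# The one-hot label treats both wrong classes identically (both 0.0).
# The soft target ranks them: neutral is far more plausible than negative.
print([round(p, 2) for p in soft_target])
print(entropy(hard_target), round(entropy(soft_target), 3))
```

The one-hot label has zero entropy, so it says nothing about the classes it rejects. The soft target keeps that ordering, which is the extra signal the student is trained on.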
Offline distillation: run the teacher once
The teacher is big and slow. Running it on every training step (online distillation) is expensive. BrewSLM does offline distillation: run the teacher across your dataset once, save what it produced, then train the student against that saved file as many times as you like. The capture step is a background Job:
POST /api/projects/<id>/distillation/capture → starts a background Job
# writes one row per example to:
# data/projects/<id>/distillation/teacher_capture.jsonl
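The capture-once, train-many pattern can be sketched as follows. This is an illustration only, not BrewSLM's implementation: `fake_teacher` is a hypothetical stand-in for the expensive forward pass, the `teacher_output` field is a placeholder, and the file is written to a temporary directory rather than the project path.

```python
import json
import os
import tempfile

teacher_calls = 0

def fake_teacher(input_ids):
    """Hypothetical stand-in for the slow teacher forward pass."""
    global teacher_calls
    teacher_calls += 1
    return {"input_ids": input_ids, "teacher_output": [len(input_ids)]}

def capture(dataset, path):
    """Run the teacher once per example and write one JSON row per line."""
    with open(path, "w", encoding="utf-8") as f:
        for input_ids in dataset:
            f.write(json.dumps(fake_teacher(input_ids)) + "\n")

def read_capture(path):
    """Stream saved rows back without touching the teacher."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

dataset = [[1, 4521, 318], [1, 77, 12, 9]]
path = os.path.join(tempfile.mkdtemp(), "teacher_capture.jsonl")
capture(dataset, path)

rows_seen = 0
for epoch in range(3):
    for row in read_capture(path):
        rows_seen += 1  # a real student training step would go here

print(teacher_calls, rows_seen)
```

The teacher's cost scales with the dataset size, not with the number of epochs or the number of student runs you try.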
Capture top-k logprobs, not the whole vocabulary
Saving the teacher's full distribution would mean a probability for every one of ~50,000 vocabulary tokens, per position — enormous. Almost all of that mass sits on a handful of tokens, so the capture keeps only the top-k per position: the k most likely token ids and their log-probabilities.
{ "input_ids": [1, 4521, 318, ...],
"topk_ids": [[2762, 318, 257], ...], # k most likely tokens / position
"topk_logprobs": [[-0.08, -2.9, -4.1], ...] # teacher log-probs for those k
}
Top-k captures the meaningful part of the distribution at a fraction of the size, and the student is trained to match the teacher on those k tokens.
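As a rough sketch of how such a row could be produced, the snippet below takes per-position logits, converts them to log-probabilities, and keeps the k largest. The six-token vocabulary and the logit values are made up for illustration, and the helper names are hypothetical; only the field names follow the row format shown above.

```python
import json
import math

def log_softmax(logits):
    """Log-probabilities from logits, using the log-sum-exp trick."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]

def topk_logprobs(logits, k):
    """Return the k most likely token ids and their log-probabilities."""
    logprobs = log_softmax(logits)
    ranked = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)[:k]
    return ranked, [round(logprobs[i], 4) for i in ranked]

# Hypothetical teacher logits: 2 positions over a toy 6-token vocabulary.
position_logits = [
    [0.1, 5.0, 2.0, -1.0, 0.3, 1.0],
    [3.0, 0.0, 0.2, 4.5, -2.0, 0.1],
]
k = 3
ids, lps = zip(*(topk_logprobs(logits, k) for logits in position_logits))

row = {"input_ids": [1, 3], "topk_ids": list(ids), "topk_logprobs": list(lps)}
line = json.dumps(row)  # one line of the JSONL capture file

# Probability mass retained by the top-k entries at each position.
kept_mass = [sum(math.exp(lp) for lp in pos) for pos in row["topk_logprobs"]]
print(line)
print([round(m, 3) for m in kept_mass])
```

In this toy case half the vocabulary entries are dropped while most of the probability mass survives. With a real vocabulary of tens of thousands of tokens and a small k, the storage saving is far larger.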
The same-tokenizer assumption
Offline KD here assumes teacher and student share a tokenizer. The captured topk_ids are indices into the teacher's vocabulary; if the student's vocabulary differs, the same ids refer to different tokens and the soft targets are gibberish. Pick a teacher and student from the same family (same tokenizer), e.g. a larger and a smaller model of one series.
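A small sketch shows why mismatched vocabularies break the capture. The three-token vocabularies below are invented, and `same_vocabulary` and `decode` are hypothetical helpers. With real models you would compare the actual token-to-id mappings instead (Hugging Face tokenizers, for instance, expose one through `get_vocab()`).

```python
def same_vocabulary(teacher_vocab, student_vocab):
    """True only if every token maps to the same id in both vocabularies."""
    return teacher_vocab == student_vocab

def decode(ids, vocab):
    """Map token ids back to token strings under a given vocabulary."""
    id_to_token = {i: t for t, i in vocab.items()}
    return [id_to_token.get(i, "<unk>") for i in ids]

# Toy vocabularies (token -> id), purely illustrative.
teacher_vocab = {"good": 0, "bad": 1, "fine": 2}
same_family = {"good": 0, "bad": 1, "fine": 2}
other_family = {"bad": 0, "fine": 1, "good": 2}

captured_topk_ids = [0, 2]  # the teacher meant "good" and "fine"

print(same_vocabulary(teacher_vocab, same_family))   # ids line up
print(same_vocabulary(teacher_vocab, other_family))  # ids do not line up
print(decode(captured_topk_ids, teacher_vocab))
print(decode(captured_topk_ids, other_family))
```

Under the mismatched vocabulary the same ids decode to "bad" and "good", so the student would be pushed toward tokens the teacher never favored. Running a check like this before capture is cheaper than discovering the mismatch after training.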
With the teacher's knowledge frozen into teacher_capture.jsonl, the student never needs the teacher again. The next lesson is the loss that actually transfers that knowledge — the KD objective.
Key idea
Distillation trains a small student on the teacher's soft targets — the full probability distribution, which carries 'dark knowledge' hard labels lack. BrewSLM captures it offline as top-k logprobs so the costly teacher runs once, under a same-tokenizer assumption so the token ids line up.
Key terms
- knowledge distillation: Training a small student model to reproduce a larger teacher model's outputs.
- teacher / student: The large source model and the small model being trained to mimic it.
- soft targets: The teacher's full probability distribution over tokens, a richer signal than a hard label.
- dark knowledge: The relational information in a soft distribution (which wrong answers are plausible) that one-hot labels discard.
- top-k logprobs: The k most likely tokens and their log-probabilities per position; a compact capture of the teacher's distribution.
- offline distillation: Running the teacher once to a capture file, then training the student against it repeatedly.
- same-tokenizer assumption: Teacher and student must share a vocabulary so captured token ids mean the same thing.