Track 4 · Advanced · Lesson 2

Knowledge distillation I: the teacher and capturing its logits

After this lesson you can explain why a teacher's soft probabilities carry more signal than hard labels, why BrewSLM captures distillation offline as top-k logprobs, and the same-tokenizer constraint that makes it work.

Level: advanced · Read time: ~11 min · Prerequisites: Beyond a single SFT run: the advanced toolkit

You can train a 135M model to do a task. But what if a 7B model does it noticeably better, and you still need the small one's speed and cost? Knowledge distillation closes that gap: the small student is trained to imitate the big teacher, which transfers more than a label file ever could.

Why mimic, instead of just using more labels?

A hard label says "this review is positive": a single class and nothing more. The teacher's full output is a probability distribution, shown here over three sentiment classes for simplicity: "92% positive, 7% neutral, 1% negative." (For a language model, the distribution is over the vocabulary at each position.) That distribution encodes the teacher's uncertainty and its sense of how classes relate, the "dark knowledge" a one-hot label throws away. Training the student to match the distribution (its soft targets) transfers more signal per example than the label alone.

From Track 1

Recall that the model's final layer produces logits, and softmax turns them into probabilities. Distillation supervises the student on the teacher's softmaxed logits, not just the gold token — the same machinery, a richer target.
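
To make the contrast concrete, here is a minimal sketch of a one-hot label next to a softmaxed distribution. The logit values and class names are invented for illustration and do not come from any real teacher.

```python
import math

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable form)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for three classes (illustrative values only).
classes = ["positive", "neutral", "negative"]
teacher_logits = [4.0, 1.4, -0.5]

# Soft targets: the full distribution the student is trained to match.
soft_targets = softmax(teacher_logits)

# Hard label: one-hot on the gold class, nothing about the alternatives.
hard_label = [1.0 if c == "positive" else 0.0 for c in classes]

for name, soft, hard in zip(classes, soft_targets, hard_label):
    print(f"{name:>8}: soft={soft:.3f}  hard={hard:.0f}")
```

The hard label is silent about "neutral" versus "negative"; the soft targets rank them, and that ranking is the extra signal the student sees.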

Offline distillation: run the teacher once

The teacher is big and slow. Running it on every training step (online distillation) is expensive. BrewSLM does offline distillation: run the teacher across your dataset once, save what it produced, then train the student against that saved file as many times as you like. The capture step is a background Job:

POST /api/projects/<id>/distillation/capture     → starts a background Job
# writes one row per example to:
#   data/projects/<id>/distillation/teacher_capture.jsonl

Capture top-k logprobs, not the whole vocabulary

Saving the teacher's full distribution would mean a probability for every one of ~50,000 vocabulary tokens at every position, which is enormous. Most of that mass typically sits on a handful of tokens, so the capture keeps only the top-k per position: the k most likely token ids and their log-probabilities.

{ "input_ids":      [1, 4521, 318, ...],
  "topk_ids":       [[2762, 318, 257], ...],   # k most likely tokens / position
  "topk_logprobs":  [[-0.08, -2.9, -4.1], ...] # teacher log-probs for those k
}

Top-k captures the meaningful part of the distribution at a fraction of the size, and the student is trained to match the teacher on those k tokens.
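
As a rough illustration of what "top-k logprobs" means for a single position, the sketch below takes one logit vector, applies log-softmax, and keeps the k largest entries. It is not BrewSLM's capture code: the function names and the 6-token toy vocabulary are assumptions made for this example.

```python
import math

def log_softmax(logits):
    """Log-probabilities from logits, using the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def topk_logprobs(logits, k):
    """Return (ids, logprobs) for the k most likely tokens at one position."""
    logprobs = log_softmax(logits)
    order = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)
    ids = order[:k]
    return ids, [logprobs[i] for i in ids]

# Toy 6-token vocabulary standing in for the ~50,000-token one.
position_logits = [0.1, 5.0, -1.0, 2.2, 0.3, 1.0]

ids, lps = topk_logprobs(position_logits, k=3)

# Probability mass retained by the k kept tokens.
covered_mass = sum(math.exp(lp) for lp in lps)

print("topk_ids:", ids)
print("topk_logprobs:", [round(lp, 2) for lp in lps])
print("mass covered:", round(covered_mass, 3))
```

In this toy case three of six tokens hold most of the probability mass, which is the property the capture relies on; how much mass a given k retains on real data depends on the teacher and the text.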

The same-tokenizer assumption

Offline KD here assumes teacher and student share a tokenizer. The captured topk_ids are indices into the teacher's vocabulary; if the student's vocabulary differs, the same ids refer to different tokens and the soft targets are gibberish. Pick a teacher and student from the same family (same tokenizer), e.g. a larger and a smaller model of one series.
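
The failure mode can be shown with toy data. In the sketch below, the three vocabularies are hypothetical id-to-token dictionaries rather than real tokenizers, and mismatched_ids is a helper invented for this example.

```python
# Hypothetical toy vocabularies (id -> token); not real tokenizers.
teacher_vocab = {0: "<pad>", 1: "the", 2: "coffee", 3: "tea"}
student_same = {0: "<pad>", 1: "the", 2: "coffee", 3: "tea"}
student_other = {0: "<pad>", 1: "a", 2: "tea", 3: "coffee"}

def mismatched_ids(teacher, student):
    """Return the ids whose token differs, or is missing, between vocabularies."""
    all_ids = set(teacher) | set(student)
    return sorted(i for i in all_ids if teacher.get(i) != student.get(i))

# Ids the teacher captured, meaning "coffee" then "tea" in its own vocabulary.
captured_topk_ids = [2, 3]

print("same tokenizer: ", [student_same[i] for i in captured_topk_ids])
print("other tokenizer:", [student_other[i] for i in captured_topk_ids])
print("mismatched ids: ", mismatched_ids(teacher_vocab, student_other))
```

With the mismatched vocabulary, the student would be pushed toward "tea" where the teacher meant "coffee". A check along these lines before training is cheap insurance.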

With the teacher's knowledge frozen into teacher_capture.jsonl, the student never needs the teacher again. The next lesson is the loss that actually transfers that knowledge — the KD objective.
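
Reading the capture back might look like the sketch below. It assumes the one-row-per-line JSONL layout and the field names from the row shown earlier. The helper names and sample values are made up, and the file is written to a temporary directory so the example runs without a real project. The lesson does not state whether topk_ids has one entry per input position, so the check only compares topk_ids against topk_logprobs.

```python
import json
import tempfile
from pathlib import Path

def load_capture(path):
    """Yield one dict per non-empty line of a JSONL capture file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def check_row(row):
    """True if every position has as many logprobs as token ids."""
    ids, lps = row["topk_ids"], row["topk_logprobs"]
    if len(ids) != len(lps):
        return False
    return all(len(i) == len(l) for i, l in zip(ids, lps))

# Stand-in file with one made-up row; a real run would read
# data/projects/<id>/distillation/teacher_capture.jsonl instead.
sample = {
    "input_ids": [1, 4521, 318],
    "topk_ids": [[2762, 318, 257], [318, 257, 11], [257, 262, 11]],
    "topk_logprobs": [[-0.08, -2.9, -4.1], [-0.2, -1.9, -3.5], [-0.5, -1.2, -2.8]],
}
capture_path = Path(tempfile.mkdtemp()) / "teacher_capture.jsonl"
capture_path.write_text(json.dumps(sample) + "\n", encoding="utf-8")

rows = list(load_capture(capture_path))
print("rows:", len(rows), "well-formed:", all(check_row(r) for r in rows))
```

Because the file is plain JSONL, the student's training loop can stream it repeatedly without loading the teacher.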

Key idea

Distillation trains a small student on the teacher's soft targets — the full probability distribution, which carries 'dark knowledge' hard labels lack. BrewSLM captures it offline as top-k logprobs so the costly teacher runs once, under a same-tokenizer assumption so the token ids line up.

Key terms

knowledge distillation
Training a small student model to reproduce a larger teacher model's outputs.
teacher / student
The large source model and the small model being trained to mimic it.
soft targets
The teacher's full probability distribution over tokens — a richer signal than a hard label.
dark knowledge
The relational information in a soft distribution (which wrong answers are plausible) that one-hot labels discard.
top-k logprobs
The k most likely tokens and their log-probabilities per position; a compact capture of the teacher's distribution.
offline distillation
Running the teacher once to a capture file, then training the student against it repeatedly.
same-tokenizer assumption
Teacher and student must share a vocabulary so captured token ids mean the same thing.
