Why are soft targets richer than hard labels?

They encode the teacher's uncertainty and how classes relate

Why does BrewSLM capture distillation offline?

So the expensive teacher runs once, then the student trains against the saved file repeatedly

Why store only top-k logprobs instead of the full distribution?

Almost all probability mass is on a few tokens, so top-k captures the signal at a fraction of the size

Why must teacher and student share a tokenizer in this offline KD setup?

So captured token ids refer to the same tokens for both models

Track 4 · Advanced · Lesson 2

Knowledge distillation I: the teacher and capturing its logits

After this lesson you can explain why a teacher's soft probabilities carry more signal than hard labels, why BrewSLM captures distillation offline as top-k logprobs, and the same-tokenizer constraint that makes it work.

Level: advanced
Read time: ~11 min
Prerequisites: Beyond a single SFT run: the advanced toolkit

You can train a 135M model to do a task. But what if a 7B model does it noticeably better, and you still need the small one's speed and cost? Knowledge distillation narrows that gap: the small student is trained to imitate the big teacher, which transfers more than a label file ever could.

Why mimic, instead of just using more labels?

A hard label says "this review is positive": one class and nothing else. The teacher's output is a probability distribution over the possible outputs (classes in this example, vocabulary tokens for a language model): "92% positive, 7% neutral, 1% negative." That distribution encodes the teacher's uncertainty and its sense of how classes relate, the "dark knowledge" a one-hot label throws away. Training the student to match the distribution (the soft targets) transfers far more signal per example than the label alone.
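To make the difference concrete, the sketch below compares the hard label with the teacher distribution from the review example. The helper names `entropy_bits` and `ranking` are illustrative and not part of BrewSLM.

```python
import math

# Teacher distribution from the review example (illustrative numbers).
soft_target = {"positive": 0.92, "neutral": 0.07, "negative": 0.01}
# The hard label keeps only the winning class.
hard_label = {"positive": 1.0, "neutral": 0.0, "negative": 0.0}

def entropy_bits(dist):
    """Shannon entropy in bits; zero-probability classes contribute nothing."""
    return sum(-p * math.log2(p) for p in dist.values() if p > 0)

def ranking(dist):
    """Classes ordered from most to least likely."""
    return sorted(dist, key=dist.get, reverse=True)

# A one-hot label has zero entropy: it records no uncertainty at all.
print(entropy_bits(hard_label))
# The soft target has positive entropy: the teacher's uncertainty is kept.
print(entropy_bits(soft_target))
# The soft target also orders the wrong answers (neutral before negative).
print(ranking(soft_target))
```

The ordering of the non-winning classes is exactly what the one-hot label cannot express.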

From Track 1

Recall that the model's final layer produces logits, and softmax turns them into probabilities. Distillation supervises the student on the teacher's softmaxed logits, not just the gold token — the same machinery, a richer target.
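As a reminder of that machinery, here is a minimal sketch in plain Python. The four-token vocabulary and the logit values are made up for illustration; a real model produces one such vector per position over its whole vocabulary.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

# Hypothetical teacher logits over a toy 4-token vocabulary.
teacher_logits = [4.0, 1.5, 0.2, -1.0]
teacher_logprobs = log_softmax(teacher_logits)
teacher_probs = [math.exp(lp) for lp in teacher_logprobs]

# Hard target: all mass on the gold token (assumed here to be token 0).
gold_token = 0
hard_target = [1.0 if i == gold_token else 0.0 for i in range(len(teacher_logits))]

# Soft target: the teacher's probabilities, which also rank the other tokens.
soft_target = teacher_probs
```

Both targets have the same shape, so the student can be supervised on either one; only the soft target says anything about tokens other than the gold one.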

Offline distillation: run the teacher once

The teacher is big and slow. Running it on every training step (online distillation) is expensive. BrewSLM does offline distillation: run the teacher across your dataset once, save what it produced, then train the student against that saved file as many times as you like. The capture step is a background Job:

POST /api/projects/<id>/distillation/capture     → starts a background Job
# writes one row per example to:
#   data/projects/<id>/distillation/teacher_capture.jsonl
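Once the Job has finished, the capture file can be read back one row at a time. The sketch below assumes only that the file is standard JSONL (one JSON object per line); `iter_capture_rows` and `count_rows` are hypothetical helper names, not BrewSLM functions.

```python
import json
from pathlib import Path

def iter_capture_rows(path):
    """Yield one parsed dict per non-empty line of a JSONL capture file."""
    with Path(path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)

def count_rows(path):
    """Number of captured examples in the file."""
    return sum(1 for _ in iter_capture_rows(path))

# Example usage (substitute your own project id for <id>):
# for row in iter_capture_rows("data/projects/<id>/distillation/teacher_capture.jsonl"):
#     print(sorted(row.keys()))
```

Reading lazily keeps memory use flat, which matters because the student may pass over the same file many times.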

Capture top-k logprobs, not the whole vocabulary

Saving the teacher's full distribution would mean storing a probability for every one of ~50,000 vocabulary tokens at every position, which is enormous. Almost all of the probability mass sits on a handful of tokens, so the capture keeps only the top-k per position: the k most likely token ids and their log-probabilities.

{ "input_ids":      [1, 4521, 318, ...],
  "topk_ids":       [[2762, 318, 257], ...],   # k most likely tokens / position
  "topk_logprobs":  [[-0.08, -2.9, -4.1], ...] # teacher log-probs for those k
}

Top-k captures the meaningful part of the distribution at a fraction of the size, and the student is trained to match the teacher on those k tokens.
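The sketch below shows the idea for a single position, assuming a toy 8-token vocabulary with hypothetical logits. `capture_topk` is an illustrative helper, not BrewSLM's capture code, and k=3 is chosen only to mirror the row shown above.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def capture_topk(logits, k):
    """Return (ids, logprobs) of the k most likely tokens at one position."""
    logprobs = log_softmax(logits)
    ids = sorted(range(len(logprobs)), key=lambda i: logprobs[i], reverse=True)[:k]
    return ids, [logprobs[i] for i in ids]

# Hypothetical, sharply peaked logits for one position.
logits = [6.0, 3.0, 2.0, 0.0, -1.0, -1.0, -2.0, -3.0]
k = 3
topk_ids, topk_logprobs = capture_topk(logits, k)

# Fraction of the teacher's probability mass that the k kept tokens cover.
covered_mass = sum(math.exp(lp) for lp in topk_logprobs)

# Stored values per position: k ids plus k logprobs, versus one value per token.
values_topk = 2 * k
values_full = len(logits)
print(topk_ids, covered_mass, values_topk, values_full)
```

On this toy input the three kept tokens cover more than 99% of the mass. With a vocabulary of ~50,000 tokens the same k=3 capture stores 6 values per position instead of ~50,000.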

The same-tokenizer assumption

Offline KD here assumes teacher and student share a tokenizer. The captured topk_ids are indices into the teacher's vocabulary; if the student's vocabulary differs, the same ids refer to different tokens and the soft targets are meaningless. Pick a teacher and student from the same family (same tokenizer), e.g. a larger and a smaller model of one series.
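A quick sanity check before capturing is to compare the two vocabularies id by id. The sketch below uses tiny hand-written id-to-token maps as stand-ins; how you obtain the real maps depends on your tokenizer library, and `vocab_mismatches` is a hypothetical helper.

```python
def vocab_mismatches(teacher_vocab, student_vocab):
    """Return the token ids that map to different tokens in the two vocabularies."""
    all_ids = set(teacher_vocab) | set(student_vocab)
    return sorted(i for i in all_ids if teacher_vocab.get(i) != student_vocab.get(i))

# Hypothetical id -> token maps, far smaller than any real vocabulary.
teacher_vocab = {0: "<pad>", 1: "good", 2: "bad", 3: "fine"}
same_family = {0: "<pad>", 1: "good", 2: "bad", 3: "fine"}
other_family = {0: "<pad>", 1: "bad", 2: "good", 3: "fine"}

# Same tokenizer: every captured id means the same token for the student.
print(vocab_mismatches(teacher_vocab, same_family))
# Different tokenizer: ids 1 and 2 are swapped, so a captured "good" would
# be read by the student as "bad".
print(vocab_mismatches(teacher_vocab, other_family))
```

An empty result is the condition the same-tokenizer assumption relies on; any non-empty result means the capture file should not be used with that student.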

With the teacher's knowledge frozen into teacher_capture.jsonl, the student never needs the teacher again. The next lesson is the loss that actually transfers that knowledge — the KD objective.

Key idea

Distillation trains a small student on the teacher's soft targets: its probability distribution over tokens, which carries "dark knowledge" that hard labels lack. BrewSLM captures them offline as top-k logprobs so the costly teacher runs once, and assumes a shared tokenizer so the captured token ids line up.

Key terms

knowledge distillation: Training a small student model to reproduce a larger teacher model's outputs.
teacher / student: The large source model and the small model being trained to mimic it.
soft targets: The teacher's full probability distribution over tokens — a richer signal than a hard label.
dark knowledge: The relational information in a soft distribution (which wrong answers are plausible) that one-hot labels discard.
top-k logprobs: The k most likely tokens and their log-probabilities per position; a compact capture of the teacher's distribution.
offline distillation: Running the teacher once to a capture file, then training the student against it repeatedly.
same-tokenizer assumption: Teacher and student must share a vocabulary so captured token ids mean the same thing.

Check yourself
