Track 4 · Advanced

Advanced small language model training

Go beyond a single SFT run with the techniques practitioners reach for next: distillation, DPO/ORPO preference tuning, quantization, multi-task training, serving, observability, production feedback loops, and the trade-offs that decide whether each one is worth it.

Track overview · video

A walkthrough of the whole track. The lessons below go deeper.

1. Beyond a single SFT run: the advanced toolkit
You can fine-tune a model and ship it. This final track covers the techniques for when one LoRA SFT run isn't enough: distillation to shrink a model, preference tuning to align it, quantization to compress it, multi-task and curriculum training to broaden it, and serving plus observability to run it in production.
2. Knowledge distillation I: the teacher and capturing its logits
Knowledge distillation trains a small student to mimic a larger teacher. This lesson covers why distillation works, the offline workflow BrewSLM uses, the same-tokenizer assumption, and how the teacher's top-k logprobs are captured to teacher_capture.jsonl so the expensive teacher runs only once.
3. Knowledge distillation II: the KD loss
The distillation objective blends two losses: hard-label cross-entropy and a temperature-softened KL divergence to the teacher, as alpha·CE + (1−alpha)·T²·KL. This lesson explains temperature, what each term contributes, and shows the loss in code.
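As a rough illustration of the blend described above, here is a minimal single-token sketch in plain Python. It assumes full logit vectors for both models and takes the KL term as KL(teacher ‖ student); BrewSLM's actual implementation works from captured top-k logprobs and may differ. The function names and default values are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-softened softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """alpha * CE + (1 - alpha) * T^2 * KL for one token position (illustrative only)."""
    # Hard-label cross-entropy, computed at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # KL(teacher || student) between the temperature-softened distributions (assumed direction).
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    # The T^2 factor keeps the soft-target gradients on a comparable scale as T changes.
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl
```

With alpha = 1 the loss reduces to plain cross-entropy, and when student and teacher logits are identical the KL term vanishes, which makes both settings handy sanity checks.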
4. Knowledge distillation III: did it work? Quality retained
The point of distillation is a small model that keeps most of the teacher's quality. BrewSLM reports quality_retained = student_score / teacher_score on the same eval, so you can judge the size-for-quality trade and decide whether the distilled student is worth shipping.
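The ratio above is simple enough to sketch directly. The helper below is a hypothetical stand-in, not BrewSLM's code; it only assumes that both scores come from the same eval and that higher is better.

```python
def quality_retained(student_score, teacher_score):
    """Fraction of the teacher's eval score that the student keeps."""
    if teacher_score <= 0:
        raise ValueError("teacher_score must be positive for the ratio to be meaningful")
    return student_score / teacher_score

# Hypothetical scores for illustration: the student keeps roughly 90% of the teacher's quality.
ratio = quality_retained(student_score=0.81, teacher_score=0.90)
```

Reading the ratio next to the size reduction is what turns it into a shipping decision: a much smaller student that retains most of the score is a different proposition from one that retains little.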
5. Preference tuning: DPO and ORPO
SFT teaches a model to produce correct outputs; preference tuning teaches it to prefer better outputs over worse ones. This lesson covers preference data of (chosen, rejected) pairs, the intuition behind DPO, how ORPO folds preference into a single stage, and when to reach for them over SFT.
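To make the DPO intuition concrete, here is a minimal sketch of the standard DPO loss for a single (chosen, rejected) pair. It assumes you already have summed sequence log-probabilities under the policy being trained and under a frozen reference model; the function name, argument names, and the beta default are illustrative, not the platform's API.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen_margin - rejected_margin))) for one preference pair."""
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) written as softplus(-z), split by sign to avoid overflow.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

When the policy equals the reference the loss is log 2; it falls as the policy shifts probability toward the chosen response relative to the rejected one.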
6. Quantization & compression: Q4_K_M, AWQ, GPTQ, GGUF, ONNX
Quantization stores weights in fewer bits to shrink a model and speed up inference. This lesson covers post-training quantization, the common methods (Q4_K_M k-quants, AWQ, GPTQ), the container formats (GGUF, ONNX, safetensors), the quality-size-speed trade, and how BrewSLM's export tracks the variants.
7. Multi-task training and curriculum
One model can learn several tasks at once, and the order and balance of training data matters. This lesson covers multi-task training and task interference, data balancing, and curriculum learning — ordering examples from easier to harder — plus how to A/B a curriculum on the platform.
8. Serving and inference optimization
Training is half the job; serving the model fast and cheaply is the other half. This lesson covers the inference server (vLLM), the KV cache, continuous batching, the latency-versus-throughput trade, and how quantized serving lowers cost.
9. Observability and drift in production
Shipping a model isn't the end — production quality decays as the world changes. This lesson covers the RunEvent audit spine, scheduled drift checks that re-run the gold set against the live endpoint, the support bundle, and how to decide when to retrain.
10. Capstone C: choosing the right technique, and graduating
The decision guide: given a goal, which technique from the whole Academy do you reach for? This lesson maps goals to SFT, preference tuning, distillation, quantization, multi-task, RAG, and serving choices.
11. Production feedback loop: log to tag to augment to retrain
Lesson 4.9's drift detection tells you the model has gotten worse; this lesson supplies the data that fixes it. Log requests/responses with PII redaction and a sampling budget, tag bad rows (manually, from drift signals, from user feedback), convert tags into SFT examples, and retrain on a measured cadence.
12. Tool-use / function-calling fine-tuning
Fine-tune a small router to emit valid tool calls (JSON with tool name + arguments) for a fixed tool catalogue. Tool schemas go in the system message, an explicit no-tool path is trained on negatives, and evaluation uses valid-tool-call rate plus argument-match accuracy: the two-number honest report, tool-flavoured.
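The two metrics can be sketched with nothing but `json`. Everything named here is an assumption for illustration: the catalogue, the `"tool"` / `"arguments"` keys, the exact-match rule, and the choice to divide both metrics by the total number of examples. The no-tool path is not modelled.

```python
import json

# Hypothetical catalogue: tool name -> the exact set of argument names it expects.
CATALOGUE = {"get_weather": {"city"}, "search_docs": {"query"}}

def is_valid_call(raw):
    """True if raw parses as JSON naming a known tool with exactly its expected arguments."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(call, dict):
        return False
    name, args = call.get("tool"), call.get("arguments")
    if not isinstance(name, str) or not isinstance(args, dict):
        return False
    return name in CATALOGUE and set(args) == CATALOGUE[name]

def tool_call_report(outputs, expected):
    """Return (valid-tool-call rate, argument-match accuracy) over all examples."""
    if not outputs or len(outputs) != len(expected):
        raise ValueError("outputs and expected must be non-empty and the same length")
    valid = [is_valid_call(o) for o in outputs]
    # A match requires a valid call whose tool and arguments equal the expected call.
    matched = [v and json.loads(o) == e for o, e, v in zip(outputs, expected, valid)]
    return sum(valid) / len(outputs), sum(matched) / len(outputs)
```

Reporting the two numbers separately shows whether failures come from malformed calls or from well-formed calls with the wrong arguments.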
13. Structured pruning: removing heads, layers, and channels
Physically remove whole heads, layers, or channels from a trained model, then run a short SFT or distillation pass to recover quality. This lesson covers why unstructured pruning rarely pays off on commodity GPUs, three importance-scoring families (magnitude, gradient × weight, Hessian-based), the prune-to-target then recover workflow, and when it's worth attempting on an SLM (rarely, and usually only after quantization and distillation have been pushed).
14. Speculative decoding: draft, verify, accept
An inference-time speed-up: a small draft model proposes K tokens, the target verifies them in a single forward pass, and the accepted tokens are committed. This lesson covers acceptance-rate math, the same-tokenizer constraint, the EAGLE / Medusa draft-model family, and vLLM and llama.cpp tooling, with the honest beat that on a 135M–1.7B target, KV-cache and batching move the latency needle more than speculative decoding does.
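The acceptance-rate math can be sketched under one simplifying assumption: each drafted token is accepted independently with the same probability alpha. The expected number of tokens committed per target forward pass is then (1 − alpha^(K+1)) / (1 − alpha), and dividing by the relative cost of drafting gives a rough speed-up. The function names and the `draft_cost_ratio` parameter are illustrative.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens committed per target forward pass with k drafted tokens."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the target's own token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, draft_cost_ratio):
    """Rough wall-clock speed-up; draft_cost_ratio = one draft step / one target step."""
    # One verify step costs 1 target pass plus k draft passes.
    return expected_tokens_per_step(alpha, k) / (k * draft_cost_ratio + 1)
```

The second function is where the small-target caveat shows up: when the draft is not much cheaper than the target, the denominator grows and the estimated speed-up shrinks toward 1 or below.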
15. Reasoning training: CoT SFT, process supervision, GRPO
"Reasoning training" is three techniques: chain-of-thought SFT (train on prompt → trace → answer), process supervision with a step-level process reward model (PRM), and outcome-supervised RL against a verifier or outcome reward model (ORM), typically run with GRPO at SLM scale (no value network, group-relative advantage). The honest beat: small-model reasoning claims are often just format-following, and verifier-grounded eval is the only way to tell.
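The "group-relative advantage" idea is small enough to sketch: sample several completions for one prompt, score each with the verifier, and normalise the rewards within that group instead of learning a value network. The version below assumes the common mean-and-standard-deviation normalisation with a population standard deviation and a small epsilon; real implementations differ in these details, and the function name is hypothetical.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise one prompt's group of rewards to zero mean and (roughly) unit scale."""
    if not rewards:
        raise ValueError("rewards must contain at least one sampled completion")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps keeps the division defined when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled completions for one prompt; the verifier passed the first and last.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

If every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, one reason verifier quality and prompt difficulty matter at this scale.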