Track 4 · Advanced

Advanced

Beyond a single fine-tune: distillation, preference tuning, quantization, multi-task and curriculum, serving, and production observability — then the model-efficiency duo (structured pruning, speculative decoding) and reasoning training (CoT SFT, PRM/ORM, GRPO) before graduation.

  1. Beyond a single SFT run: the advanced toolkit

    You can fine-tune a model and ship it. This final track covers the techniques for when one LoRA SFT run isn't enough — distillation to shrink a model, preference tuning to align it, quantization to compress it, multi-task and curriculum to broaden it, and serving plus observability to run it in production.

  2. Knowledge distillation I: the teacher and capturing its logits

    Knowledge distillation trains a small student to mimic a larger teacher. This lesson covers why distillation works, the offline workflow BrewSLM uses, the same-tokenizer assumption, and how the teacher's top-k logprobs are captured to teacher_capture.jsonl so the expensive teacher runs only once.
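
    The capture step can be pictured with a minimal sketch. The record layout here (`prompt_id`, `top_k`, `token_id`, `logprob`) is hypothetical; the lesson description names only the teacher_capture.jsonl file, not its schema. The toy logits stand in for a real teacher's output.

```python
import json
import math

def top_k_logprobs(logits, k):
    """Return the k most likely (token_id, logprob) pairs for one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # stable log-sum-exp
    logprobs = [x - log_z for x in logits]
    ranked = sorted(range(len(logits)), key=lambda i: logprobs[i], reverse=True)
    return [(i, logprobs[i]) for i in ranked[:k]]

def capture_record(prompt_id, per_position_logits, k):
    """Build one JSON-serializable record (hypothetical schema)."""
    return {
        "prompt_id": prompt_id,
        "top_k": [
            [{"token_id": t, "logprob": lp} for t, lp in top_k_logprobs(row, k)]
            for row in per_position_logits
        ],
    }

# Toy teacher logits: two positions over a 4-token vocabulary.
record = capture_record("ex-0", [[2.0, 1.0, 0.1, -1.0], [0.0, 3.0, 0.5, 0.2]], k=2)
line = json.dumps(record)  # one line of a JSONL capture file
```

    Storing only the top-k entries keeps the file small, at the cost of discarding the tail of the teacher's distribution.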

  3. Knowledge distillation II: the KD loss

    The distillation objective blends two losses: hard-label cross-entropy and a temperature-softened KL divergence to the teacher, as alpha·CE + (1−alpha)·T²·KL. This lesson explains temperature, what each term contributes, and shows the loss in code.
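
    As a rough illustration of the blended objective, the sketch below computes alpha·CE + (1−alpha)·T²·KL for a single token position in plain Python. It assumes the KL term is KL(teacher ‖ student) over full, temperature-softened distributions; a real run on top-k captured logprobs would need to handle the truncated tail, which this sketch ignores.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    # Hard-label cross-entropy, computed at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # KL(teacher || student) on temperature-softened distributions (assumed direction).
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    # T^2 compensates for the 1/T^2 scaling of the soft-target gradients.
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl

loss = kd_loss([1.0, 0.5, -0.2], [2.0, 0.1, -1.0], label=0)
```

    With alpha = 1 the loss reduces to ordinary cross-entropy; with alpha = 0 the student learns only from the teacher's softened distribution.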

  4. Knowledge distillation III: did it work? Quality retained

    The point of distillation is a small model that keeps most of the teacher's quality. BrewSLM reports quality_retained = student_score / teacher_score on the same eval, so you can judge the size-for-quality trade and decide whether the distilled student is worth shipping.
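
    The ratio is simple enough to show directly. The scores below are made-up inputs for illustration, not results from any run.

```python
def quality_retained(student_score, teacher_score):
    """student_score / teacher_score, both measured on the same eval."""
    if teacher_score <= 0:
        raise ValueError("teacher_score must be positive")
    return student_score / teacher_score

# Hypothetical scores on a shared eval set.
retained = quality_retained(student_score=0.78, teacher_score=0.84)
```

    A value near 1.0 means the student gave up little quality; how much loss is acceptable depends on how much smaller and cheaper the student is.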

  5. Preference tuning: DPO and ORPO

    SFT teaches a model to produce correct outputs; preference tuning teaches it to prefer better outputs over worse ones. This lesson covers preference data of (chosen, rejected) pairs, the intuition behind DPO, how ORPO folds preference into a single stage, and when to reach for them over SFT.
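
    The DPO intuition can be sketched for a single (chosen, rejected) pair. The inputs are summed sequence log-probabilities under the policy being trained and a frozen reference model; the function name and the example values are illustrative, not taken from the lesson.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair."""
    # How much more the policy favours chosen over rejected than the reference does.
    margin = (policy_chosen_lp - ref_chosen_lp) - (policy_rejected_lp - ref_rejected_lp)
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

# Policy has moved toward the chosen answer and away from the rejected one.
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0, beta=0.1)
```

    The loss falls as the policy widens the chosen-over-rejected gap relative to the reference, and beta controls how strongly deviations from the reference are rewarded.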

  6. Quantization & compression: Q4_K_M, AWQ, GPTQ, GGUF, ONNX

    Quantization stores weights in fewer bits to shrink a model and speed up inference. This lesson covers post-training quantization, the common methods (Q4_K_M k-quants, AWQ, GPTQ), the container formats (GGUF, ONNX, safetensors), the quality-size-speed trade, and how BrewSLM's export tracks the variants.
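
    To make "fewer bits" concrete, here is a deliberately simplified sketch: symmetric round-to-nearest 4-bit quantization of one block of weights with a single shared scale. This is not Q4_K_M, AWQ, or GPTQ, all of which use more elaborate schemes; it only shows the basic store-codes-plus-scale idea. The weights are toy values.

```python
def quantize_block(weights, bits=4):
    """Map a block of floats to signed integer codes plus one float scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit symmetric codes
    absmax = max(abs(w) for w in weights)
    scale = absmax / qmax if absmax > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return scale, codes

def dequantize_block(scale, codes):
    """Recover approximate weights from codes and scale."""
    return [scale * c for c in codes]

block = [0.12, -0.70, 0.33, 0.05, -0.41, 0.68, 0.00, -0.09]
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_error = max(abs(a - b) for a, b in zip(block, restored))
```

    The rounding error per weight is bounded by half the scale, so blocks with a large outlier lose more precision on their small weights. That is one motivation for the smarter methods the lesson covers.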

  7. Multi-task training and curriculum

    One model can learn several tasks at once, and the order and balance of training data matter. This lesson covers multi-task training and task interference, data balancing, and curriculum learning (ordering examples from easier to harder), plus how to A/B a curriculum on the platform.
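
    Both ideas can be sketched in a few lines. The difficulty proxy (answer length) and the temperature-based mixing rule are assumptions chosen for illustration; the lesson may use different heuristics. Task names and sizes are invented.

```python
def curriculum_order(examples, difficulty):
    """Order examples from easier to harder under a caller-supplied score."""
    return sorted(examples, key=difficulty)

def balanced_task_weights(task_sizes, temperature=2.0):
    """Sampling weights proportional to size ** (1 / temperature).

    temperature=1 samples proportionally to task size; higher values
    flatten the mix so small tasks are seen more often.
    """
    scaled = {task: n ** (1.0 / temperature) for task, n in task_sizes.items()}
    total = sum(scaled.values())
    return {task: s / total for task, s in scaled.items()}

examples = [
    {"prompt": "Explain TCP slow start.", "answer": "A multi-sentence explanation ..."},
    {"prompt": "2 + 2 = ?", "answer": "4"},
]
ordered = curriculum_order(examples, difficulty=lambda ex: len(ex["answer"]))
weights = balanced_task_weights({"summarize": 9000, "classify": 1000})
```

    With these sizes, proportional sampling would give the small task 10% of batches; the flattened mix gives it about 25%.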

  8. Serving and inference optimization

    Training is half the job; serving the model fast and cheaply is the other half. This lesson covers the inference server (vLLM), the KV cache, continuous batching, the latency-versus-throughput trade, and how quantized serving lowers cost.
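
    The KV cache is often what limits batch size, and its footprint is easy to estimate. The sketch below uses the standard accounting (one key and one value vector per layer, per KV head, per token); the model shape in the example is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every token of every sequence."""
    # Leading factor 2: one tensor for keys, one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical small model: 24 layers, 8 KV heads of dim 64, fp16 cache,
# serving 16 concurrent sequences of 4096 tokens each.
total = kv_cache_bytes(n_layers=24, n_kv_heads=8, head_dim=64,
                       seq_len=4096, batch_size=16, bytes_per_elem=2)
gib = total / 2 ** 30
```

    Under these assumptions the cache alone takes 3 GiB, which shows why sequence length and concurrency, not just weight size, drive serving memory.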

  9. Observability and drift in production

    Shipping a model isn't the end — production quality decays as the world changes. This lesson covers the RunEvent audit spine, scheduled drift checks that re-run the gold set against the live endpoint, the support bundle, and how to decide when to retrain.
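
    A scheduled drift check might look roughly like the sketch below. Everything here is illustrative: `call_endpoint` is a hypothetical callable standing in for the live endpoint, exact-match scoring is an assumed metric, and the tolerance is arbitrary.

```python
def drift_check(gold_set, call_endpoint, baseline_accuracy, tolerance=0.02):
    """Re-run the gold set and flag drift if accuracy falls below baseline - tolerance."""
    correct = sum(
        1 for prompt, expected in gold_set
        if call_endpoint(prompt).strip() == expected  # exact match (assumed metric)
    )
    accuracy = correct / len(gold_set)
    return {
        "accuracy": accuracy,
        "baseline": baseline_accuracy,
        "drifted": accuracy < baseline_accuracy - tolerance,
    }

# Stub endpoint for demonstration: answers one gold question wrongly.
canned = {"2+2": "4", "capital of France": "Paris", "3*3": "6", "sky color": "blue "}
gold = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9"), ("sky color", "blue")]
report = drift_check(gold, call_endpoint=canned.__getitem__, baseline_accuracy=1.0)
```

    The returned dictionary is the kind of record that could be attached to an audit trail and compared across scheduled runs.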

  10. Capstone C: choosing the right technique, and graduating

    The decision guide: given a goal, which technique from the whole Academy do you reach for? This lesson maps goals to SFT, preference tuning, distillation, quantization, multi-task, RAG, and serving choices.

  11. Production feedback loop: log to tag to augment to retrain

    Lesson 4.9's drift detection tells you the model has gotten worse; this lesson supplies the data that fixes it. Log requests/responses with PII redaction and a sampling budget, tag bad rows (manually, from drift signals, from user feedback), convert tags into SFT examples, and retrain on a measured cadence.
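
    The log, tag, and augment steps can be sketched as follows. The regexes are intentionally simple and would miss many kinds of PII; the sampling budget is modelled as a fixed sample rate; and the row fields (`tag`, `corrected`) are hypothetical names for illustration.

```python
import random
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace email addresses and phone-like digit runs with placeholders."""
    return PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", text))

def maybe_log(request, response, sample_rate, rng, sink):
    """Log a redacted row with probability sample_rate."""
    if rng.random() < sample_rate:
        sink.append({"request": redact(request), "response": redact(response),
                     "tag": None, "corrected": None})

def to_sft_examples(rows):
    """Keep rows tagged bad that also have a reviewer-supplied correction."""
    return [{"prompt": r["request"], "completion": r["corrected"]}
            for r in rows if r["tag"] == "bad" and r["corrected"]]

log = []
maybe_log("Mail me at jo@example.com or call +1 415 555 0100", "Sure.",
          sample_rate=1.0, rng=random.Random(0), sink=log)
log[0]["tag"], log[0]["corrected"] = "bad", "I can't contact you directly."
sft_rows = to_sft_examples(log)
```

    Redacting before the row is stored, rather than at export time, means raw PII never reaches the log.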

  12. Tool-use / function-calling fine-tuning

    Fine-tune a small router to emit valid tool calls (JSON with a tool name and arguments) for a fixed tool catalogue. Tool schemas go in the system message, an explicit no-tool path is trained on negative examples, and the model is evaluated on valid-tool-call rate and argument-match accuracy: the two-number honest report, tool-flavoured.
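
    The two evaluation numbers can be computed with a small scorer like the sketch below. The call format (`{"tool": ..., "arguments": {...}}`, with `"tool": null` for the no-tool path), the catalogue, and the exact-match definition of argument accuracy are all assumptions for illustration.

```python
import json

# Hypothetical catalogue: tool name -> required argument names.
CATALOGUE = {"get_weather": {"city"}, "convert_currency": {"amount", "from", "to"}}

def parse_call(raw, catalogue):
    """Return a normalized call if raw is valid against the catalogue, else None."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    tool = call.get("tool")
    if tool is None:                          # explicit no-tool path
        return {"tool": None, "arguments": {}}
    args = call.get("arguments")
    if not isinstance(tool, str) or tool not in catalogue or not isinstance(args, dict):
        return None
    if set(args) != catalogue[tool]:          # missing or unexpected arguments
        return None
    return {"tool": tool, "arguments": args}

def score(outputs, references, catalogue):
    """Return (valid-tool-call rate, argument-match accuracy)."""
    parsed = [parse_call(o, catalogue) for o in outputs]
    valid_rate = sum(p is not None for p in parsed) / len(outputs)
    match_rate = sum(p == r for p, r in zip(parsed, references)) / len(outputs)
    return valid_rate, match_rate

outputs = ['{"tool": "get_weather", "arguments": {"city": "Oslo"}}',
           '{"tool": "get_weather", "arguments": {"city": "Rome"}}',
           'not json',
           '{"tool": null}']
references = [{"tool": "get_weather", "arguments": {"city": "Oslo"}},
              {"tool": "get_weather", "arguments": {"city": "Paris"}},
              {"tool": None, "arguments": {}},
              {"tool": None, "arguments": {}}]
valid_rate, match_rate = score(outputs, references, CATALOGUE)
```

    Reporting both numbers matters because a model can emit well-formed calls every time and still pick the wrong arguments.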

  13. Structured pruning: removing heads, layers, and channels

    Physically remove whole heads, layers, or channels from a trained model, then run a short SFT or distillation pass to recover quality. This lesson covers why unstructured pruning rarely pays off on commodity GPUs, three importance-scoring families (magnitude, gradient × weight, Hessian-based), the prune-to-target-then-recover workflow, and when it's worth attempting on an SLM (rarely, and usually only after quantization and distillation have been pushed).
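
    Of the three scoring families, magnitude is the simplest to sketch. The example below scores each attention head by the L2 norm of its weights and keeps the top-scoring heads. The toy 2×2 matrices stand in for real per-head weight slices, and the recovery pass that follows pruning is not shown.

```python
import math

def head_importance(head_weights):
    """Magnitude score per head: L2 norm over all of that head's weights."""
    return [math.sqrt(sum(w * w for row in head for w in row))
            for head in head_weights]

def heads_to_keep(head_weights, keep):
    """Indices of the `keep` highest-scoring heads, in original order."""
    scores = head_importance(head_weights)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])

# Four toy heads; heads 1 and 3 have near-zero weights.
heads = [
    [[0.9, -0.8], [0.7, 0.6]],
    [[0.01, 0.02], [0.0, -0.01]],
    [[0.5, 0.4], [-0.3, 0.2]],
    [[0.05, -0.04], [0.03, 0.02]],
]
kept = heads_to_keep(heads, keep=2)
```

    Magnitude is cheap because it needs no data, but it can misjudge heads whose small weights still matter, which is where gradient-based and Hessian-based scores come in.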

  14. Speculative decoding: draft, verify, accept

    An inference-time speed-up: a small draft model proposes K tokens, the target verifies them in a single forward pass, and the accepted tokens are committed. This lesson covers acceptance-rate math, the same-tokenizer constraint, the EAGLE / Medusa draft-model family, and vLLM and llama.cpp tooling, along with the honest beat that on a 135M-1.7B target, KV-cache and batching move the latency needle more than speculative decoding does.
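
    The acceptance-rate math can be sketched under the usual simplifying assumption that each drafted token is accepted independently with probability alpha. The expected tokens committed per target forward pass is then (1 − alpha^(K+1)) / (1 − alpha), and the speed-up depends on how cheap the draft is relative to the target. The `draft_cost_ratio` parameter name is illustrative.

```python
def expected_tokens_per_step(alpha, k):
    """Expected tokens committed per target pass with k drafted tokens."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        return float(k + 1)             # every draft accepted, plus the target's own token
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def estimated_speedup(alpha, k, draft_cost_ratio):
    """Speed-up over plain decoding; draft_cost_ratio = draft pass cost / target pass cost."""
    return expected_tokens_per_step(alpha, k) / (k * draft_cost_ratio + 1)

tokens = expected_tokens_per_step(alpha=0.5, k=4)
speedup = estimated_speedup(alpha=0.5, k=4, draft_cost_ratio=0.25)
```

    In this example the draft costs a quarter of a target pass and the estimated speed-up drops below 1. That is the situation a small target faces, because its draft model cannot be much cheaper than the target itself.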

  15. Reasoning training: CoT SFT, process supervision, GRPO

    "Reasoning training" is three techniques: chain-of-thought SFT (train on prompt → trace → answer), process supervision with a step-level reward model (PRM), and outcome-supervised RL with a verifier (ORM), typically run with GRPO at SLM scale (no value network, group-relative advantage). The honest beat: small-model reasoning claims are often format-following, and verifier-grounded eval is the only way to tell.
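
    The "group-relative advantage" part of GRPO can be sketched in isolation: sample several completions for one prompt, score each with the verifier, and normalize the rewards within the group so no learned value network is needed. The use of population standard deviation and the epsilon value are assumptions; implementations differ on these details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rewards to zero mean and (roughly) unit scale."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)    # population std over the group (assumed)
    return [(r - mean) / (std + eps) for r in rewards]

# Verifier rewards for four sampled completions of the same prompt (1 = correct).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

    If every completion in a group gets the same reward, all advantages are zero and the prompt contributes no gradient, so prompts that are always solved or never solved teach the model nothing.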