Beyond a single SFT run: the advanced toolkit
After this lesson you can name the advanced techniques in this track and, given a goal — smaller, more aligned, cheaper to serve, broader, or more reliable in production — say which technique addresses it.
By the end of Track 3 you could take a task from raw data to a deployed endpoint — by hand and through the platform. That is the 80%. This track is the other 20%: the techniques you reach for when a single supervised fine-tune doesn't meet a harder constraint.
What 'advanced' actually means
Every technique here exists to satisfy a constraint SFT alone can't:
- Smaller / cheaper, same quality → distillation (lessons 4.2–4.4): train a small student to mimic a big teacher.
- Aligned to preferences, not just correct → preference tuning (4.5): DPO / ORPO on (chosen, rejected) pairs.
- Fits a device / serves cheaply → quantization & compression (4.6): 4-bit weights, GGUF / ONNX.
- One model, many tasks → multi-task & curriculum (4.7): balancing and ordering training signals.
- Fast, reliable in production → serving & observability (4.8–4.9): throughput, latency, drift.
Lesson 4.10 ties them together into a decision guide and concludes the course.
CONSTRAINT                         TECHNIQUE                 LESSONS
---------------------------------  ------------------------  --------
smaller / cheaper, ~same quality   distillation              4.2-4.4
aligned to preferences / tone      DPO / ORPO                4.5
fits a device / cheap to serve     quantization              4.6, 4.8
one model, several related tasks   multi-task + curriculum   4.7
fast & reliable in production      serving + observability   4.8-4.9
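The mapping above can be read as a plain lookup from constraint to technique. Here is a minimal sketch in Python. The short keys (`"smaller"`, `"aligned"`, and so on) and the function name `pick_techniques` are hypothetical labels chosen for illustration, not names used by the course or the platform.

```python
# Constraint -> (technique, lessons), copied from the table above.
# The dictionary keys are hypothetical short labels, not course terminology.
TECHNIQUES = {
    "smaller": ("distillation", "4.2-4.4"),
    "aligned": ("preference tuning (DPO / ORPO)", "4.5"),
    "fits-device": ("quantization", "4.6, 4.8"),
    "many-tasks": ("multi-task + curriculum", "4.7"),
    "production": ("serving + observability", "4.8-4.9"),
}


def pick_techniques(constraints):
    """Return (technique, lessons) for each constraint, in the order given."""
    unknown = [c for c in constraints if c not in TECHNIQUES]
    if unknown:
        raise KeyError(f"no technique listed for: {unknown}")
    return [TECHNIQUES[c] for c in constraints]


# A project can carry several constraints at once.
plan = pick_techniques(["smaller", "aligned", "fits-device"])
for technique, lessons in plan:
    print(f"{technique} (lessons {lessons})")
```

Starting from a list of constraints, rather than a list of techniques, mirrors the order of reasoning the lesson recommends.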
These compose
These techniques are not alternatives. A production small language model (SLM) often uses several: distill a frontier model into a small student, preference-tune it for tone, quantize it to 4-bit for the edge, and monitor it for drift. The skill is knowing which constraint you're solving for.
The prerequisite is everything before this
Each technique here assumes the fundamentals: you know what a loss is and how gradient descent uses it (Track 0), how SFT and the loss mask work (Track 1), how to run training and read its curves (Track 2), and how the platform's stages, gates, and events fit together (Track 3). Advanced methods are recombinations of those primitives, not new magic — distillation is a different loss, preference tuning is a different objective, quantization is a different number format.
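To make "a different loss, a different objective, a different number format" concrete, here is a rough sketch in pure Python on toy numbers. It assumes one common formulation of each idea: cross-entropy for SFT, a temperature-softened KL divergence for distillation, the DPO log-sigmoid objective on sequence log-probabilities, and symmetric round-to-nearest 4-bit quantization. The later lessons may use different variants, and all function names and default values here are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; higher temperature flattens them."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sft_loss(student_logits, target_index):
    """Plain SFT: cross-entropy against the single correct token."""
    return -math.log(softmax(student_logits)[target_index])


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Distillation (one common form): KL(teacher || student) on softened
    distributions, so the student matches the teacher's whole distribution."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: -log sigmoid(beta * (policy margin - reference margin)).
    Inputs are sequence log-probabilities; beta=0.1 is an assumed default."""
    margin = (pol_chosen - pol_rejected) - (ref_chosen - ref_rejected)
    return math.log1p(math.exp(-beta * margin))


def quantize_4bit(weights):
    """Symmetric 4-bit quantization: integer codes in [-7, 7] plus one scale."""
    scale = max(abs(w) for w in weights) / 7 or 1.0  # avoid a zero scale
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from codes and scale."""
    return [c * scale for c in codes]


student, teacher = [2.0, 0.5, -1.0], [3.0, 1.0, -2.0]
print("SFT loss:", sft_loss(student, target_index=0))
print("distillation loss:", distillation_loss(student, teacher))
print("DPO loss:", dpo_loss(-1.0, -3.0, -2.0, -2.0))
codes, scale = quantize_4bit([0.5, -1.0, 0.25, 0.0])
print("4-bit codes:", codes, "restored:", dequantize(codes, scale))
```

Each function reuses primitives from the earlier tracks: the same softmax, the same log-probabilities, the same weights. Only the training target or the storage format changes.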
Key idea
Advanced ≠ harder for its own sake. Each technique targets one constraint a plain SFT can't meet — size, alignment, cost, breadth, or reliability. Start from the constraint, pick the technique. The rest of this track is one technique per constraint.
Key terms
- distillation: Training a small student model to mimic a larger teacher, shrinking the model at near-equal quality.
- preference tuning: Aligning a model to preferred over dispreferred outputs (DPO / ORPO).
- quantization: Storing weights in fewer bits to shrink and speed up a model.
- multi-task training: Training one model on several tasks at once.
- curriculum: Ordering training data from easier to harder to improve learning.
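The last two terms are about how training data is arranged, which a short sketch can illustrate. This assumes a caller-supplied difficulty score (text length here, purely as a stand-in) and uniform sampling across tasks as one simple way to balance them. The function names and the toy data are hypothetical, and lesson 4.7 may use different strategies.

```python
import random


def curriculum_order(examples, difficulty):
    """Curriculum: sort examples from easier to harder by a difficulty score."""
    return sorted(examples, key=difficulty)


def balanced_mix(task_to_examples, n, seed=0):
    """Multi-task: draw n examples, choosing the task uniformly each time so a
    small task is not drowned out by a large one."""
    rng = random.Random(seed)  # seeded for reproducible draws
    tasks = sorted(task_to_examples)
    batch = []
    for _ in range(n):
        task = rng.choice(tasks)
        batch.append((task, rng.choice(task_to_examples[task])))
    return batch


# Toy data; length is only an assumed proxy for difficulty.
prompts = ["Summarize this long report in detail", "Say hi", "Translate: cat"]
print(curriculum_order(prompts, difficulty=len))

data = {
    "summarize": ["doc-1", "doc-2", "doc-3", "doc-4"],
    "classify": ["ticket-1"],
}
print(balanced_mix(data, n=6))
```

Uniform task sampling repeats examples from the smaller task more often, which is a trade-off rather than a free win.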