Are these advanced techniques mutually exclusive?

No. Production SLMs often combine several of them.

What's the right starting point for choosing a technique?

The constraint you're solving for (size, alignment, cost, breadth, reliability)

Track 4 · Advanced · Lesson 1

Beyond a single SFT run: the advanced toolkit

After this lesson you can name the advanced techniques in this track and, given a goal — smaller, more aligned, cheaper to serve, broader, or more reliable in production — say which technique addresses it.

Level: advanced · Read time: ~8 min · Prerequisites: Eval pack & failure cluster reference

By the end of Track 3 you could take a task from raw data to a deployed endpoint — by hand and through the platform. That is the 80%. This track is the other 20%: the techniques you reach for when a single supervised fine-tune doesn't meet a harder constraint.

What 'advanced' actually means

Every technique here exists to satisfy a constraint SFT alone can't:

Smaller / cheaper, near-equal quality → distillation (lessons 4.2–4.4): train a small student to mimic a big teacher.
Aligned to preferences, not just correct → preference tuning (4.5): DPO / ORPO on (chosen, rejected) pairs.
Fits a device / serves cheaply → quantization & compression (4.6): 4-bit weights, GGUF / ONNX.
One model, many tasks → multi-task & curriculum (4.7): balancing and ordering training signals.
Fast, reliable in production → serving & observability (4.8–4.9): throughput, latency, drift.

Lesson 4.10 ties them together into a decision guide and graduates the course.

CONSTRAINT                          TECHNIQUE                  LESSONS
---------------------------------   ------------------------   -------
smaller / cheaper, ~same quality    distillation               4.2-4.4
aligned to preferences / tone       DPO / ORPO                 4.5
fits a device / cheap to serve      quantization               4.6, 4.8
one model, several related tasks    multi-task + curriculum    4.7
fast & reliable in production       serving + observability    4.8-4.9
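Read as data, the table above is a lookup from constraint to technique. The sketch below is one hypothetical way to encode it; the short keys (`size`, `alignment`, `cost`, `breadth`, `reliability`) and the function name `pick_techniques` are illustrative assumptions, and collapsing each table row to a single keyword is a simplification, not part of the course platform.

```python
# Hypothetical encoding of the constraint table above.
# Keys are shorthand chosen for this sketch, not platform identifiers.
TECHNIQUES = {
    "size": ("distillation", "4.2-4.4"),
    "alignment": ("DPO / ORPO", "4.5"),
    "cost": ("quantization", "4.6, 4.8"),
    "breadth": ("multi-task + curriculum", "4.7"),
    "reliability": ("serving + observability", "4.8-4.9"),
}


def pick_techniques(constraints):
    """Return (constraint, technique, lessons) tuples in the order given."""
    plan = []
    for constraint in constraints:
        if constraint not in TECHNIQUES:
            raise KeyError(f"unknown constraint: {constraint!r}")
        technique, lessons = TECHNIQUES[constraint]
        plan.append((constraint, technique, lessons))
    return plan


if __name__ == "__main__":
    # A project that must be both smaller and cheaper to serve.
    for row in pick_techniques(["size", "cost"]):
        print(row)
```

Passing several constraints returns several rows, since a single project can be bound by more than one constraint at once.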

These compose

They're not alternatives — a production SLM often uses several: distill a frontier model into a small student, preference-tune it for tone, quantize it to 4-bit for the edge, and watch it for drift. The skill is knowing which constraint you're solving for.

The prerequisite is everything before this

Each technique here assumes the fundamentals: you know what a loss is and how gradient descent uses it (Track 0), how SFT and the loss mask work (Track 1), how to run training and read its curves (Track 2), and how the platform's stages, gates, and events fit together (Track 3). Advanced methods are recombinations of those primitives, not new magic — distillation is a different loss, preference tuning is a different objective, quantization is a different number format.
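To make "a different loss" and "a different objective" concrete, here is a minimal pure-Python sketch for a single example. It assumes two common formulations that this lesson does not itself specify: distillation as the KL divergence between temperature-softened teacher and student distributions (scaled by T²), and the standard DPO objective over summed log-probabilities of a chosen and a rejected response. The function names, the temperature, and `beta` are illustrative choices; the later lessons may define these differently.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes moderate logits so no student probability underflows to 0.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * (math.log(pi) - math.log(qi))
             for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
             beta=0.1):
    """-log(sigmoid(margin)) for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy being trained
    and under a frozen reference model.
    """
    margin = beta * ((policy_chosen - ref_chosen)
                     - (policy_rejected - ref_rejected))
    # softplus(-margin) equals -log(sigmoid(margin)); this form is stable.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    print(distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0]))
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

Both functions still reduce to a scalar that gradient descent can minimize, which is the sense in which these methods recombine the primitives from earlier tracks.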

Key idea

Advanced ≠ harder for its own sake. Each technique targets one constraint a plain SFT can't meet — size, alignment, cost, breadth, or reliability. Start from the constraint, pick the technique. The rest of this track is one technique per constraint.

Key terms

distillation: Training a small student model to mimic a larger teacher, shrinking the model at near-equal quality.
preference tuning: Training a model to favor preferred outputs over dispreferred ones (DPO / ORPO).
quantization: Storing weights in fewer bits to shrink and speed up a model.
multi-task training: Training one model on several tasks at once.
curriculum: Ordering training data from easier to harder to improve learning.
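As an illustration of the quantization entry above, the sketch below round-trips a few weights through signed 4-bit integers. It assumes the simplest scheme, symmetric quantization with one scale for the whole tensor and integer levels in [-7, 7]. Real formats such as GGUF use more elaborate block-wise layouts, so treat this as a toy model of the idea rather than any library's method; `quantize_int4` and `dequantize` are hypothetical names.

```python
def quantize_int4(weights):
    """Map floats to integers in [-7, 7] using one shared scale (assumed scheme)."""
    max_abs = max(abs(w) for w in weights)
    # An all-zero tensor needs no scaling; avoid dividing by zero.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    quantized = [max(-7, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 4-bit integers."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.7, -0.3, 0.12, 0.0]
    q, scale = quantize_int4(weights)
    restored = dequantize(q, scale)
    # Each restored weight is within half a quantization step of the original.
    for w, r in zip(weights, restored):
        print(f"{w:+.3f} -> {r:+.3f}")
```

Each stored value needs only 4 bits plus a share of the scale, and the price is a rounding error of at most half a step per weight.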
