Track 1 · SFT fundamentals

Supervised fine-tuning fundamentals

This track teaches the mechanics of fine-tuning a base model on custom datasets: objectives, loss masks, chat templates, tokenization, LoRA, GPU memory, and the evaluation discipline that turns “looks better” into a defensible result.


Track overview · two videos

Track 1 splits into the data side (what you feed the trainer) and the training dynamics (what the trainer does with it). The lessons below are the deep dive.

Track 1A · The data side

Track 1B · The training dynamics

1. What is Supervised Fine-Tuning?
Supervised Fine-Tuning continues training a pretrained model on labeled input-output examples so it reliably produces the outputs you want. This lesson defines SFT precisely, says what it is good at, and — just as important — when not to use it.
2. Choosing the training objective: SFT, DPO, ORPO, RLHF
Beyond SFT there are preference-based objectives (DPO, ORPO, RLHF) and continued pretraining. This lesson explains what each optimizes, when comparative 'A is better than B' data beats single-target data, and how to choose.
3. Anatomy of an SFT example: prompt, completion, and the loss mask
An SFT example is a prompt plus a target completion, concatenated into one token sequence. The loss mask restricts the loss to the completion tokens, so the model learns to produce the answer rather than echo the question. This lesson dissects both and the mistakes that break them.
4. Chat templates & special tokens
Instruct models expect a precise format with role markers and special tokens. Using the wrong chat template silently wrecks quality. This lesson explains chat templates, apply_chat_template, the generation prompt, and why training and inference must match.
5. Task shapes: classification, QA, extraction, summarization, chat
A task shape dictates your data format, loss, and evaluation metric. This lesson surveys the common SFT shapes — classification, QA, extraction/NER, summarization, chat — and why the metric must match the shape.
6. Tokenization in practice: padding, truncation, packing
Training adds practical tokenization concerns beyond the basics: padding and attention masks, truncation and max_seq_length, sequence packing, and padding side. This lesson covers each and the data bug that truncation can silently introduce.
7. Data quality I: dedup, balance, leakage, and splits
Most of a fine-tune's quality comes from data, not hyperparameters. This lesson covers deduplication, class balance, the train/validation/test split, and the silent killer — data leakage between train and test.
8. Data quality II: gold sets
A gold set is a small, curated, never-trained-on set of examples with known-correct answers — your single source of truth for whether a model is good enough. This lesson explains how to build one and why it anchors every decision.
9. The training loop, step by step
An SFT training loop is the gradient-descent loop applied to batches of tokenized examples: forward pass, compute loss on the completion, backward pass, optimizer step, repeat — with checkpoints and validation along the way. This lesson walks the whole loop.
10. Cross-entropy loss for token prediction
Cross-entropy is the loss behind next-token training: it is small when the model assigns high probability to the correct next token and large when it does not. This lesson builds the intuition and connects it to perplexity.
11. Learning rate, schedules, warmup, epochs vs steps
The learning rate is the most important knob in fine-tuning. This lesson covers sensible LR ranges, warmup, decay schedules (cosine), and the difference between epochs and steps, plus how to tell when the LR is wrong.
12. Batch size, gradient accumulation, and the effective batch
Batch size affects gradient stability and memory. Gradient accumulation lets a small GPU simulate a large batch. This lesson explains the effective batch size and how it interacts with the learning rate.
13. Full fine-tuning vs LoRA
Full fine-tuning updates every parameter; LoRA freezes the base weights and trains small low-rank matrices added alongside them. This lesson explains how LoRA works, why it slashes memory and storage, and the trade-offs versus full fine-tuning.
14. LoRA knobs: rank, alpha, dropout, target modules, QLoRA
LoRA exposes a few settings: rank r, alpha, dropout, and which modules to adapt — plus QLoRA for quantized bases. This lesson explains what each knob does and sensible defaults.
15. GPU memory math
Knowing what consumes GPU memory — weights, gradients, optimizer state, and activations — lets you predict whether a run fits and why LoRA and mixed precision help. This lesson builds a back-of-envelope memory model.
16. OOM and how to survive it
CUDA out-of-memory is the most common fine-tuning failure. This lesson is a practical field guide: the levers (batch size, sequence length, gradient checkpointing, LoRA/QLoRA, precision) ordered by impact and cost.
17. Overfitting, underfitting, and reading a loss curve
A loss curve tells you whether a run is healthy, underfitting, or overfitting. This lesson teaches you to read training vs validation curves, recognize the diverging-validation signature of overfitting, and respond.
18. Evaluation that matches the task
Evaluation turns 'seems better' into a defensible number. This lesson covers choosing a metric that fits the task shape, computing it on the gold set, baselines and gates, and reporting honestly — including where the model loses.
19. Decoding controls: temperature, top-p, stop tokens
Decoding turns a probability distribution into text. Temperature, top-p, top-k, max_new_tokens, stop tokens, and the repetition penalty are not training knobs — they shape what an already-trained model produces, and most "bad output" complaints have a five-minute decoding fix.
20. Dataset formats in the wild
The dataset you receive is almost never in the format the trainer wants. This lesson covers the standard shapes (JSONL, completion, chat messages, Alpaca, ShareGPT, classification, extraction), what each is for, and how to convert between them in a few lines.
21. Continued pretraining: when SFT isn't enough
When the base model doesn't speak the language of your domain at the token level, SFT can't fix it. Continued pretraining (CPT) is the right tool, but only when that's the actual problem. This lesson covers the data shape, the recipe knobs (smaller LR, longer sequences, vocab-extension trade-offs), the two-stage CPT → SFT pipeline, and how to evaluate honestly by downstream task performance rather than CPT loss.
22. Catastrophic forgetting
Fine-tuning on narrow data can degrade abilities the base model already had. This lesson explains why SFT is especially susceptible and covers the standard mitigations: prefer LoRA, mix in a slice of broad data, stop early, evaluate broadly.
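
As a small taste of what lessons 3 and 10 cover, here is a minimal sketch of a completion-only loss and its perplexity in plain Python. It is an illustration under stated assumptions, not the code any particular trainer uses: the function names, the probabilities, and the mask are all made up for the example, and a real trainer works on logits over a full vocabulary rather than on a list of per-token probabilities.

```python
import math


def masked_cross_entropy(token_probs, loss_mask):
    """Mean negative log-likelihood over positions where loss_mask is 1.

    token_probs[i] is the probability the model assigned to the correct
    token at position i. loss_mask[i] is 1 for completion tokens (counted
    in the loss) and 0 for prompt tokens (ignored).
    """
    if len(token_probs) != len(loss_mask):
        raise ValueError("token_probs and loss_mask must have the same length")
    kept = [p for p, m in zip(token_probs, loss_mask) if m]
    if not kept:
        raise ValueError("loss mask selects no tokens")
    return -sum(math.log(p) for p in kept) / len(kept)


def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(mean_loss)


# Hypothetical sequence: 3 prompt tokens followed by 2 completion tokens.
probs = [0.1, 0.2, 0.3, 0.5, 0.25]
mask = [0, 0, 0, 1, 1]

loss = masked_cross_entropy(probs, mask)  # only the last two positions count
ppl = perplexity(loss)
print(f"loss={loss:.4f} perplexity={ppl:.4f}")
```

Setting the mask to all ones would also score the prompt tokens, which is the "echo the question" failure the loss mask is meant to prevent.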