Track 2 · Hands-on

Hands-on small language model fine-tuning

This is the runnable tutorial track: load a base model, build a custom dataset, apply LoRA or QLoRA, run the training loop in Python, evaluate the result, and ship an artifact you can test on your own task.

1. Set up the environment
Track 2 fine-tunes a small model by hand in PyTorch and Transformers. This first lesson sets up a clean Python environment with torch, transformers, peft, trl, datasets and accelerate, and verifies your GPU is visible.
2. Load a base model and tokenizer
Load SmolLM2-135M and its tokenizer with Transformers, inspect the config and parameter count, set the pad token, and run a quick generation to confirm the base model works before any fine-tuning.
3. Build a tiny SFT dataset
Construct a small supervised fine-tuning dataset of (prompt, completion) pairs for sentiment classification, wrap it in a datasets.Dataset, check class balance, and split into train and validation — applying Track 1's data-quality lessons in code.
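As a preview of this lesson, here is a minimal sketch of the class-balance check and a stratified train/validation split. It uses plain Python lists rather than a datasets.Dataset, and the example rows, the stratified_split helper, and the 25% validation fraction are all illustrative assumptions, not the lesson's actual data or code.

```python
import random
from collections import Counter, defaultdict

# Hypothetical toy rows; a real dataset would be larger and hand-checked.
examples = [
    {"prompt": "Review: I loved it.", "completion": "positive"},
    {"prompt": "Review: Works perfectly.", "completion": "positive"},
    {"prompt": "Review: Great value.", "completion": "positive"},
    {"prompt": "Review: Would buy again.", "completion": "positive"},
    {"prompt": "Review: Broke in a day.", "completion": "negative"},
    {"prompt": "Review: Waste of money.", "completion": "negative"},
    {"prompt": "Review: Very disappointing.", "completion": "negative"},
    {"prompt": "Review: Never again.", "completion": "negative"},
]

def stratified_split(rows, val_fraction=0.25, seed=0):
    """Split rows so each label keeps roughly the same share in both splits."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["completion"]].append(row)
    train, val = [], []
    for label, group in sorted(by_label.items()):
        group = group[:]          # copy so the caller's list is not reordered
        rng.shuffle(group)
        n_val = max(1, round(len(group) * val_fraction))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val

train_rows, val_rows = stratified_split(examples)
print("train:", Counter(r["completion"] for r in train_rows))
print("val:  ", Counter(r["completion"] for r in val_rows))
```

Splitting per label keeps a small validation set from ending up with only one class, which would make the validation metric meaningless.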
4. Tokenize and collate: model-ready batches with a loss mask
Turn (prompt, completion) pairs into input_ids and labels with the prompt masked to -100, using the chat template — then pad them into batches with DataCollatorForSeq2Seq. This is Track 1's loss mask, in code.
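The masking and padding described here can be sketched without any library. The token ids below are made up, the helper names build_example and pad_batch are hypothetical, and right-padding is an assumption; in the lesson itself a tokenizer produces the ids and DataCollatorForSeq2Seq does the padding.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default

def build_example(prompt_ids, completion_ids):
    """Concatenate prompt and completion; mask the prompt so it adds no loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}

def pad_batch(examples, pad_token_id):
    """Right-pad every example to the longest one; padded labels are masked too."""
    max_len = max(len(e["input_ids"]) for e in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for e in examples:
        n_real = len(e["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(e["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(e["labels"] + [IGNORE_INDEX] * n_pad)
    return batch

# Made-up token ids standing in for tokenizer output.
a = build_example(prompt_ids=[11, 12, 13], completion_ids=[21, 22])
b = build_example(prompt_ids=[11, 14], completion_ids=[23])
batch = pad_batch([a, b], pad_token_id=0)

for row in batch["labels"]:
    print(row)
```

Only the completion positions carry real labels, so the gradient comes from the tokens the model is supposed to produce and not from the prompt or the padding.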
5. A minimal LoRA fine-tune with the Trainer
Attach a LoRA adapter with peft, configure TrainingArguments with the hyperparameters from Track 1, and run the Hugging Face Trainer to fine-tune SmolLM2 — the whole training step in about 30 lines.
6. Run it: read the logs, the loss, the checkpoints
Run the fine-tune and interpret what the Trainer prints: the training loss trajectory, the per-epoch validation loss, and where checkpoints land. Apply Track 1's loss-curve reading to a real run.
7. Evaluate by hand: run the gold set, compute the metric
Load the LoRA adapter onto the base model, run predictions over your held-out gold set, compute accuracy, and compare against the untuned base — turning Track 1's evaluation principles into a few lines of code.
8. Merge the adapter, run inference, ship an artifact
Merge the LoRA adapter into the base weights to produce a standalone model, save it as safetensors, reload it for plain inference, and understand your export options (including GGUF for edge deployment).
9. Capstone A: fine-tune SmolLM2 end-to-end, by hand
The full by-hand pipeline in one runnable script: load SmolLM2, build and tokenize data with a loss mask, train a LoRA adapter, evaluate against the base on a gold set, and merge to a shippable model — with success criteria and what to do if it falls short.
10. SFT with TRL's SFTTrainer (the 20-line version)
TRL's SFTTrainer wraps the SFT loop from Lesson 2.5 behind a clean API that handles the chat template, the loss mask, and PEFT for you. The same fine-tune takes about 20 lines, with a note on what the wrapper hides that is still worth knowing.
11. QLoRA hands-on with bitsandbytes
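QLoRA is a frozen base model quantized to 4 bits with a LoRA adapter trained on top. Configure BitsAndBytesConfig with NF4, double quantization, and bf16 compute, call prepare_model_for_kbit_training, then train with SFTTrainer. This fits a model several times larger in the same VRAM; the lesson re-runs the evaluation to measure honestly what quantization costs in quality.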
12. Real metrics with sklearn & HF evaluate
Graduate from Lesson 2.7's hand-rolled accuracy to per-class precision, recall, and F1 with classification_report, the confusion matrix, macro vs micro F1 on imbalanced data, and HF evaluate for standardized, shareable metric implementations.
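To see why macro and micro F1 diverge on imbalanced data, here is a dependency-free sketch that computes both by hand. The labels and predictions are invented for illustration, and the helper names are hypothetical; in the lesson, sklearn's classification_report and f1_score would replace this code.

```python
from collections import Counter

def count_outcomes(y_true, y_pred):
    """Per-class true positives, false positives, and false negatives."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class p, but it was wrong
            fn[t] += 1   # true class t was missed
    return tp, fp, fn

def f1_from_counts(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    tp, fp, fn = count_outcomes(y_true, y_pred)
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_from_counts(tp[c], fp[c], fn[c]) for c in labels) / len(labels)

def micro_f1(y_true, y_pred):
    """F1 over pooled counts: every example counts equally."""
    tp, fp, fn = count_outcomes(y_true, y_pred)
    return f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))

# Invented imbalanced data: a model that always predicts the majority class.
y_true = ["neg"] * 8 + ["pos"] * 2
y_pred = ["neg"] * 10

print(f"micro F1: {micro_f1(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

The micro score rewards the majority-class guesser, while the macro score is dragged down by the class it never predicts. For single-label classification, micro F1 equals plain accuracy.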
13. Structured outputs with pydantic
JSON-emitting SLMs are a top use case. Validate outputs with a Pydantic schema, compute the valid-JSON rate, and measure per-field accuracy on the parses that succeed — the honest two-number report.
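The two-number report can be sketched as follows. To stay dependency-free, this version swaps Pydantic for a hand-written type check using the standard json module; the schema, the model outputs, and the gold records are all hypothetical.

```python
import json

# Hypothetical schema: field name -> required Python type.
SCHEMA = {"sentiment": str, "product": str}

def parse_and_validate(raw):
    """Return the parsed dict if raw is valid JSON matching SCHEMA, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or set(obj) != set(SCHEMA):
        return None
    if not all(isinstance(obj[name], typ) for name, typ in SCHEMA.items()):
        return None
    return obj

def two_number_report(raw_outputs, gold):
    """Valid-JSON rate over all outputs, plus per-field accuracy over valid parses only."""
    parsed = [parse_and_validate(raw) for raw in raw_outputs]
    valid = [(p, g) for p, g in zip(parsed, gold) if p is not None]
    valid_rate = len(valid) / len(raw_outputs)
    field_accuracy = {
        name: sum(p[name] == g[name] for p, g in valid) / len(valid)
        for name in SCHEMA
    } if valid else {}
    return valid_rate, field_accuracy

# Invented model outputs and gold records.
raw_outputs = [
    '{"sentiment": "positive", "product": "phone"}',
    '{"sentiment": "negative", "product": "laptop"}',
    'sentiment: positive',            # not JSON at all
    '{"sentiment": "positive"}',      # valid JSON, but a field is missing
]
gold = [
    {"sentiment": "positive", "product": "phone"},
    {"sentiment": "negative", "product": "tablet"},
    {"sentiment": "positive", "product": "phone"},
    {"sentiment": "positive", "product": "camera"},
]

valid_rate, field_accuracy = two_number_report(raw_outputs, gold)
print("valid-JSON rate:", valid_rate)
print("per-field accuracy:", field_accuracy)
```

Reporting both numbers matters because field accuracy alone hides the outputs that never parsed.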
14. Multi-turn chat SFT
Train on multi-turn conversations with the loss mask applied to every assistant turn (not just the last). Includes the mask-verification check and the honest framing of what multi-turn training does and doesn't teach.
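A rough sketch of the per-turn mask and a verification check, assuming each turn arrives already tokenized. The token ids, the (role, ids) representation, and both helper names are hypothetical; a real pipeline would also have to account for the template tokens the chat template inserts around each turn.

```python
IGNORE_INDEX = -100  # label value that the loss skips

def mask_multi_turn(turns):
    """turns is a list of (role, token_ids). Supervise every assistant turn, mask the rest."""
    input_ids, labels = [], []
    for role, ids in turns:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels

def verify_mask(turns, labels):
    """True when the supervised tokens are exactly the assistant tokens, in order."""
    supervised = [tok for tok in labels if tok != IGNORE_INDEX]
    expected = [tok for role, ids in turns if role == "assistant" for tok in ids]
    return supervised == expected

# Made-up token ids for a two-exchange conversation.
conversation = [
    ("user", [1, 2]),
    ("assistant", [3, 4]),
    ("user", [5]),
    ("assistant", [6, 7, 8]),
]

input_ids, labels = mask_multi_turn(conversation)
print(labels)
print("mask ok:", verify_mask(conversation, labels))

# A mask that supervises only the last assistant turn fails the check.
last_turn_only = [IGNORE_INDEX] * 5 + [6, 7, 8]
print("last-turn-only ok:", verify_mask(conversation, last_turn_only))
```

Running a check like this on a few real examples before training is cheap, and it catches the common mistake of silently dropping earlier assistant turns from the loss.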
15. Project gallery: 6 SLM use cases as recipes
Six concrete projects (sentiment, intent, JSON extractor, PII detector, FAQ assistant with RAG, tool-call generator) as recipes: dataset shape, scoring mode, and the one twist that matters.
16. LLM-as-a-judge: scoring free-form outputs
For tasks where the output is free-form and no exact-match metric applies. Covers pairwise A/B vs absolute scoring, the JSON-schema rubric pattern, the three classic biases (position, length, style), judge-model selection, and calibration against a small human-rated set, with the honest caveat that LLM judges agree with each other more than with humans.
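One common mitigation for position bias in pairwise judging is to ask twice with the answer order swapped and keep the verdict only when both runs agree. The sketch below assumes a hypothetical judge callable that returns "first" or "second"; the two stand-in judges are toy functions for demonstration, where a real judge would call an LLM.

```python
def pairwise_with_swap(judge, prompt, answer_a, answer_b):
    """Judge A vs B in both orders; disagreement between the runs is reported as a tie."""
    run_1 = judge(prompt, answer_a, answer_b)   # A is shown first
    run_2 = judge(prompt, answer_b, answer_a)   # A is shown second
    a_wins_run_1 = run_1 == "first"
    a_wins_run_2 = run_2 == "second"
    if a_wins_run_1 and a_wins_run_2:
        return "A"
    if not a_wins_run_1 and not a_wins_run_2:
        return "B"
    return "tie"

# Stand-in judges, for illustration only.
def position_biased_judge(prompt, first, second):
    """Always prefers whatever is shown first."""
    return "first"

def length_biased_judge(prompt, first, second):
    """Always prefers the longer answer."""
    return "first" if len(first) >= len(second) else "second"

prompt = "Summarize the review."
answer_a = "short"
answer_b = "a much longer answer"

print(pairwise_with_swap(position_biased_judge, prompt, answer_a, answer_b))
print(pairwise_with_swap(length_biased_judge, prompt, answer_a, answer_b))
```

The swap exposes the position-biased judge, since its two runs disagree, but the length-biased judge still picks the longer answer in both orders. Swapping does nothing for length or style bias, which is why calibration against human ratings is still needed.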
17. Public benchmarks & lm-evaluation-harness
MMLU, HellaSwag, ARC, GSM8K and EleutherAI's lm-evaluation-harness as smoke checks — not as the truth. Few-shot vs zero-shot, how to interpret the numbers, why your gold set still matters more, and the data-contamination caveats that quietly distort headline scores.
18. Experiment tracking with MLflow & W&B
Why ad-hoc training loses to bookkeeping every time. Compares MLflow, W&B, and TensorBoard, lists what to log on every run (config, code SHA, dataset hash, per-step loss and eval metrics, system metrics, artifacts), and explains the reproducible-vs-replicable distinction. Closes Track 2.
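Whichever tracker you choose, the per-run record starts from the same few fields. Here is a minimal, tracker-agnostic sketch using only the standard library; the helper names, the config values, and the placeholder code SHA are hypothetical, and a real run would pass the resulting record to MLflow or W&B instead of printing it.

```python
import hashlib
import json

def dataset_hash(rows):
    """SHA-256 over the rows, stable regardless of dict key order."""
    digest = hashlib.sha256()
    for row in rows:
        line = json.dumps(row, sort_keys=True, ensure_ascii=False)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()

def run_record(config, rows, code_sha):
    """Collect the identifying facts of a run into one JSON-serializable dict."""
    return {
        "config": config,
        "code_sha": code_sha,
        "dataset_sha256": dataset_hash(rows),
        "n_examples": len(rows),
    }

# Hypothetical values for illustration.
rows = [
    {"prompt": "Review: I loved it.", "completion": "positive"},
    {"prompt": "Review: Broke in a day.", "completion": "negative"},
]
config = {"learning_rate": 2e-4, "lora_r": 8, "epochs": 3}

record = run_record(config, rows, code_sha="<git commit SHA goes here>")
print(json.dumps(record, indent=2))
```

Hashing the dataset means that two runs with different results can be told apart by their data as well as their config, even when the file name has not changed.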