Track 2 · Hands-on

Fine-tune small models in Python, with runnable code at every step. The track covers environment setup, data, the SFT loop (raw Trainer and TRL's SFTTrainer), QLoRA, real metrics with sklearn and HF evaluate, structured outputs, multi-turn chat, and a project gallery of six SLM use cases. It closes with the evaluation-rigour layer: LLM-as-a-judge, lm-evaluation-harness, and experiment tracking with MLflow / W&B.

  1. Set up the environment

    Track 2 fine-tunes a small model by hand in PyTorch and Transformers. This first lesson sets up a clean Python environment with torch, transformers, peft, trl, datasets and accelerate, and verifies your GPU is visible.

  2. Load a base model and tokenizer

    Load SmolLM2-135M and its tokenizer with Transformers, inspect the config and parameter count, set the pad token, and run a quick generation to confirm the base model works before any fine-tuning.

  3. Build a tiny SFT dataset

    Construct a small supervised fine-tuning dataset of (prompt, completion) pairs for sentiment classification, wrap it in a datasets.Dataset, check class balance, and split into train and validation — applying Track 1's data-quality lessons in code.

  4. Tokenize and collate: model-ready batches with a loss mask

    Turn (prompt, completion) pairs into input_ids and labels with the prompt masked to -100, using the chat template — then pad them into batches with DataCollatorForSeq2Seq. This is Track 1's loss mask, in code.
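
As a rough illustration of the loss mask described in this lesson, the sketch below builds `labels` with the prompt positions set to -100 and pads a batch by hand. It uses made-up token IDs instead of a real tokenizer or chat template, and `build_example` and `pad_batch` are hypothetical helper names, not part of Transformers. The padding step only approximates what `DataCollatorForSeq2Seq` does (inputs padded with the pad token, labels padded with -100).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_example(prompt_ids, completion_ids):
    """Concatenate prompt and completion; hide the prompt from the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}


def pad_batch(examples, pad_token_id):
    """Right-pad a batch: inputs with the pad token, labels with -100."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ex in examples:
        n_real = len(ex["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(ex["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
        batch["labels"].append(ex["labels"] + [IGNORE_INDEX] * n_pad)
    return batch


# Made-up token IDs, for illustration only.
examples = [
    build_example(prompt_ids=[11, 12, 13], completion_ids=[21, 2]),
    build_example(prompt_ids=[11, 14], completion_ids=[22, 2]),
]
batch = pad_batch(examples, pad_token_id=0)
```

Only the completion tokens carry real label values, so only they contribute to the loss; prompt and padding positions are ignored.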

  5. A minimal LoRA fine-tune with the Trainer

    Attach a LoRA adapter with peft, configure TrainingArguments with the hyperparameters from Track 1, and run the Hugging Face Trainer to fine-tune SmolLM2 — the whole training step in about 30 lines.

  6. Run it: read the logs, the loss, the checkpoints

    Run the fine-tune and interpret what the Trainer prints: the training loss trajectory, the per-epoch validation loss, and where checkpoints land. Apply Track 1's loss-curve reading to a real run.

  7. Evaluate by hand: run the gold set, compute the metric

    Load the LoRA adapter onto the base model, run predictions over your held-out gold set, compute accuracy, and compare against the untuned base — turning Track 1's evaluation principles into a few lines of code.

  8. Merge the adapter, run inference, ship an artifact

    Merge the LoRA adapter into the base weights to produce a standalone model, save it as safetensors, reload it for plain inference, and understand your export options (including GGUF for edge deployment).

  9. Capstone A: fine-tune SmolLM2 end-to-end, by hand

    The full by-hand pipeline in one runnable script: load SmolLM2, build and tokenize data with a loss mask, train a LoRA adapter, evaluate against the base on a gold set, and merge to a shippable model — with success criteria and what to do if it falls short.

  10. SFT with TRL's SFTTrainer (the 20-line version)

    TRL's SFTTrainer is the SFT loop from Lesson 2.5, wrapped behind a clean API: chat template, loss mask, PEFT, all handled. The same fine-tune in about 20 lines, plus a note on what the wrapper hides that is still worth knowing.

  11. QLoRA hands-on with bitsandbytes

    QLoRA = a 4-bit quantized frozen base + LoRA on top. Configure BitsAndBytesConfig with NF4, double quantization and bf16 compute, call prepare_model_for_kbit_training, then train with SFTTrainer. Fit a model several times larger in the same VRAM, then re-evaluate quality honestly.

  12. Real metrics with sklearn & HF evaluate

    Graduate from Lesson 2.7's hand-rolled accuracy: per-class precision/recall/F1 with classification_report, the confusion matrix, macro vs micro F1 on imbalanced data, and HF evaluate for shared standards.
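
To see why macro and micro F1 diverge on imbalanced data, here is a minimal pure-Python sketch on a toy label set. The helper names and the data are invented for illustration. In practice you would likely call sklearn's `f1_score` with `average="macro"` or `average="micro"`, which should give the same numbers for this case.

```python
def per_class_counts(y_true, y_pred, label):
    """True positives, false positives and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    return tp, fp, fn


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    labels = sorted(set(y_true) | set(y_pred))
    counts = [per_class_counts(y_true, y_pred, lab) for lab in labels]
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1_from_counts(*c) for c in counts) / len(labels)
    # Micro: pool the counts first, so frequent classes dominate.
    tp, fp, fn = (sum(col) for col in zip(*counts))
    micro = f1_from_counts(tp, fp, fn)
    return macro, micro


# Toy imbalanced gold set: 8 negatives, 2 positives; the model always says "neg".
y_true = ["neg"] * 8 + ["pos"] * 2
y_pred = ["neg"] * 10
macro, micro = macro_micro_f1(y_true, y_pred)
```

The model never finds a positive, yet micro F1 stays at 0.8 because the majority class dominates the pooled counts. Macro F1 falls to about 0.44 because the positive class scores zero and counts equally.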

  13. Structured outputs with pydantic

    JSON-emitting SLMs are a top use case. Validate outputs with a Pydantic schema, compute the valid-JSON rate, and measure per-field accuracy on the parses that succeed — the honest two-number report.
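
The two-number report can be sketched as follows. To stay dependency-free, this version swaps the Pydantic schema for a plain `json.loads` plus type check; the schema, field names and sample outputs are all hypothetical. With Pydantic you would presumably validate each raw string against a model class and treat a validation error as a failed parse.

```python
import json

# Hypothetical schema: both fields are required and must be strings.
REQUIRED_FIELDS = {"sentiment": str, "topic": str}


def parse_or_none(raw):
    """Return the parsed dict if it is valid JSON matching the schema, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for name, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(obj.get(name), expected_type):
            return None
    return obj


def two_number_report(raw_outputs, gold):
    """Valid-JSON rate over all outputs; per-field accuracy over valid parses only."""
    parsed = [(parse_or_none(raw), g) for raw, g in zip(raw_outputs, gold)]
    valid = [(p, g) for p, g in parsed if p is not None]
    valid_rate = len(valid) / len(parsed)
    field_accuracy = {}
    for name in REQUIRED_FIELDS:
        if valid:
            hits = sum(p[name] == g[name] for p, g in valid)
            field_accuracy[name] = hits / len(valid)
        else:
            field_accuracy[name] = None  # nothing parsed, so accuracy is undefined
    return valid_rate, field_accuracy


# Invented model outputs and gold labels, for illustration only.
raw_outputs = [
    '{"sentiment": "positive", "topic": "billing"}',
    '{"sentiment": "negative", "topic": "shipping"}',
    '{"sentiment": "positive"',               # truncated JSON
    '{"sentiment": "negative", "topic": 7}',  # wrong type for "topic"
]
gold = [
    {"sentiment": "positive", "topic": "billing"},
    {"sentiment": "negative", "topic": "billing"},
    {"sentiment": "positive", "topic": "billing"},
    {"sentiment": "negative", "topic": "returns"},
]
valid_rate, field_accuracy = two_number_report(raw_outputs, gold)
```

Reporting the two numbers separately keeps a high per-field accuracy from hiding a low parse rate, and the reverse.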

  14. Multi-turn chat SFT

    Train on multi-turn conversations with the loss mask applied to every assistant turn (not just the last). Includes the mask-verification check and the honest framing of what multi-turn training does and doesn't teach.

  15. Project gallery: 6 SLM use cases as recipes

    Six concrete projects (sentiment, intent, JSON extractor, PII detector, FAQ assistant with RAG, tool-call generator) as recipes: dataset shape, scoring mode, and the one twist that matters.

  16. LLM-as-a-judge: scoring free-form outputs

    When the output is free-form and no exact-match metric applies. Pairwise A/B vs absolute scoring, the JSON-schema rubric pattern, the three classic biases (position, length, style), judge-model selection, and calibration against a small human-rated set — with the honest beat that LLM judges agree with each other more than with humans.
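
One common guard against position bias in pairwise judging is to ask the judge twice with the answers swapped and accept a winner only when both verdicts agree. The sketch below assumes a hypothetical `judge(prompt, first, second)` callable that returns `"first"` or `"second"`. The two toy judges stand in for a real judge model and exist only to show the behaviour.

```python
def pairwise_with_swap(judge, prompt, answer_a, answer_b):
    """Judge A vs B in both orders; return "A", "B", or "tie" on disagreement."""
    verdict_1 = judge(prompt, answer_a, answer_b)  # A shown first
    verdict_2 = judge(prompt, answer_b, answer_a)  # B shown first
    winner_1 = "A" if verdict_1 == "first" else "B"
    winner_2 = "B" if verdict_2 == "first" else "A"
    return winner_1 if winner_1 == winner_2 else "tie"


def always_first(prompt, first, second):
    """Toy judge with pure position bias: whatever is shown first wins."""
    return "first"


def longer_wins(prompt, first, second):
    """Toy judge with pure length bias: the longer answer wins."""
    return "first" if len(first) >= len(second) else "second"


position_result = pairwise_with_swap(always_first, "q", "short", "a much longer answer")
length_result = pairwise_with_swap(longer_wins, "q", "short", "a much longer answer")
```

The position-biased judge contradicts itself across the two orders, so the result collapses to a tie. The length-biased judge picks B both times, which shows that swapping does nothing for length or style bias; those need other checks, such as calibration against human ratings.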

  17. Public benchmarks & lm-evaluation-harness

    MMLU, HellaSwag, ARC, GSM8K and EleutherAI's lm-evaluation-harness as smoke checks — not as the truth. Few-shot vs zero-shot, how to interpret the numbers, why your gold set still matters more, and the data-contamination caveats that quietly distort headline scores.

  18. Experiment tracking with MLflow & W&B

    Why ad-hoc training loses to bookkeeping every time. MLflow vs W&B vs TensorBoard, what to log on every run (config, code SHA, dataset hash, per-step loss + eval, system metrics, artefacts), and the reproducible-vs-replicable distinction. Closes Track 2.
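
As a tracker-agnostic sketch of what to log on every run, the snippet below computes a dataset hash and assembles a run record using only the standard library. `dataset_hash` and `run_record` are hypothetical helpers, and the config values and code SHA are placeholders. In a real run the SHA would typically come from `git rev-parse HEAD`, and the record would be sent to MLflow or W&B instead of being kept as a dict.

```python
import hashlib
import json


def dataset_hash(rows):
    """SHA-256 over a canonical JSON serialisation of the dataset rows."""
    canonical = json.dumps(rows, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def run_record(config, rows, code_sha):
    """Bundle the facts needed to tell two runs apart later."""
    return {
        "config": config,
        "code_sha": code_sha,
        "dataset_sha256": dataset_hash(rows),
        "n_rows": len(rows),
    }


# Invented rows and hyperparameters, for illustration only.
rows = [
    {"prompt": "Great product!", "completion": "positive"},
    {"prompt": "Arrived broken.", "completion": "negative"},
]
record = run_record(
    config={"lr": 2e-4, "epochs": 3, "lora_r": 8},  # placeholder values
    rows=rows,
    code_sha="<git commit SHA>",                    # placeholder, not a real SHA
)

# Changing a single label changes the hash, so silent data edits show up.
edited = [dict(rows[0], completion="negative"), rows[1]]
```

Because keys are sorted before hashing, the same rows give the same hash regardless of key order, while any change to content or row order produces a different one.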