The mental model of an SLM project
After this lesson you can describe the end-to-end loop of a fine-tuning project, name what each stage produces, explain why the gold set is your single source of truth, and see how the rest of the Academy maps onto this loop.
You now understand the parts. This lesson assembles them into the process you'll repeat for the rest of the course. Training a model is not a one-shot event; it's a loop you go around several times, and most of the work is not in the training step at all — it's in the data and the evaluation around it.
The loop
A fine-tuning project has six stages, and the last one usually sends you back to the start:
- Data. Gather and prepare examples of your task: inputs paired with the outputs you want. Clean them, format them, split them. This is where most of your time goes and where most of your quality comes from.
- Train. Run supervised fine-tuning (usually with LoRA) on a base model. This is the gradient-descent loop from Lesson 2, now over a Transformer.
- Evaluate. Measure the trained model against held-out examples it never trained on, using a metric that fits the task. This tells you whether you actually improved.
- Iterate. Look at where it fails, fix the data (add examples for weak cases, remove bad ones, rebalance), and retrain. Repeat until the metric clears your bar.
- Export & deploy. Package the model into a servable artifact and stand it up behind an endpoint, with a version you can roll back.
- Monitor. Watch real-world inputs for drift: once live data stops resembling your training data, quality slips and you loop back to the Data stage.
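The control flow of the train, evaluate, and iterate stages can be sketched in a few lines of Python. This is a toy under loud assumptions, not a real fine-tuning run: the "model" is just a lookup table keyed on the first word of the input, and `train`, `evaluate`, `improve_data`, `run_loop`, the `pool` of extra labelled examples, and the 0.9 bar are all hypothetical names and values chosen for illustration. What it does show is the shape of the loop: retrain, measure on examples held out from training, and change only the data between rounds.

```python
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (input text, expected label)


def train(train_data: List[Example]) -> Dict[str, str]:
    """Toy stand-in for fine-tuning: remember the label seen for each first word."""
    return {text.split()[0]: label for text, label in train_data}


def evaluate(model: Dict[str, str], gold_set: List[Example]) -> Tuple[float, List[Example]]:
    """Return accuracy on the gold set plus the gold examples the model got wrong."""
    failures = [(text, label) for text, label in gold_set
                if model.get(text.split()[0]) != label]
    return 1 - len(failures) / len(gold_set), failures


def improve_data(train_data: List[Example], failures: List[Example],
                 pool: List[Example]) -> List[Example]:
    """Add pool examples that cover the failing cases, never the gold examples themselves."""
    weak = {text.split()[0] for text, _ in failures}
    extra = [ex for ex in pool if ex[0].split()[0] in weak and ex not in train_data]
    return train_data + extra


def run_loop(train_data: List[Example], gold_set: List[Example], pool: List[Example],
             bar: float = 0.9, max_rounds: int = 5) -> Tuple[Dict[str, str], List[float]]:
    history: List[float] = []
    model: Dict[str, str] = {}
    for _ in range(max_rounds):
        model = train(train_data)                      # Train
        score, failures = evaluate(model, gold_set)    # Evaluate
        history.append(score)
        if score >= bar:                               # Metric clears the bar: stop
            break
        train_data = improve_data(train_data, failures, pool)  # Iterate on the data
    return model, history


gold = [("refund my order", "billing"),
        ("reset my password", "account"),
        ("cancel my plan", "billing")]
seed = [("refund please", "billing")]
pool = [("reset login", "account"), ("cancel subscription", "billing")]

model, history = run_loop(seed, gold, pool)
print(history)  # one score per round
```

In this toy the first round fails two of the three gold examples, the data step pulls in matching examples from the pool, and the second round clears the bar. The gold examples guide which data to add but are never copied into the training set.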
Your north star: the gold set
The most important artifact in the whole loop is the gold set — a curated batch of examples, with known-correct answers, that the model never trains on. It exists solely to measure quality. Every decision ("is this version better?", "did this data change help?", "is it good enough to ship?") is answered against the gold set. If you train on your evaluation data, your numbers become fiction — the model can memorize the test. Keeping the gold set separate and trustworthy is the discipline that makes all your metrics meaningful.
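One simple way to enforce that separation is to check for overlap before every training run. The sketch below is a minimal example, assuming examples are stored as `(input, output)` pairs; `normalize` and `find_leaks` are hypothetical helper names. It only catches inputs that match after lowercasing and whitespace cleanup, so paraphrased or near-duplicate leaks would need a fuzzier comparison.

```python
from typing import List, Tuple

Example = Tuple[str, str]  # (input text, expected output)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_leaks(train_data: List[Example], gold_set: List[Example]) -> List[str]:
    """Return gold-set inputs that also appear (after normalization) in the training data."""
    train_inputs = {normalize(text) for text, _ in train_data}
    return [text for text, _ in gold_set if normalize(text) in train_inputs]


train_data = [("Refund my order", "billing"), ("reset  my password", "account")]
gold_set = [("Reset my password", "account"), ("Cancel my plan", "billing")]

leaks = find_leaks(train_data, gold_set)
print(leaks)  # ['Reset my password']
```

A non-empty result means the affected examples should be removed from the training data (not from the gold set) before any metric is reported.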
Key idea
Modern model-building is data-centric: you usually improve results far more by fixing data than by fiddling with the model. The loop's center of gravity is "evaluate → understand failures → improve data," not "tweak hyperparameters."
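The "understand failures" step often starts with a breakdown of errors by slice, so you know which data to fix first. The following is a rough sketch that assumes each gold example carries a category tag and that you have already recorded whether the model got it right; the `failure_rates` helper and the category names are made up for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def failure_rates(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Given (category, is_correct) pairs, return the failure rate per category."""
    totals: Counter = Counter()
    wrong: Counter = Counter()
    for category, is_correct in results:
        totals[category] += 1
        if not is_correct:
            wrong[category] += 1
    return {category: wrong[category] / totals[category] for category in totals}


results = [
    ("billing", True), ("billing", True),
    ("account", False), ("account", True),
    ("shipping", False), ("shipping", False),
]

rates = failure_rates(results)
worst_first = sorted(rates, key=rates.get, reverse=True)
print(worst_first)  # ['shipping', 'account', 'billing']
```

The categories at the front of the list are where adding, correcting, or rebalancing training examples is most likely to pay off.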
What "good" looks like at each stage
- Good data: representative of real inputs, correctly labelled, deduplicated, balanced across cases, with a clean train/validation/test split and a trustworthy gold set.
- A good training run: a loss that descends and stabilizes, no divergence, a checkpoint saved — and a model that does better on the gold set than the base did.
- Good evaluation: a metric that matches the task shape, reported honestly — including where the model still loses.
- A good deployment: a versioned, reproducible artifact you can roll back, with monitoring that will tell you when reality drifts.
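Two of the data criteria above, deduplication and a clean split, are easy to sketch. The example below is one possible approach, not a prescribed one: it deduplicates on a normalized form of the input and assigns each example to a split by hashing that same form, so an example lands in the same split on every rerun. The helper names and the 80/10/10 proportions are assumptions. `hashlib` is used because Python's built-in `hash()` for strings varies between processes by default.

```python
import hashlib
from typing import List, Tuple

Example = Tuple[str, str]  # (input text, expected output)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def dedupe(examples: List[Example]) -> List[Example]:
    """Keep the first example for each normalized input."""
    seen = set()
    unique = []
    for text, output in examples:
        key = normalize(text)
        if key not in seen:
            seen.add(key)
            unique.append((text, output))
    return unique


def assign_split(text: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map an input to 'train', 'validation', or 'test' using a stable hash bucket (0-99)."""
    digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


examples = [
    ("Refund my order", "billing"),
    ("refund  my ORDER", "billing"),   # duplicate after normalization
    ("Reset my password", "account"),
]

unique = dedupe(examples)
splits = {text: assign_split(text) for text, _ in unique}
print(len(unique), splits)
```

Because the split depends only on the input text, adding new examples later never moves an existing example from the training portion into a held-out portion or back.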
How the rest of the Academy maps to this loop
Everything ahead is this loop, at increasing depth:
- Track 1 — SFT fundamentals: the why and what of each stage — datasets, tokenization, the training loop, LoRA, evaluation — in detail.
- Track 2 — Hands-on: you run the whole loop by hand in PyTorch and Transformers, with no platform, so nothing is magic.
- Track 3 — With BrewSLM: the same loop, with the platform automating the tedious parts (import, gates, eval packs, deployment) so you can iterate faster.
- Track 4 — Graduating: advanced moves that bend the quality/cost curve — distillation, warm starts, hyperparameter bake-offs, and proving your SLM is good enough versus a frontier model.
That completes the foundations. You can now talk about what a model is, how it learns, how language models produce text, the levers for steering them, and the shape of a real project. Next, Track 1 makes Supervised Fine-Tuning precise — starting with what SFT actually is, and when not to use it.
Key terms
- Project lifecycle: The loop: data → train → evaluate → iterate → export/deploy → monitor.
- Gold set: Curated, known-correct examples the model never trains on; the source of truth for quality.
- Train/validation/test split: Partitioning data so you train on one part and measure on data the model hasn't seen.
- Data-centric iteration: Improving results primarily by fixing data, guided by failure analysis.
- Deployment: Packaging and serving the model as a versioned, rollback-able artifact.
- Drift: When live inputs stop resembling training data, degrading quality over time.