Track 0 · Foundations

Small language model foundations

Start here if you want to fine-tune or train a small language model without hand-waving. These lessons build the concepts behind base models, tokens, attention, next-token prediction, RAG vs SFT, and how to choose the right starting point for your domain.


Track overview · video

A walkthrough of the whole track. The lessons below go deeper.

1. What is a model?
2. How models learn: loss & gradient descent
Training a model means adjusting its parameters to reduce a loss. This lesson makes 'reduce the loss' precise with gradient descent, learning rate, minibatches, and epochs — the engine behind every model you will train.
3. Neural networks in one page
A neural network stacks simple units — weighted sums plus a nonlinear activation — into layers that can model complex patterns. This lesson explains neurons, layers, the forward pass, and what backpropagation does.
4. From text to numbers: tokens & embeddings
Language models do arithmetic on numbers, not letters. This lesson explains tokenization (splitting text into subword tokens), token IDs, and embeddings (the learned vectors that represent each token) — plus special tokens and the context window.
5. Attention & the Transformer, gently
Attention lets each token pull in information from the other tokens that matter. This lesson explains self-attention, multi-head attention, and the Transformer block, the architecture behind nearly every modern language model, without the heavy math.
6. How language models work: next-token prediction
A language model turns a sequence of tokens into a probability for every possible next token, then generates text by sampling one and repeating. This lesson covers logits, softmax, autoregressive generation, and decoding settings like temperature and top-p.
7. Pretraining vs fine-tuning vs prompting vs RAG
Four ways to make a language model do what you want: prompting, retrieval-augmented generation (RAG), supervised fine-tuning, and continued pretraining. This lesson explains what each changes, what it costs, and how to choose — with a decision guide.
8. LLMs vs SLMs: scale, cost, latency
Large and small language models differ mainly in parameter count, which drives memory, latency, and cost. This lesson covers the trade-offs and the core thesis: on a narrow task, a fine-tuned small model can rival a model hundreds of times larger.
9. The mental model of an SLM project
A small-model project is a loop: data, train, evaluate, iterate, ship, monitor. This lesson lays out each stage, what 'good' looks like, why the gold set is your north star, and why iteration on data — not the model — is the real work.
10. Base vs instruct models
A model family typically ships in two flavours: a base version trained on raw text, and an instruct version that has been aligned to follow instructions in a chat template. Pick the wrong one and you spend your fine-tune teaching the model behaviour the other flavour already had.
11. Picking a base model
There is no universal 'best' SLM, only constraints: license, tokenizer family, instruct quality, context length, size and latency, and community. Walk the checklist, and the smallest model that meets your constraints is the right starting point.
12. From n-grams to Transformers: a brief history
Language models did not start with Transformers. n-gram count tables, neural LMs with shared embeddings, RNNs and LSTMs, the 2017 Transformer — each step fixed the previous bottleneck. The lineage is why modern fine-tuning works.
13. Architecture taxonomy: encoder, encoder-decoder, decoder-only
Three Transformer families: encoder-only (BERT) for understanding, encoder-decoder (T5) for input-to-output transformation, decoder-only (GPT, Llama, SmolLM2) for generation. Why decoder-only won for chat — and when the others are still the right tool.
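
A taste of what lesson 6 covers, as a minimal sketch rather than the lesson's own code: turning logits into probabilities with softmax, reshaping them with temperature and top-p, and generating autoregressively by sampling one token and repeating. The function names (`softmax`, `top_p_filter`, `sample_next_token`, `generate`) and the `logits_fn` stand-in for a real model are assumptions made for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn raw logits into probabilities; lower temperature sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the most likely tokens until their cumulative mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    # Renormalise so the kept probabilities sum to 1 again.
    return {i: probs[i] / mass for i in kept}


def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Sample one token ID from the temperature-scaled, top-p-filtered distribution."""
    candidates = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(candidates)
    weights = [candidates[i] for i in ids]
    return rng.choices(ids, weights=weights, k=1)[0]


def generate(logits_fn, prompt_ids, max_new_tokens, **sampling):
    """Autoregressive loop: predict, sample, append, repeat.

    logits_fn is a hypothetical stand-in for a model: it takes the token IDs
    so far and returns one logit per vocabulary entry.
    """
    tokens = list(prompt_ids)
    for _ in range(max_new_tokens):
        tokens.append(sample_next_token(logits_fn(tokens), **sampling))
    return tokens
```

With a real model, `logits_fn` would be a forward pass over the tokens generated so far. Here any function that returns one logit per vocabulary entry is enough to exercise the loop.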