Foundations
Start here. Thirteen lessons that build the mental model — what a model is, how it learns, tokens, attention, next-token prediction, and where fine-tuning fits — with no code required.
- 1. What is a model?
- 2. How models learn: loss & gradient descent
Training a model means adjusting its parameters to reduce a loss. This lesson makes 'reduce the loss' precise with gradient descent, learning rate, minibatches, and epochs — the engine behind every model you will train.
- 3. Neural networks in one page
A neural network stacks simple units — weighted sums plus a nonlinear activation — into layers that can model complex patterns. This lesson explains neurons, layers, the forward pass, and what backpropagation does.
- 4. From text to numbers: tokens & embeddings
Language models do arithmetic on numbers, not letters. This lesson explains tokenization (splitting text into subword tokens), token IDs, and embeddings (the learned vectors that represent each token) — plus special tokens and the context window.
- 5. Attention & the Transformer, gently
Attention lets each token pull in information from the other tokens that matter. This lesson explains self-attention, multi-head attention, and the Transformer block (the architecture behind nearly every modern language model) without the heavy math.
- 6. How language models work: next-token prediction
A language model turns a sequence of tokens into a probability for every possible next token, then generates text by sampling one and repeating. This lesson covers logits, softmax, autoregressive generation, and decoding settings like temperature and top-p.
- 7. Pretraining vs fine-tuning vs prompting vs RAG
Four ways to make a language model do what you want: prompting, retrieval-augmented generation (RAG), supervised fine-tuning, and continued pretraining. This lesson explains what each changes, what it costs, and how to choose — with a decision guide.
- 8. LLMs vs SLMs: scale, cost, latency
Large and small language models differ mainly in parameter count, which drives memory, latency, and cost. This lesson covers the trade-offs and the core thesis: on a narrow task, a fine-tuned small model can rival a model hundreds of times larger.
- 9. The mental model of an SLM project
A small-model project is a loop: data, train, evaluate, iterate, ship, monitor. This lesson lays out each stage, what 'good' looks like, why the gold set is your north star, and why iteration on data — not the model — is the real work.
- 10. Base vs instruct models
Most model families ship in two flavours: a base version trained on raw text, and an instruct version that has been aligned to follow instructions in a chat template. Pick the wrong one and you waste your fine-tune teaching the model what it should have started with.
- 11. Picking a base model
There is no universal 'best' SLM — there are constraints. License, tokenizer family, instruct quality, context length, size and latency, community. Walk the checklist and the smallest model that meets your constraints is the right starting point.
- 12. From n-grams to Transformers: a brief history
Language models did not start with Transformers. n-gram count tables, neural LMs with shared embeddings, RNNs and LSTMs, the 2017 Transformer — each step fixed the previous bottleneck. The lineage is why modern fine-tuning works.
- 13. Architecture taxonomy: encoder, encoder-decoder, decoder-only
Three Transformer families: encoder-only (BERT) for understanding, encoder-decoder (T5) for input-to-output transformation, decoder-only (GPT, Llama, SmolLM2) for generation. Why decoder-only won for chat — and when the others are still the right tool.