Track 0 · Foundations

Start here. Thirteen lessons that build the mental model — what a model is, how it learns, tokens, attention, next-token prediction, and where fine-tuning fits — with no code required.

  1. What is a model?

  2. How models learn: loss & gradient descent

    Training a model means adjusting its parameters to reduce a loss. This lesson makes 'reduce the loss' precise with gradient descent, learning rate, minibatches, and epochs — the engine behind every model you will train.

  3. Neural networks in one page

    A neural network stacks simple units — weighted sums plus a nonlinear activation — into layers that can model complex patterns. This lesson explains neurons, layers, the forward pass, and what backpropagation does.

  4. From text to numbers: tokens & embeddings

    Language models do arithmetic on numbers, not letters. This lesson explains tokenization (splitting text into subword tokens), token IDs, and embeddings (the learned vectors that represent each token) — plus special tokens and the context window.

  5. Attention & the Transformer, gently

    Attention lets each token pull in information from the other tokens that matter. This lesson explains self-attention, multi-head attention, and the Transformer block (the architecture behind nearly every modern language model) without the heavy math.

  6. How language models work: next-token prediction

    A language model turns a sequence of tokens into a probability for every possible next token, then generates text by sampling one and repeating. This lesson covers logits, softmax, autoregressive generation, and decoding settings like temperature and top-p.

  7. Pretraining vs fine-tuning vs prompting vs RAG

    Four ways to make a language model do what you want: prompting, retrieval-augmented generation (RAG), supervised fine-tuning, and continued pretraining. This lesson explains what each changes, what it costs, and how to choose — with a decision guide.

  8. LLMs vs SLMs: scale, cost, latency

    Large and small language models differ mainly in parameter count, which drives memory, latency, and cost. This lesson covers the trade-offs and the core thesis: on a narrow task, a fine-tuned small model can rival a model hundreds of times larger.

  9. The mental model of an SLM project

    A small-model project is a loop: data, train, evaluate, iterate, ship, monitor. This lesson lays out each stage, what 'good' looks like, why the gold set is your north star, and why iteration on data — not the model — is the real work.

  10. Base vs instruct models

    Most model families ship in two flavours: a base version trained on raw text, and an instruct version that has been further trained to follow instructions in a chat template. Pick the wrong one and you waste your fine-tune teaching the model what it should have started with.

  11. Picking a base model

    There is no universal 'best' SLM, only constraints: license, tokenizer family, instruct quality, context length, size and latency, and community. Walk the checklist; the smallest model that meets your constraints is the right starting point.

  12. From n-grams to Transformers: a brief history

    Language models did not start with Transformers. n-gram count tables, neural LMs with shared embeddings, RNNs and LSTMs, the 2017 Transformer — each step fixed the previous bottleneck. The lineage is why modern fine-tuning works.

  13. Architecture taxonomy: encoder, encoder-decoder, decoder-only

    Three Transformer families: encoder-only (BERT) for understanding, encoder-decoder (T5) for input-to-output transformation, and decoder-only (GPT, Llama, SmolLM2) for generation. This lesson explains why decoder-only won for chat and when the others are still the right tool.