Track 0 · Foundations · Lesson 13

Architecture taxonomy: encoder, encoder-decoder, decoder-only

After this lesson you can read a Hugging Face model card, identify which of the three Transformer families it belongs to, and explain when each is the right starting point — including the cases where a decoder-only LLM is the wrong answer.

Level: beginner · Read time: ~8 min · Prerequisites: From n-grams to Transformers

Every modern language model is built from Transformer blocks, but the blocks can be wired together in three different ways. The difference is how attention flows. That single choice, fixed at pretraining time, decides what the model is good at, which training objective it learned, and which tasks you should reach for it on. The three families are encoder-only, encoder-decoder, and decoder-only.

Encoder-only: read the whole thing at once

In an encoder-only model, every token can attend to every other token — past and future. There is no left-to-right constraint. The model reads the whole input in parallel and produces a rich contextual representation for each position.
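The "who can attend to whom" rule can be drawn as a grid. The sketch below is illustrative only: `attention_mask` and `show` are hypothetical helper names, not part of any library. It builds the all-to-all pattern described above and, for contrast, the left-to-right (causal) pattern that decoders use.

```python
def attention_mask(seq_len, causal):
    """Return a seq_len x seq_len grid of booleans.

    mask[i][j] is True when position i is allowed to attend to position j.
    """
    return [
        [(j <= i) if causal else True for j in range(seq_len)]
        for i in range(seq_len)
    ]


def show(mask):
    """Render the grid as text: 'x' = may attend, '.' = blocked."""
    return "\n".join(
        " ".join("x" if allowed else "." for allowed in row) for row in mask
    )


bidirectional = attention_mask(4, causal=False)  # encoder-style: every cell allowed
causal = attention_mask(4, causal=True)          # decoder-style: lower triangle only

print("bidirectional:")
print(show(bidirectional))
print("causal:")
print(show(causal))
```

In the bidirectional grid every cell is allowed. In the causal grid, row i is allowed only up to column i, so the first token sees only itself and the last token sees everything before it.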

Encoder-only models are pretrained with masked language modelling (MLM): randomly hide ~15% of the tokens in a sentence and ask the model to recover them using the unmasked context on both sides. Because the model gets to look in both directions, MLM produces dense, deeply contextual representations.
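As a rough sketch of how MLM training pairs could be built, the code below hides each token with probability 0.15 and records the original token as the label. It is simplified: the `[MASK]` string follows BERT's convention, `mask_tokens` is a hypothetical helper, and real recipes add details this sketch leaves out (BERT, for example, sometimes swaps in a random token or keeps the original instead of always inserting the mask).

```python
import random

MASK = "[MASK]"  # BERT-style placeholder; other tokenizers use different strings
IGNORE = None    # label for positions that do not contribute to the loss


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide each token independently with probability mask_prob.

    Returns (inputs, labels): inputs has MASK at hidden positions,
    labels holds the original token there and IGNORE everywhere else.
    """
    rng = random.Random(seed)  # seeded so the example is repeatable
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)      # the model must recover this token
        else:
            inputs.append(tok)
            labels.append(IGNORE)   # visible token, nothing to predict
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens)
print(inputs)
print(labels)
```

The loss is computed only at the hidden positions, which is why every visible position carries the ignore label.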

Famous examples: BERT, RoBERTa, DeBERTa, and the modern embedding-model family (BGE, E5).

Encoder-decoder: input in, output out

An encoder-decoder model has two halves. The encoder is a bidirectional Transformer that reads the input. The decoder is a causal Transformer that generates the output, attending both to the encoder's representation (cross-attention) and to its own previous tokens (self-attention).

Training objectives vary by model: T5 uses span corruption (predict deleted spans of the input); BART uses general denoising (shuffle, delete, mask, reconstruct); both teach the encoder to read and the decoder to generate a target sequence.
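To make span corruption concrete, here is a minimal sketch in the style of T5: each deleted span is replaced in the input by a sentinel token, and the target lists each sentinel followed by the tokens it replaced. The spans are chosen by hand here (real pretraining samples them randomly), and `span_corrupt` is a hypothetical helper name. The `<extra_id_N>` sentinel strings follow the T5 tokenizer's convention.

```python
def span_corrupt(tokens, spans):
    """Build an (encoder input, decoder target) pair by deleting spans.

    spans: sorted, non-overlapping (start, end) index pairs, end exclusive.
    """
    inputs, targets = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        inputs.extend(tokens[cursor:start])  # keep text before the span
        inputs.append(sentinel)              # one sentinel stands in for the span
        targets.append(sentinel)
        targets.extend(tokens[start:end])    # decoder must produce the deleted tokens
        cursor = end
    inputs.extend(tokens[cursor:])           # keep text after the last span
    targets.append(f"<extra_id_{len(spans)}>")  # closing sentinel ends the target
    return inputs, targets


tokens = "Thank you for inviting me to your party last week".split()
inputs, targets = span_corrupt(tokens, [(2, 4), (8, 9)])
print(" ".join(inputs))
print(" ".join(targets))
```

The encoder reads the corrupted input with bidirectional attention, and the decoder generates the target left to right, so one training pair exercises both halves of the model.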

Famous examples: T5, FLAN-T5, BART, and the original Transformer, which was built for machine translation.

Decoder-only: predict the next token, over and over

A decoder-only model uses causal (autoregressive) attention: each token can only attend to tokens that came before it. Training uses the simplest possible objective: predict the next token on raw text, with no masked-out input tokens, no encoder, and no special structure.
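The next-token objective can be sketched by listing the training examples a single sequence yields. This is an illustration only, and `next_token_examples` is a hypothetical helper: each example pairs a prefix with the token that follows it. In practice all of these predictions are computed in one forward pass, with the causal attention pattern keeping each position from seeing its own answer.

```python
def next_token_examples(tokens):
    """Return (context, target) pairs: every prefix predicts its next token."""
    return [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]


tokens = ["The", "cat", "sat", "down"]
examples = next_token_examples(tokens)
for context, target in examples:
    print(context, "->", target)
```

A sequence of four tokens gives three prediction targets, and no labels are needed beyond the raw text itself.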

Famous examples: GPT family, Llama, Qwen, Mistral, Phi, SmolLM2 — every modern chat model the Academy uses.
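When reading a model card, the repository's `config.json` often hints at the family. The sketch below is a heuristic, not an official API: `guess_family` is a hypothetical helper, and the example configs are hand-written fragments rather than copies of real files. It assumes the `is_encoder_decoder` and `architectures` fields that Hugging Face configs commonly carry. Checkpoints that list only a bare backbone class fall through to "unknown", so treat the result as a starting point and confirm it against the model card.

```python
def guess_family(config):
    """Guess the Transformer family from a parsed config.json dictionary."""
    if config.get("is_encoder_decoder", False):
        return "encoder-decoder"
    archs = config.get("architectures") or []
    # Causal LM heads: e.g. names ending in "ForCausalLM" or "LMHeadModel".
    if any(a.endswith(("ForCausalLM", "LMHeadModel")) for a in archs):
        return "decoder-only"
    # Masked LM heads indicate a bidirectional encoder.
    if any(a.endswith("ForMaskedLM") for a in archs):
        return "encoder-only"
    return "unknown"


# Hand-written, illustrative config fragments (not real files).
bert_like = {"architectures": ["BertForMaskedLM"]}
t5_like = {"architectures": ["T5ForConditionalGeneration"], "is_encoder_decoder": True}
llama_like = {"architectures": ["LlamaForCausalLM"]}
backbone_only = {"architectures": ["BertModel"]}

for cfg in (bert_like, t5_like, llama_like, backbone_only):
    print(cfg["architectures"][0], "->", guess_family(cfg))
```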

Why decoder-only won for chat

Three factors made the decoder-only family dominate general-purpose chat between 2020 and now:

When to still pick the others

The decoder-only default is right most of the time. It is wrong when the task does not actually need generation.

Honest beat — don't fine-tune Llama for a classifier

The temptation to use one big decoder-only LLM for everything is real. For a classification or extraction problem with a fixed label set, an encoder-only model with a classification head will train faster, run faster, and often score higher than fine-tuning a 1B-parameter chat model into the same shape. Knowing which family to reach for is half the engineering.
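To show how small a classification head is, here is a toy sketch in plain Python. It takes per-token vectors (standing in for an encoder's output), averages them, and applies one linear layer to get a score per label. Everything here is an assumption for illustration: the vectors, weights, and labels are made up, `mean_pool` and `classify` are hypothetical names, and mean pooling is only one of several pooling choices (some models use a dedicated first-token representation instead).

```python
def mean_pool(token_vectors):
    """Average the per-token vectors into one fixed-size sentence vector."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


def classify(token_vectors, weights, biases, labels):
    """Linear head: one row of weights (plus a bias) per label."""
    pooled = mean_pool(token_vectors)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    best = max(range(len(labels)), key=lambda i: logits[i])
    return labels[best], logits


# Made-up numbers: two tokens, two-dimensional representations.
token_vectors = [[1.0, 0.0], [3.0, 2.0]]
weights = [[1.0, -1.0], [-1.0, 1.0]]
biases = [0.0, 0.0]
labels = ["positive", "negative"]

label, logits = classify(token_vectors, weights, biases, labels)
print(label, logits)
```

The head produces one score per label in a single pass, with no token-by-token generation, which is part of why this setup tends to be cheap to train and run for fixed label sets.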

What this Academy assumes

Every track from here forward (SFT fundamentals, Hands-on, With BrewSLM, Advanced) assumes a decoder-only causal LM. The mechanics we teach (the loss mask, chat templates, KV cache, completion-only loss) all live in that family. Most production SLM work uses decoder-only models, so this is the right default. But know the other two exist, know what they're for, and reach for them when the task fits.

Key idea

Three families, one decision — the attention pattern at pretraining. Encoder-only (bidirectional, MLM) for understanding; encoder-decoder (read then generate) for input-to-output transformation; decoder-only (causal, next-token) for chat, code, and any generation that benefits from in-context learning. We use decoder-only — but if the task doesn't need generation, the others are still the right tool.

You have now finished the Foundations track. From here you understand what a model is, how it learns, how text becomes numbers, attention and the Transformer, next-token prediction, the four levers (prompting / RAG / fine-tuning / pretraining), large vs small, the SLM project shape, the base-vs-instruct distinction, how to pick a base, the lineage from n-grams to Transformers, and the three architecture families. Track 1 goes deep on supervised fine-tuning itself.

Key terms

Encoder-only
Transformer model with bidirectional attention; reads input in parallel and produces contextual representations. BERT, RoBERTa, BGE, E5.
Encoder-decoder
Two-half model — a bidirectional encoder reads input, a causal decoder generates output with cross-attention to the encoder. T5, BART.
Decoder-only
Transformer model with causal attention; each token only attends to previous tokens. GPT, Llama, Qwen, SmolLM2. The chat default.
Masked language modelling (MLM)
Pretraining objective for encoder-only models — randomly mask tokens, predict them from bidirectional context.
Span corruption / denoising
Encoder-decoder pretraining objectives — delete or corrupt spans of input, train the decoder to reconstruct them.
Causal (autoregressive) attention
Attention masked so each token can only see previous tokens; the defining feature of decoder-only models.
In-context learning
The ability, most prominent in large decoder-only models, to pick up a pattern from examples in the prompt without any weight updates; it emerges at scale.
