What distinguishes encoder-only models from decoder-only models?

Encoder-only models use bidirectional attention (every token sees every other) and are trained with masked language modeling; decoder-only models use causal attention (each token only sees previous tokens) and are trained on next-token prediction.

When should you still pick an encoder-only model in 2026?

When the task doesn't need generation — text classification with a fixed label set, named-entity recognition, or building sentence embeddings for retrieval. An encoder-only model with a classification head is usually cheaper and often better than fine-tuning a decoder-only LLM.

Track 0 · Foundations · Lesson 13

Architecture taxonomy: encoder, encoder-decoder, decoder-only

After this lesson you can read a Hugging Face model card, identify which of the three Transformer families it belongs to, and explain when each is the right starting point — including the cases where a decoder-only LLM is the wrong answer.

Level: beginner · Read time: ~8 min · Prerequisites: From n-grams to Transformers

Every modern language model is built from Transformer blocks, but the blocks can be wired together in three different ways. The difference is how attention flows. That single choice — set at pretraining time — decides what the model is good at, what training objective it learned, and what you should reach for it to do. Three families: encoder-only, encoder-decoder, decoder-only.
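One rough way to tell the families apart when reading a model card is to look at the model's config.json. The sketch below is a heuristic, not an official Hugging Face API. guess_family is a hypothetical helper, and it assumes the config carries the architectures list and the is_encoder_decoder flag that Transformers configs commonly include. The example dicts are hand-written for illustration, and some models (multimodal ones especially) will not follow these naming patterns.

```python
def guess_family(config: dict) -> str:
    """Heuristically guess the Transformer family from a config.json-style dict.

    guess_family is a hypothetical helper for illustration only.
    """
    architectures = config.get("architectures") or []

    # Encoder-decoder configs usually set this flag explicitly.
    if config.get("is_encoder_decoder", False):
        return "encoder-decoder"

    # Causal LM class names commonly end in ForCausalLM (or LMHeadModel for GPT-2).
    if any(name.endswith(("ForCausalLM", "LMHeadModel")) for name in architectures):
        return "decoder-only"

    # Encoder-only checkpoints pretrained with MLM commonly ship a ForMaskedLM class.
    if any(name.endswith("ForMaskedLM") for name in architectures):
        return "encoder-only"

    return "unknown"


# Hand-written example configs (assumed shapes, trimmed to the relevant keys).
examples = {
    "bert-like": {"architectures": ["BertForMaskedLM"]},
    "t5-like": {"architectures": ["T5ForConditionalGeneration"], "is_encoder_decoder": True},
    "llama-like": {"architectures": ["LlamaForCausalLM"]},
}

for name, cfg in examples.items():
    print(name, "->", guess_family(cfg))
```

If the helper returns "unknown", fall back to the model card text: the pretraining objective it describes (masked tokens, span corruption, or next-token prediction) is the more reliable signal.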

Encoder-only: read the whole thing at once

In an encoder-only model, every token can attend to every other token — past and future. There is no left-to-right constraint. The model reads the whole input in parallel and produces a rich contextual representation for each position.

Encoder-only models are pretrained with masked language modelling (MLM): randomly hide ~15% of the tokens in a sentence and ask the model to recover them using the unmasked context on both sides. Because the model can look in both directions, MLM produces dense, deeply contextual representations.
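To make the masking step concrete, here is a minimal sketch of how MLM training pairs could be built. It is a simplification under stated assumptions: mask_tokens is a hypothetical helper, it works on whitespace-split words rather than real subword tokens, and it always swaps a selected token for [MASK], whereas BERT's actual recipe sometimes substitutes a random token or leaves the token unchanged.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (inputs, labels) for a simplified masked-language-modelling example.

    Hypothetical helper: labels hold the original token at masked positions
    and None elsewhere, so only masked positions would contribute to the loss.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)   # hide the token from the model
            labels.append(tok)    # the model must recover it
        else:
            inputs.append(tok)    # visible context, from either side
            labels.append(None)   # not scored
    return inputs, labels


tokens = "the quick brown fox jumps over the lazy dog".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.3, seed=1)
print(inputs)
print(labels)
```

Note that a masked position in the middle of the sentence still has visible words on both sides, which is exactly what bidirectional attention lets the model use.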

Famous examples: BERT, RoBERTa, DeBERTa, and the modern embedding-model family (BGE, E5).

Great at: classification (sentiment, intent, topic), named-entity recognition, sentence embeddings for search/retrieval, fill-in-the-blank tasks.
Bad at: free-form generation. The training objective never asked the model to produce a sequence one token at a time; it can't natively do that.

Encoder-decoder: input in, output out

An encoder-decoder model has two halves. The encoder is a bidirectional Transformer that reads the input. The decoder is a causal Transformer that generates the output, attending both to the encoder's representation (cross-attention) and to its own previous tokens (self-attention).
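The three attention patterns described above can be sketched as boolean masks. This is an illustrative toy in plain Python, not how any library stores its masks: encoder_decoder_masks and show are hypothetical helpers, and mask[i][j] being True means query position i may attend to key position j.

```python
def encoder_decoder_masks(src_len: int, tgt_len: int):
    """Return (encoder_self, decoder_self, cross) boolean attention masks.

    Hypothetical helper; mask[i][j] is True when position i may attend to j.
    """
    # Encoder self-attention: bidirectional, every input token sees every other.
    encoder_self = [[True] * src_len for _ in range(src_len)]
    # Decoder self-attention: causal, token i sees itself and earlier output tokens.
    decoder_self = [[j <= i for j in range(tgt_len)] for i in range(tgt_len)]
    # Cross-attention: every output position sees the whole encoded input.
    cross = [[True] * src_len for _ in range(tgt_len)]
    return encoder_self, decoder_self, cross


def show(mask):
    """Render a mask as rows of 'x' (allowed) and '.' (blocked)."""
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)


enc, dec, cross = encoder_decoder_masks(src_len=4, tgt_len=3)
print("encoder self-attention:\n" + show(enc))
print("decoder self-attention:\n" + show(dec))
print("cross-attention:\n" + show(cross))
```

The decoder mask is the only triangular one: generation is constrained left to right, while reading the input is not.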

Training objectives vary by model: T5 uses span corruption (predict deleted spans of the input); BART uses general denoising (shuffle, delete, mask, reconstruct); both teach the encoder to read and the decoder to generate a target sequence.
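As an illustration of span corruption, the sketch below builds an encoder input and a decoder target from a sentence and a list of spans to delete. It is a toy under stated assumptions: span_corrupt is a hypothetical helper, the spans are chosen by hand rather than sampled, and the <extra_id_N> sentinel naming follows the convention used by T5 tokenizers.

```python
def span_corrupt(tokens, spans):
    """Build (corrupted_input, target) for a T5-style span-corruption example.

    Hypothetical helper. spans must be sorted, non-overlapping (start, end)
    index pairs with end exclusive.
    """
    corrupted, target = [], []
    cursor = 0
    for k, (start, end) in enumerate(spans):
        sentinel = f"<extra_id_{k}>"
        corrupted.extend(tokens[cursor:start])  # keep text before the span
        corrupted.append(sentinel)              # replace the span with a sentinel
        target.append(sentinel)                 # target names the sentinel...
        target.extend(tokens[start:end])        # ...then spells out the deleted span
        cursor = end
    corrupted.extend(tokens[cursor:])           # keep text after the last span
    target.append(f"<extra_id_{len(spans)}>")   # closing sentinel ends the target
    return corrupted, target


tokens = "Thank you for inviting me to your party last week".split()
corrupted, target = span_corrupt(tokens, spans=[(2, 4), (8, 9)])
print(" ".join(corrupted))
print(" ".join(target))
```

The encoder reads the corrupted input bidirectionally; the decoder generates the target one token at a time, which is why this objective trains both halves.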

Famous examples: T5, FLAN-T5, BART, and the original Transformer, which was built for machine translation.

Great at: machine translation, dense summarisation, structured QA — anywhere input and output are clearly distinct sequences with a transformation between them.
Awkward at: open-ended chat. The encoder/decoder split fits "given X, produce Y" cleanly; it fits "have a conversation" less naturally.

Decoder-only: predict the next token, over and over

A decoder-only model uses causal (autoregressive) attention: each token can attend only to itself and the tokens that came before it. Training uses the simplest possible objective: predict the next token on raw text, with no masked-out tokens, no encoder, and no special structure.
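A small sketch of what "predict the next token" means in terms of training data. The helpers shift_for_next_token and training_examples are hypothetical names, and the example uses whitespace-split words where a real model would use subword token IDs.

```python
def shift_for_next_token(tokens):
    """Split a sequence into model inputs and next-token targets (shifted by one)."""
    inputs = tokens[:-1]   # everything except the last token
    targets = tokens[1:]   # the token that follows each input position
    return inputs, targets


def training_examples(tokens):
    """List (visible_context, target) pairs implied by causal attention.

    At position i the model sees inputs[0..i] and nothing later.
    """
    inputs, targets = shift_for_next_token(tokens)
    return [(inputs[: i + 1], targets[i]) for i in range(len(inputs))]


tokens = ["the", "cat", "sat", "down"]
for context, target in training_examples(tokens):
    print(context, "->", target)
```

One sequence of n tokens yields n - 1 prediction problems in a single forward pass, which is part of why this objective is so cheap to set up on raw text.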

Famous examples: GPT family, Llama, Qwen, Mistral, Phi, SmolLM2 — every modern chat model the Academy uses.

Great at: chat, code, free-form generation, and summarisation in a conversational style. It also has a property that few anticipated: in-context learning. Show the model a few examples in the prompt and it learns the pattern without any weight updates.
Bad at, natively: producing a single fixed-size representation of the input. You can extract one (mean-pool the last hidden states), but encoder-only models do it more cleanly.
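For reference, mean-pooling simply averages the per-token vectors, skipping padding. The sketch below uses plain Python lists so it runs anywhere; mean_pool is a hypothetical helper, and the numbers stand in for real hidden states, which would normally be tensors from the model's last layer.

```python
def mean_pool(hidden_states, attention_mask):
    """Average token vectors into one fixed-size vector, ignoring padded positions.

    Hypothetical helper. hidden_states is a list of equal-length vectors,
    attention_mask holds 1 for real tokens and 0 for padding.
    """
    dim = len(hidden_states[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(hidden_states, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                total[d] += vec[d]
    if count == 0:
        raise ValueError("no unmasked tokens to pool")
    return [t / count for t in total]


# Three token vectors; the last one is padding and must not affect the result.
hidden = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]
print(mean_pool(hidden, mask))
```

The output has the same size no matter how long the input was, which is what retrieval and clustering need.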

Why decoder-only won for chat

Three factors made the decoder-only family dominate general-purpose chat from around 2020 onward:

Simplest objective. Next-token prediction needs nothing but raw text. No mask scheme to design, no encoder/decoder split, no auxiliary losses. Easy to scale to terabytes of web data.
In-context learning emerged. The ability to follow few-shot examples in the prompt — to "learn" from context without training — appeared as a side effect of scale, and it's the basis of every modern prompt-engineering practice.
Better scaling. Empirically, decoder-only models kept improving with more parameters and more data, past the point where encoder-decoder gains flattened. The Chinchilla-era scaling work and most of what followed used decoder-only models as the substrate.
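The in-context learning mentioned in the second factor comes down to how the prompt is laid out. A minimal sketch, assuming a simple "Input / Label" text format: build_few_shot_prompt is a hypothetical helper, and the format itself is an illustrative choice rather than anything a particular model requires.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Format labelled examples plus a new query as a few-shot prompt.

    Hypothetical helper. The prompt ends right after 'Label:' so the model's
    next-token prediction fills in the answer.
    """
    blocks = [instruction] if instruction else []
    for text, label in examples:
        blocks.append(f"Input: {text}\nLabel: {label}")
    blocks.append(f"Input: {query}\nLabel:")  # left open for the model to complete
    return "\n\n".join(blocks)


prompt = build_few_shot_prompt(
    examples=[("great film", "positive"), ("waste of time", "negative")],
    query="loved every minute",
    instruction="Classify the sentiment of each input.",
)
print(prompt)
```

No weights change here: the examples live only in the context window, and the causal model continues the pattern.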

When to still pick the others

The decoder-only default is right most of the time. It is wrong when the task does not actually need generation.

Need a sentence embedding for retrieval / clustering / search? Use an encoder-only model purpose-built for it (BGE, E5). It will be cheaper and produce better embeddings than mean-pooling a decoder-only LLM.
Need a binary or multi-class classifier? A small encoder-only model with a classification head trained on your data is usually faster than fine-tuning Llama for the job, and just as accurate. A 110M-parameter BERT is hard to beat on its home turf.
Need translation between two languages? Encoder-decoder models (NLLB, FLAN-T5) are the natural shape; their pretraining objective is closer to the task.

Honest beat — don't fine-tune Llama for a classifier

The temptation to use one big decoder-only LLM for everything is real. For a classification or extraction problem with a fixed label set, an encoder-only model with a classification head will train faster, run faster, and often score higher than fine-tuning a 1B-parameter chat model into the same shape. Knowing which family to reach for is half the engineering.
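To show how little a classification head adds, here is a sketch of one in plain Python: a single linear layer plus softmax over a fixed label set, applied to a pooled embedding. The names classify and softmax are hypothetical helpers, and the weights are made-up numbers; in practice the embedding would come from the encoder and the weights would be learned from your labelled data.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, weights, bias, labels):
    """Apply a linear classification head to one pooled embedding.

    Hypothetical helper. weights has one row per label; returns the
    best label and the full probability list.
    """
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Made-up 2-dimensional embedding and head, for illustration only.
label, probs = classify(
    embedding=[1.0, 0.0],
    weights=[[2.0, 0.0], [0.0, 2.0]],
    bias=[0.0, 0.0],
    labels=["positive", "negative"],
)
print(label, probs)
```

The head produces one forward pass and one answer from a closed label set, with no token-by-token decoding, which is where much of the speed advantage comes from.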

What this Academy assumes

Every track from here forward (SFT fundamentals, Hands-on, With BrewSLM, Advanced) assumes a decoder-only causal LM. The mechanics we teach (the loss mask, chat templates, KV cache, completion-only loss) all live in that family. Most production SLM work uses decoder-only models, so this is the right default. But know the other two exist, know what they're for, and reach for them when the task fits.

Key idea

Three families, one decision — the attention pattern at pretraining. Encoder-only (bidirectional, MLM) for understanding; encoder-decoder (read then generate) for input-to-output transformation; decoder-only (causal, next-token) for chat, code, and any generation that benefits from in-context learning. We use decoder-only — but if the task doesn't need generation, the others are still the right tool.

You have now finished the Foundations track. From here you understand what a model is, how it learns, how text becomes numbers, attention and the Transformer, next-token prediction, the four levers (prompting / RAG / fine-tuning / pretraining), large vs small, the SLM project shape, the base-vs-instruct distinction, how to pick a base, the lineage from n-grams to Transformers, and the three architecture families. Track 1 goes deep on supervised fine-tuning itself.

Key terms

Encoder-only: Transformer model with bidirectional attention; reads input in parallel and produces contextual representations. BERT, RoBERTa, BGE, E5.
Encoder-decoder: Two-half model — a bidirectional encoder reads input, a causal decoder generates output with cross-attention to the encoder. T5, BART.
Decoder-only: Transformer model with causal attention; each token only attends to previous tokens. GPT, Llama, Qwen, SmolLM2. The chat default.
Masked language modelling (MLM): Pretraining objective for encoder-only models — randomly mask tokens, predict them from bidirectional context.
Span corruption / denoising: Encoder-decoder pretraining objectives — delete or corrupt spans of input, train the decoder to reconstruct them.
Causal (autoregressive) attention: Attention masked so each token can only see previous tokens; the defining feature of decoder-only models.
In-context learning: The ability, seen most clearly in large decoder-only models, to learn a pattern from examples in the prompt without any weight updates; emergent at scale.
