Attention & the Transformer, gently
After this lesson you can explain, in plain language, what attention does, why "self-attention" lets a model use context, what multi-head means, and how a Transformer block is assembled — enough to read the rest of the course without fear.
We now have a sequence of token embeddings and a way to train networks. But a plain stack of layers treats each token in isolation — and language is all about relationships. In "the trophy didn't fit in the suitcase because it was too big," the word "it" means the trophy, and you only know that by looking at the rest of the sentence. The mechanism that lets a model do this look-around is attention, and the architecture built around it is the Transformer. This lesson stays at the intuition level; you do not need the math to use it.
The problem attention solves
To produce a good representation of a token, the model should mix in information from the other tokens that are relevant to it — and ignore the ones that aren't. "It" needs to gather meaning from "trophy"; "big" needs to gather from "trophy" and "suitcase." Different tokens need different neighbors. Attention is a learnable, content-based way to decide, for each token, which other tokens to pull from and how much.
Self-attention, in three moves
For every token, the model derives three vectors from its embedding (each via learned weights):
- a query — "what am I looking for?"
- a key — "what do I offer to others?"
- a value — "the information I'll hand over if chosen."
Then, for a given token, the model compares its query against every token's key (its own included) to get a relevance score for each, turns those scores into weights that sum to 1 (a softmax, the same operation we'll use for next-token probabilities), and takes the weighted average of all the values. That average becomes the token's new representation: a blend of the information it found most relevant. Because every token attends to tokens in the same sequence, this is called self-attention.
Key idea
Attention is a soft, learned lookup: each token asks a question (query), every token advertises (key), and the answer is a relevance-weighted mix of what they offer (values). Nothing is hand-wired — the query/key/value projections are parameters trained by gradient descent.
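If you like to see ideas as code, here is a minimal sketch of the three moves in plain Python. It is an illustration under stated assumptions, not how any particular library implements attention. The lesson does not name the comparison function, so the sketch assumes the common choice: a dot product divided by the square root of the key size. The matrices `w_q`, `w_k`, and `w_v` are hand-picked toy values standing in for learned parameters, and every function name is made up for this example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(vector, matrix):
    """Multiply a vector by a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(embeddings, w_q, w_k, w_v):
    # Move 1: derive a query, key, and value for every token.
    queries = [project(e, w_q) for e in embeddings]
    keys = [project(e, w_k) for e in embeddings]
    values = [project(e, w_v) for e in embeddings]
    scale = math.sqrt(len(keys[0]))  # assumed scaling, see the lead-in

    outputs, all_weights = [], []
    for q in queries:
        # Move 2: score this query against every key, then softmax.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Move 3: relevance-weighted average of all the values.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy numbers: three tokens with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned weights
outputs, weights = self_attention(embeddings, identity, identity, identity)
for row in weights:
    print([round(w, 2) for w in row])
```

With these toy weights, the first token's query matches its own key and the third token's key equally well, so those two receive equal weight and the second token receives less. In a trained model the three matrices differ from one another and are set by gradient descent.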
Many heads, looking for different things
A single attention computation gives each token just one set of weights, so it tends to capture one kind of relationship at a time. Real language has many at once: subject–verb agreement, pronoun reference, adjective–noun, topic. So the model runs several attention computations in parallel (multi-head attention), each with its own query/key/value weights and free to specialize in a different pattern. Their outputs are then joined and blended by one more learned projection. Heads aren't told what to specialize in; they differentiate during training.
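A rough sketch of the multi-head idea in plain Python follows. It assumes the common recipe for combining heads: concatenate each head's output for a token and pass the result through one more matrix (`w_out`). The sizes (`d_model = 4`, two heads), the random weights, and all function names are made up for illustration; in a real model the weights are learned rather than random.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(vector, matrix):
    """Multiply a vector by a matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def attention_head(embeddings, w_q, w_k, w_v):
    """One attention computation with its own query/key/value weights."""
    queries = [project(e, w_q) for e in embeddings]
    keys = [project(e, w_k) for e in embeddings]
    values = [project(e, w_v) for e in embeddings]
    scale = math.sqrt(len(keys[0]))  # assumed scaled dot-product scoring
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def multi_head_attention(embeddings, heads, w_out):
    # Every head sees the same tokens but uses different weights.
    per_head = [attention_head(embeddings, *head) for head in heads]
    combined = []
    for t in range(len(embeddings)):
        # Join the heads' outputs for token t, then blend them with w_out.
        concatenated = [x for head_out in per_head for x in head_out[t]]
        combined.append(project(concatenated, w_out))
    return combined

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)  # fixed seed so the toy run is repeatable
d_model, n_heads = 4, 2
d_head = d_model // n_heads
heads = [tuple(random_matrix(d_head, d_model, rng) for _ in range(3))
         for _ in range(n_heads)]
w_out = random_matrix(d_model, n_heads * d_head, rng)
embeddings = [[rng.uniform(-1.0, 1.0) for _ in range(d_model)] for _ in range(3)]

mixed = multi_head_attention(embeddings, heads, w_out)
print(len(mixed), len(mixed[0]))  # 3 tokens in, 3 tokens out, each of size 4
```

Each head here works in a smaller space (`d_head = 2`), so running two heads costs about the same as one full-width head. That split is a typical design choice, not something the lesson prescribes.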
The Transformer block
Attention is the centerpiece, but a Transformer block wraps it with a few supporting parts, and the model stacks many such blocks (a small model might have a dozen; a large one, a hundred):
- Multi-head self-attention — mix information across tokens (what we just covered).
- A feed-forward network — a small two-layer neural net (from Lesson 3) applied to each token independently, to transform the mixed representation.
- Residual connections — add each sub-layer's input to its output, so information and gradients flow cleanly through a deep stack.
- Layer normalization — rescales activations to keep training stable.
You don't need to memorize these. The thing to retain: a Transformer is a tall stack of blocks, each of which lets every token gather context (attention) and then think about it (feed-forward), with residuals and normalization keeping the whole tower trainable.
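For readers who want to see the wiring, the sketch below assembles one block from the parts listed above and stacks it three times. Several things are assumptions for illustration. The lesson does not say where normalization sits, so the sketch normalizes before each sub-layer, which is one common arrangement. The layer norm has no learned scale or shift. `uniform_attention` is a stand-in that simply averages all tokens, not real multi-head attention, and the feed-forward weights `W1` and `W2` are hand-picked.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale one token's vector to zero mean and (about) unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def transformer_block(tokens, attend, feed_forward):
    # Sub-layer 1: attention mixes information across tokens.
    normed = [layer_norm(t) for t in tokens]
    attended = attend(normed)
    tokens = [add(t, a) for t, a in zip(tokens, attended)]  # residual connection
    # Sub-layer 2: feed-forward net, applied to each token independently.
    normed = [layer_norm(t) for t in tokens]
    return [add(t, feed_forward(n)) for t, n in zip(tokens, normed)]  # residual

def uniform_attention(tokens):
    """Stand-in for self-attention: every token receives the plain average."""
    n = len(tokens)
    avg = [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]
    return [avg for _ in tokens]

# Hand-picked toy weights for a two-layer net: 2 -> 3 -> 2.
W1 = [[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]]
W2 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]

def feed_forward(vector):
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vector))) for row in W1]  # ReLU
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]

tokens = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
for _ in range(3):  # a "tower" of three blocks
    tokens = transformer_block(tokens, uniform_attention, feed_forward)
print(len(tokens), len(tokens[0]))  # the shape is unchanged: 3 tokens of size 2
```

Because each block returns the same shape it receives, blocks can be stacked as deep as you like. The residual additions mean that a sub-layer which outputs zeros leaves the token untouched.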
One detail that matters for generation: causal masking
The language models in this course are decoder-only (also called causal). When predicting the next token, each token may attend only to itself and the tokens before it, never to the ones ahead. This causal mask is what makes "predict the next token" a fair game: the model can't peek at the answer it's supposed to produce. It's why the same trained network can both read a prompt and continue it, one token at a time.
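A small sketch shows what the mask does to attention weights. The scores are invented numbers and `causal_softmax` is a hypothetical helper. The sketch lets each position see itself and everything earlier, which is the usual convention (otherwise the first token would have nothing to attend to). Real implementations typically get the same effect by setting the hidden scores to negative infinity before the softmax; slicing is used here only because it is easier to read.

```python
import math

def causal_softmax(scores, position):
    """Softmax over scores[0..position]; later positions get weight 0."""
    visible = scores[: position + 1]
    m = max(visible)
    exps = [math.exp(s - m) for s in visible]
    total = sum(exps)
    hidden_count = len(scores) - position - 1
    return [e / total for e in exps] + [0.0] * hidden_count

# Invented relevance scores: row i holds token i's score for every token.
scores = [
    [0.0, 2.0, 1.0],
    [1.0, 0.0, 3.0],
    [0.5, 0.5, 0.5],
]
weights = [causal_softmax(row, i) for i, row in enumerate(scores)]
for row in weights:
    print([round(w, 2) for w in row])
```

The first token can only see itself, so all of its weight lands there even though its raw score for the second token was higher. Only the last token spreads weight over the whole sequence.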
Why it scales
Attention compares every token with every other token, so its cost grows with the square of the sequence length. That is heavy, but all of those comparisons can be computed in parallel, and they capture long-range relationships that a word-by-word model would miss. That combination is why Transformers, not older recurrent networks, power modern language models.
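To put a number on "heavy", the tiny sketch below counts the query-key comparisons for one attention computation when every token is compared with every token. `attention_score_count` is a made-up helper, and the count ignores the number of heads and layers, which multiply it further.

```python
def attention_score_count(n_tokens):
    """Query-key comparisons when every token is compared with every token."""
    return n_tokens * n_tokens

for n in (10, 100, 1000, 2000):
    print(f"{n:>5} tokens -> {attention_score_count(n):>9,} comparisons")
```

Doubling the sequence from 1000 to 2000 tokens quadruples the work, from one million comparisons to four million.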
Where we are
We can now see the whole pipe: text → tokens → embeddings → a stack of Transformer blocks that contextualize each token. The only missing piece is how the model turns that final per-token representation into an actual next-token prediction — and how it generates text from there. That is the next lesson, and it ties Track 0 together.
Key terms
- Attention: a learned, content-based mechanism for a token to pull information from relevant tokens.
- Self-attention: attention where tokens attend to tokens in the same sequence.
- Query / key / value: per-token vectors for what I seek, what I offer, and what I hand over when chosen.
- Multi-head attention: several attention computations in parallel, each free to specialize in a different relationship.
- Transformer block: self-attention + feed-forward network + residual connections + layer norm; stacked many times.
- Causal mask (decoder-only): restricting each token to attend only to itself and earlier tokens, so next-token prediction is fair.