Track 0 · Foundations · Lesson 5

Attention & the Transformer, gently

After this lesson you can explain, in plain language, what attention does, why "self-attention" lets a model use context, what multi-head means, and how a Transformer block is assembled — enough to read the rest of the course without fear.

Level: beginner · Read time: ~11 min · Prerequisites: Tokens & embeddings

We now have a sequence of token embeddings and a way to train networks. But a plain stack of layers treats each token in isolation — and language is all about relationships. In "the trophy didn't fit in the suitcase because it was too big," the word "it" means the trophy, and you only know that by looking at the rest of the sentence. The mechanism that lets a model do this look-around is attention, and the architecture built around it is the Transformer. This lesson stays at the intuition level; you do not need the math to use it.

The problem attention solves

To produce a good representation of a token, the model should mix in information from the other tokens that are relevant to it — and ignore the ones that aren't. "It" needs to gather meaning from "trophy"; "big" needs to gather from "trophy" and "suitcase." Different tokens need different neighbors. Attention is a learnable, content-based way to decide, for each token, which other tokens to pull from and how much.

Self-attention, in three moves

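For every token, the model derives three vectors from its embedding (each via learned weights):

Query: what this token is looking for.
Key: what this token advertises about itself, so that other tokens' queries can match against it.
Value: the information this token hands over when it is chosen.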

Then, for a given token, the model compares its query against every token's key to get a relevance score, turns those scores into weights that sum to 1 (a softmax — the same operation we'll use for next-token probabilities), and takes the weighted average of all the values. The result replaces the token's representation with a blend of the information it found most relevant. Because every token attends to tokens in the same sequence, this is called self-attention.
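The three moves above (score, softmax, weighted average) can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation. The key, value, and query vectors are hand-picked toy numbers rather than learned ones, the function names are made up for this example, and dividing the scores by the square root of the vector length is a common convention assumed here rather than something stated in this lesson.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product: larger when two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Blend `values` according to how well `query` matches each key."""
    scale = math.sqrt(len(query))  # assumed scaling convention
    scores = [dot(query, k) / scale for k in keys]   # move 1: relevance scores
    weights = softmax(scores)                        # move 2: weights summing to 1
    dim = len(values[0])
    blended = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(dim)]                  # move 3: weighted average
    return weights, blended

# Toy, hand-picked vectors for three tokens (hypothetical, not learned).
tokens = ["trophy", "suitcase", "it"]
keys = [[4.0, 0.0], [0.0, 4.0], [1.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query_it = [2.0, 0.0]  # "it" is asking for something trophy-like

weights, blended = attend(query_it, keys, values)
for token, weight in zip(tokens, weights):
    print(f"{token:>9}: {weight:.3f}")
```

With these toy numbers almost all of the weight for "it" lands on "trophy", so the blended vector ends up close to the value vector for "trophy". In a trained model the same arithmetic runs, but the vectors come from learned projections.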

Key idea

Attention is a soft, learned lookup: each token asks a question (query), every token advertises (key), and the answer is a relevance-weighted mix of what they offer (values). Nothing is hand-wired — the query/key/value projections are parameters trained by gradient descent.

[Figure: the tokens "the trophy did … because it", where "it" attends strongly to "trophy" (thick line) and weakly to the others (dashed lines). Attention weights are learned and depend on the content of each token.]
A single attention pattern for the token "it." The weights are computed from queries and keys, not written by a human.

Many heads, looking for different things

A single attention pattern can capture only one kind of relationship at a time. Real language has many at once: subject–verb agreement, pronoun reference, adjective–noun, topic. So the model runs several attention computations in parallel, called multi-head attention, each with its own query/key/value weights and free to specialize in a different pattern. Their outputs are concatenated and passed through one more learned projection, giving a single vector per token. Heads aren't told what to specialize in; they differentiate during training.
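As a rough sketch of the idea, the snippet below runs two heads over the same three token embeddings. The weights are random stand-ins for learned parameters, and the sizes (`d_model`, `n_heads`) and helper names are assumptions made for illustration. Joining the heads by concatenation followed by a final projection is the usual convention and is assumed here.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def head_output(embeddings, wq, wk, wv):
    """One attention head: project to Q/K/V, score, softmax, blend."""
    queries = [matvec(wq, e) for e in embeddings]
    keys = [matvec(wk, e) for e in embeddings]
    values = [matvec(wv, e) for e in embeddings]
    scale = math.sqrt(len(queries[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, n_heads = 4, 2            # toy sizes, chosen for illustration
d_head = d_model // n_heads
embeddings = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]

# Each head gets its own Q/K/V weights (random here, learned in practice).
heads = [tuple(random_matrix(d_head, d_model, rng) for _ in range(3))
         for _ in range(n_heads)]
per_head = [head_output(embeddings, wq, wk, wv) for wq, wk, wv in heads]

# Combine: concatenate the heads' outputs per token, then apply a final projection.
concatenated = [sum((head[t] for head in per_head), [])
                for t in range(len(embeddings))]
w_out = random_matrix(d_model, d_model, rng)
combined = [matvec(w_out, c) for c in concatenated]
```

Each head works in a smaller space (`d_head`), so running several heads costs about the same as one full-width head while allowing several attention patterns at once.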

The Transformer block

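Attention is the centerpiece, but a Transformer block wraps it with a few supporting parts, and the model stacks many such blocks (a small model might have a dozen; a large one, a hundred):

Self-attention: every token gathers context from the other tokens in the sequence.
Feed-forward network: a small network applied to each token separately, to process what attention gathered.
Residual connections: each part's output is added to its input instead of replacing it, so information and gradients can flow through a deep stack.
Layer normalization: rescales each token's vector so values stay in a stable range.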

You don't need to memorize these. The thing to retain: a Transformer is a tall stack of blocks, each of which lets every token gather context (attention) and then think about it (feed-forward), with residuals and normalization keeping the whole tower trainable.
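To make the assembly concrete, here is a small sketch of one block and a stack of three. Several things are simplifying assumptions rather than details from this lesson: the attention step uses each vector as its own query, key, and value (no learned projections), normalization is applied before each part (one common ordering), the feed-forward network uses a ReLU, and all weights are random stand-ins for trained ones.

```python
import math
import random

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(vec, eps=1e-5):
    """Rescale one token's vector to zero mean and (roughly) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def self_attention(seq):
    """Simplified attention: each vector is its own query, key and value."""
    scale = math.sqrt(len(seq[0]))
    out = []
    for q in seq:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in seq])
        out.append([sum(w * v[i] for w, v in zip(weights, seq))
                    for i in range(len(q))])
    return out

def feed_forward(vec, w_in, w_out):
    """Two-layer network applied to a single token, with a ReLU in between."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in w_in]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_out]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def transformer_block(seq, w_in, w_out):
    # 1) Gather context: normalize, attend, then add back (residual).
    attended = self_attention([layer_norm(x) for x in seq])
    seq = [add(x, a) for x, a in zip(seq, attended)]
    # 2) Think about it: normalize, feed-forward, then add back (residual).
    return [add(x, feed_forward(layer_norm(x), w_in, w_out)) for x in seq]

rng = random.Random(0)
d_model, d_hidden, n_blocks = 4, 8, 3   # toy sizes, chosen for illustration
sequence = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(5)]
blocks = [([[rng.uniform(-0.5, 0.5) for _ in range(d_model)] for _ in range(d_hidden)],
           [[rng.uniform(-0.5, 0.5) for _ in range(d_hidden)] for _ in range(d_model)])
          for _ in range(n_blocks)]

hidden_states = sequence
for w_in, w_out in blocks:  # the stack: each block's output feeds the next
    hidden_states = transformer_block(hidden_states, w_in, w_out)
```

Note that the block returns one vector per token with the same width it received, which is what lets blocks be stacked freely.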

One detail that matters for generation: causal masking

The language models in this course are decoder-only (also called causal). When predicting the next token, each token may attend only to itself and the tokens before it, never to those ahead. This causal mask is what makes "predict the next token" a fair game: the model can't peek at the answer it's supposed to produce. It's why the same trained network can both read a prompt and continue it, one token at a time.
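One common way to apply a causal mask, sketched below, is to set the score for every "future" position to negative infinity before the softmax, so those positions receive exactly zero weight. The function name and the uniform toy scores are made up for illustration. The sketch follows the usual convention that a token may attend to itself as well as to earlier tokens.

```python
import math

def causal_attention_weights(scores):
    """scores[i][j] is how relevant token j is to token i.

    Returns attention weights where token i ignores every token j > i.
    """
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions get -inf, which becomes weight 0 after softmax.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        peak = max(masked)  # always finite: position i itself is never masked
        exps = [math.exp(s - peak) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy input: four tokens that all look equally relevant to each other.
scores = [[0.0] * 4 for _ in range(4)]
weights = causal_attention_weights(scores)
for row in weights:
    print([round(w, 2) for w in row])
```

With equal scores, the first token can attend only to itself, the second splits its weight over two positions, and so on. The zeros above the diagonal are the mask at work.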

Why it scales

Attention compares every token with every other token, so its cost grows with the square of the sequence length. That is heavy, but the comparisons are independent of one another and can run in parallel, and they capture long-range relationships that a word-by-word model would miss. That combination is why Transformers, not older recurrent networks, power modern language models.
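A quick back-of-the-envelope calculation shows how the all-to-all comparison grows. The helper below is a hypothetical illustration that counts query-key scores for a single attention computation and ignores everything else a real model does.

```python
def attention_score_count(n_tokens):
    """Number of query-key comparisons when every token scores every token."""
    return n_tokens * n_tokens

# Doubling the sequence length quadruples the number of scores.
counts = {n: attention_score_count(n) for n in (1_000, 2_000, 4_000)}
for n, count in counts.items():
    print(f"{n:>5} tokens -> {count:>10,} scores")
```

Because each of those scores can be computed without waiting for the others, hardware that does many operations at once can handle them together, whereas a word-by-word model has to process positions in order.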

Where we are

We can now see the whole pipe: text → tokens → embeddings → a stack of Transformer blocks that contextualize each token. The only missing piece is how the model turns that final per-token representation into an actual next-token prediction — and how it generates text from there. That is the next lesson, and it ties Track 0 together.

Key terms

Attention
A learned, content-based mechanism for a token to pull information from relevant tokens.
Self-attention
Attention where tokens attend to other tokens in the same sequence.
Query / key / value
Per-token vectors: what I seek, what I offer, what I hand over when chosen.
Multi-head attention
Several attention computations in parallel, each specializing in a different relationship.
Transformer block
Self-attention + feed-forward network + residual connections + layer norm; stacked many times.
Causal mask (decoder-only)
Restricting each token to attend only to itself and earlier tokens, so next-token prediction is fair.
