How language models work: next-token prediction
After this lesson you can trace a prompt all the way to generated text: how the model produces a probability for every next token, how it picks one, and how decoding settings like temperature change the output.
We've assembled the machine: text becomes tokens, tokens become embeddings, and a stack of Transformer blocks gives every token a context-aware representation. This lesson connects that final representation to the one thing a language model is trained to do — predict the next token — and shows how repeating that prediction produces paragraphs.
From the last layer to a guess about the next token
After the Transformer stack, the representation at the last position is fed into a final linear layer called the LM head. It outputs one number for every token in the vocabulary — these raw scores are called logits. A high logit means "this token is a good continuation"; a low one means "unlikely."
Logits aren't probabilities (they can be negative, and they don't sum to 1). To convert them, the model applies softmax: exponentiate each logit and normalize so they're all positive and sum to 1. The result is a probability distribution over the next token — the model's full answer to "what comes next?"
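The softmax step described above can be sketched in a few lines of plain Python. The four-token vocabulary and the logit values are made up for illustration; a real model would produce one logit per entry in a vocabulary of many thousands of tokens.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that are positive and sum to 1."""
    # Subtracting the max leaves the result unchanged but avoids overflow in exp().
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical vocabulary and logits, for illustration only.
vocab = ["cat", "dog", "the", "ran"]
logits = [2.0, 1.0, -1.0, 0.0]

probs = softmax(logits)
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

Note that the negative logit still gets a small positive probability: softmax never assigns exactly zero.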
How it learns to do this: the training objective
Pretraining is exactly the loop from Lesson 2, with one specific loss. Take an ocean of text, and at every position ask the model to predict the next token. The loss — cross-entropy — is large when the model assigned low probability to the token that actually came next, and small when it assigned high probability. Gradient descent then nudges the parameters to make the real next token more likely.
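For a single position, the loss described above can be sketched as follows. This is a toy under stated assumptions: the logits are made up, and the gradient step is applied directly to the logits rather than to model parameters, which is not how a real model is trained but shows the same effect of making the true next token more likely.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_index):
    """Negative log of the probability assigned to the true next token."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

def gradient_step(logits, target_index, lr=1.0):
    """One gradient-descent step taken directly on the logits.

    For softmax followed by cross-entropy, the gradient with respect to the
    logits is (probs - one_hot(target)).
    """
    probs = softmax(logits)
    return [
        x - lr * (p - (1.0 if i == target_index else 0.0))
        for i, (x, p) in enumerate(zip(logits, probs))
    ]

logits = [2.0, 1.0, -1.0, 0.0]  # made-up scores over a 4-token vocabulary
target = 1                      # index of the token that actually came next

loss_before = cross_entropy(logits, target)
loss_after = cross_entropy(gradient_step(logits, target), target)
print(f"loss before: {loss_before:.3f}, after one step: {loss_after:.3f}")
```

The loss drops after the step because the target token's logit went up while the others went down.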
The profound part: to get good at this one narrow game across trillions of tokens, the model is forced to internalize grammar, facts, styles, and reasoning patterns — because all of those help predict what comes next. "Just predicting the next token" is deceptively powerful. (We'll meet cross-entropy properly in Track 1; for now, "make the true next token more probable" is enough.)
Key idea
A base language model is a next-token probability machine. Everything it appears to "do" — answer, summarize, translate — is that machine producing one plausible token after another.
Generation: do it again, and again
One forward pass gives the distribution for a single next token. To produce more, generation is autoregressive: a token is picked from that distribution and appended to the sequence, and the model runs again on the longer sequence to get the following token's distribution. This repeats until an end-of-sequence token is emitted or a length limit is reached.
tokens = tokenize(prompt)
while True:
    logits = model(tokens)      # scores for the next token
    probs = softmax(logits)
    next_tok = choose(probs)    # see "decoding" below
    tokens.append(next_tok)
    if next_tok == EOS or len(tokens) >= max_len:
        break                   # end-of-sequence token or length limit
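The loop above can be made runnable with a stand-in for the model. Everything specific here is an assumption for illustration: `toy_model` is a hypothetical lookup table that only looks at the last token (a real model uses the whole sequence), the five-token vocabulary is invented, and `choose` is greedy so the run is deterministic.

```python
import math

VOCAB = ["<eos>", "the", "cat", "sat", "down"]
EOS = 0

# Hypothetical scores: row = id of the last token, column = candidate next token.
BIGRAM_LOGITS = [
    [0.0, 0.0, 0.0, 0.0, 0.0],     # after "<eos>" (never used)
    [-2.0, -1.0, 3.0, 0.0, 0.0],   # after "the", "cat" scores highest
    [-1.0, 0.0, -1.0, 3.0, 0.0],   # after "cat", "sat" scores highest
    [0.0, 0.0, -1.0, -1.0, 3.0],   # after "sat", "down" scores highest
    [3.0, 0.0, 0.0, 0.0, -1.0],    # after "down", "<eos>" scores highest
]

def toy_model(tokens):
    """Stand-in for a forward pass: return logits for the next token."""
    return BIGRAM_LOGITS[tokens[-1]]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def choose(probs):
    """Greedy choice: index of the most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):      # length limit
        probs = softmax(toy_model(tokens))
        next_tok = choose(probs)
        tokens.append(next_tok)
        if next_tok == EOS:              # end-of-sequence token
            break
    return tokens

output = generate([1])  # prompt is the single token "the"
print(" ".join(VOCAB[t] for t in output))  # prints "the cat sat down <eos>"
```

Each pass through the loop is one forward pass; the only thing carried between passes is the growing token list.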
Decoding: how a token is chosen
That choose step is decoding, and it has settings you'll tune:
- Greedy: always take the single most probable token (argmax). Deterministic and safe, but can be repetitive and bland.
- Sampling: randomly draw a token according to the probabilities — more varied, more creative, occasionally wrong.
- Temperature: a knob that sharpens or flattens the distribution before sampling, by dividing the logits by a temperature T before softmax. T = 1 leaves the distribution unchanged; low temperature (→0) approaches greedy and stays "safe"; high temperature spreads probability out and increases surprise.
- Top-p (nucleus): only sample from the smallest set of most-probable tokens whose probabilities add up to at least p (e.g., 0.9), discarding the long tail of unlikely tokens and renormalizing the rest. Curbs nonsense while keeping variety.
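The four settings in the list above can be sketched with the standard library alone. The function names, the example logits, and the fixed seed are assumptions for illustration; production libraries implement the same ideas on tensors and may differ in details such as tie-breaking.

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by the temperature (must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(probs):
    """Index of the single most probable token."""
    return max(range(len(probs)), key=lambda i: probs[i])

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of most-probable tokens whose total reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / cumulative  # renormalize the survivors
    return filtered

def sample(probs, rng):
    """Draw one token index at random, weighted by its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, -1.0, 0.0]  # made-up scores over a 4-token vocabulary
rng = random.Random(0)          # fixed seed so repeated runs match

base = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # more mass on the top token
flat = softmax(logits, temperature=2.0)   # mass spread more evenly
nucleus = top_p_filter(base, p=0.9)       # least likely token is zeroed out
token = sample(nucleus, rng)

print("greedy pick:", greedy(base))
print("top-token probability at T=0.5, 1.0, 2.0:",
      round(max(sharp), 3), round(max(base), 3), round(max(flat), 3))
print("nucleus distribution:", [round(x, 3) for x in nucleus])
```

Temperature and top-p compose: a common pattern is to apply temperature first, then filter with top-p, then sample from what remains.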
Heads up
Decoding does not change the model's parameters; it only changes how a token is picked from the logits the model produces. For evaluation you usually want greedy (deterministic) decoding so results are reproducible; for creative generation you raise temperature or top-p.
What this means for fine-tuning
Fine-tuning, the subject of the rest of this course, works on exactly this machine. Supervised fine-tuning (SFT) shows the model examples of the inputs you care about and the outputs you want, and runs the same next-token training so that, for your task, the distribution it produces favors the responses you want. You are not bolting on a new ability; you are reshaping the next-token probabilities a base model already has.
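To make "the same next-token training" concrete, here is a minimal sketch of how one input/output pair could be unrolled into next-token examples. The function name and the word-level "tokens" are hypothetical, and using only response tokens as targets is an assumption of this sketch, not something this lesson specifies.

```python
def next_token_examples(prompt_tokens, response_tokens):
    """Turn one (prompt, response) pair into next-token training examples.

    Each example is (context, target): the tokens seen so far and the token
    the model should learn to make more probable. Only response tokens are
    used as targets here, which is an assumption of this sketch.
    """
    sequence = list(prompt_tokens) + list(response_tokens)
    start = len(prompt_tokens)
    return [(sequence[:i], sequence[i]) for i in range(start, len(sequence))]

# Made-up word-level tokens; a real tokenizer would produce subword ids.
prompt = ["Translate", "to", "French", ":", "cat", "->"]
response = ["chat", "<eos>"]

for context, target in next_token_examples(prompt, response):
    print(context, "=>", target)
```

Each printed line is one prediction problem of the kind used in pretraining; only the data has changed.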
That's the full Track-0 mechanism. The remaining two lessons zoom out: the different ways to steer a model (prompting, RAG, fine-tuning, pretraining), and why small models are worth the effort.
Key terms
- LM head: The final layer producing a score (logit) for every vocabulary token.
- Logits: Raw, unnormalized scores for each possible next token.
- Softmax: Turns logits into a probability distribution that sums to 1.
- Cross-entropy: The training loss; small when the true next token was given high probability.
- Autoregressive generation: Producing text by repeatedly predicting and appending the next token.
- Greedy / sampling: Take the most likely token, or randomly draw according to probabilities.
- Temperature / top-p: Decoding knobs that flatten/sharpen the distribution and trim the unlikely tail.