Cross-entropy loss for token prediction
After this lesson you can explain, intuitively, what cross-entropy loss measures for token prediction, why it punishes confident wrong answers, and how it relates to perplexity.
We've said "the loss" many times. For language-model training — pretraining and SFT alike — that loss is almost always cross-entropy. You don't need the calculus, but you should understand exactly what it rewards and punishes, because reading the loss is a daily skill.
The setup
At each completion position the model outputs a probability distribution over the whole vocabulary (logits → softmax, from Track 0). There is one correct next token. Cross-entropy looks only at the probability the model assigned to that correct token, and the loss for that position is the negative logarithm of that probability.
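The setup above can be sketched in a few lines of plain Python. This is a minimal illustration, not how any particular training framework implements it: the function names `softmax` and `token_loss`, the 4-token vocabulary, and the logit values are all made up for the example.

```python
import math

def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_loss(logits, correct_index):
    """Cross-entropy at one position: -log of the correct token's probability."""
    probs = softmax(logits)
    return -math.log(probs[correct_index])

# Hypothetical 4-token vocabulary; the correct next token is index 2.
logits = [1.0, 0.5, 3.0, -1.0]
loss = token_loss(logits, correct_index=2)
print(f"loss at this position: {loss:.3f}")
```

Note that the other three probabilities never appear in the loss directly. They matter only because they compete with the correct token for the same total probability of 1.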
What the logarithm does
The negative-log shape is the whole point:
- Assign the correct token probability 1.0 → loss 0 (perfect).
- Assign it 0.5 → a small positive loss (about 0.69, using the natural log).
- Assign it 0.01 → a large loss (about 4.6).
- Assign it near 0 → loss shoots toward infinity.
So cross-entropy doesn't just want the right token to be likely; it punishes confident mistakes harshly. Because the probabilities sum to 1, putting nearly all the mass on a wrong token leaves almost none for the correct one, so being very sure of the wrong token costs far more than being unsure. This pressure pushes the model toward being well-calibrated, not just frequently right. The total loss is the average of this per-token loss over all the (unmasked) completion tokens.
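To make the shape of the penalty concrete, here is a small sketch that evaluates the negative log at a few probabilities and compares an unsure model with a confidently wrong one. The helper name `token_loss` and the specific probabilities are illustrative assumptions, and the natural log is assumed throughout.

```python
import math

def token_loss(p_correct):
    """Per-token cross-entropy: -log(p), written as log(1/p) so p = 1 gives a clean 0.0."""
    return math.log(1.0 / p_correct)

# Loss for the correct token at several probabilities.
for p in (1.0, 0.5, 0.01, 1e-6):
    print(f"p(correct) = {p:g}  ->  loss = {token_loss(p):.2f}")

# Two ways to be wrong over a hypothetical 4-token vocabulary:
unsure_loss = token_loss(0.25)            # mass spread evenly; correct token gets 0.25
confident_wrong_loss = token_loss(0.001)  # nearly all mass on a wrong token

print(f"unsure: {unsure_loss:.2f}, confidently wrong: {confident_wrong_loss:.2f}")
```

In this example the confidently wrong prediction costs roughly five times as much as the unsure one, even though both would count as "wrong" under a plain accuracy metric.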
Key idea
Cross-entropy = "how surprised was the model by the correct next token?" Low loss means it found the true continuation unsurprising. Training drives the model to assign high probability to the tokens that actually occur in your examples.
Why this is exactly the SFT objective
SFT applies this loss to your completion tokens (via the loss mask). Minimizing it makes the model assign high probability to your demonstrated answers — which is precisely "learn to produce these outputs." Pretraining used the same loss over raw text; SFT just points it at curated examples.
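A rough sketch of how a loss mask restricts the average to completion tokens follows. The function `masked_mean_loss`, the probabilities, and the mask are hypothetical values chosen for illustration; real training code works on batched tensors, but the arithmetic is the same idea.

```python
import math

def masked_mean_loss(correct_probs, loss_mask):
    """Average -log(p) over positions where the mask is 1 (completion tokens)."""
    losses = [-math.log(p) for p, keep in zip(correct_probs, loss_mask) if keep]
    return sum(losses) / len(losses)  # assumes at least one unmasked position

# Hypothetical sequence: 2 prompt tokens (mask 0) then 3 completion tokens (mask 1).
# Each entry is the probability the model gave to the correct token at that position.
correct_probs = [0.20, 0.30, 0.90, 0.50, 0.01]
loss_mask     = [0,    0,    1,    1,    1]

sft_loss = masked_mean_loss(correct_probs, loss_mask)
print(f"mean loss over completion tokens: {sft_loss:.2f}")
```

The two prompt positions contribute nothing, and the single badly predicted completion token (probability 0.01) accounts for most of the average.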
Perplexity: cross-entropy in friendlier units
You'll often see perplexity reported alongside loss. Perplexity is just the exponential of the cross-entropy (with the loss measured using the natural log, as most training frameworks do), and it has an intuitive reading: roughly "how many tokens is the model effectively choosing among?" A perplexity of 1 means perfect certainty about the next token; a perplexity of 10 means the model is about as uncertain as if it were picking uniformly among 10 tokens. It moves in lockstep with the loss (lower is better) but is sometimes easier to reason about. Don't over-index on it, though: for an SFT task what ultimately matters is the gold-set metric, not the loss or perplexity per se.
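The "how many tokens is it choosing among" reading can be checked directly. The sketch below assumes the loss is in natural-log units; `perplexity` and `uniform_loss` are names invented for this example.

```python
import math

def perplexity(mean_loss):
    """Perplexity is the exponential of the mean cross-entropy (natural-log units)."""
    return math.exp(mean_loss)

def uniform_loss(k):
    """Loss when the model spreads probability evenly over k tokens."""
    return -math.log(1.0 / k)

# A model that is certain of every token has loss 0 and perplexity 1.
print(perplexity(0.0))

# A model guessing uniformly among k tokens has perplexity of about k.
for k in (2, 10, 50):
    print(k, round(perplexity(uniform_loss(k)), 6))
```

Because the exponential is monotonic, any drop in loss is also a drop in perplexity. The two numbers carry the same information on different scales.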
Now you can read a training curve meaningfully. Next we tune the single knob that most controls whether that curve descends nicely: the learning rate.
Key terms
- Cross-entropy: The next-token loss: the negative log of the probability the model gave the correct token.
- Negative log likelihood: Another name for the same quantity; small when correct-token probability is high.
- Confidence penalty: Cross-entropy punishes confident wrong answers far more than uncertain ones.
- Perplexity: The exponential of cross-entropy; roughly how many tokens the model is choosing among (1 = certain).
Check yourself