Track 1 · SFT fundamentals · Lesson 10

Cross-entropy loss for token prediction

After this lesson you can explain, intuitively, what cross-entropy loss measures for token prediction, why it punishes confident wrong answers, and how it relates to perplexity.

Level: beginner · Read time: ~9 min · Prerequisites: The training loop, step by step

We've said "the loss" many times. For language-model training — pretraining and SFT alike — that loss is almost always cross-entropy. You don't need the calculus, but you should understand exactly what it rewards and punishes, because reading the loss is a daily skill.

The setup

At each completion position the model outputs a probability distribution over the whole vocabulary (logits → softmax, from Track 0). There is one correct next token. Cross-entropy looks only at the probability the model assigned to that correct token, and the loss for that position is the negative logarithm of that probability.
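The setup above can be sketched in a few lines of plain Python. This is a minimal illustration, not how a training framework implements it: the four-token vocabulary, the logit values, and the helper names `softmax` and `token_cross_entropy` are all made up for the example, and the natural logarithm is assumed.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def token_cross_entropy(logits, correct_index):
    """Loss at one position: -log of the probability given to the correct token."""
    probs = softmax(logits)
    return -math.log(probs[correct_index])

# Hypothetical 4-token vocabulary; the correct next token is index 2.
logits = [1.0, 0.5, 3.0, -1.0]
probs = softmax(logits)
loss = token_cross_entropy(logits, correct_index=2)
print(f"p(correct) = {probs[2]:.3f}, loss = {loss:.3f}")
```

Note that the probabilities given to the three incorrect tokens never appear in the loss directly; they matter only because they take probability mass away from the correct one.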

What the logarithm does

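The negative-log shape is the whole point: when the model gives the correct token a probability near 1, the loss is near 0, and as that probability falls toward 0, the loss grows without bound.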

So cross-entropy doesn't just want the right token to be likely — it punishes confident mistakes harshly. Being very sure of the wrong token is far worse than being unsure. This is what pushes the model to be well-calibrated, not just frequently right. The total loss is the average of this per-token loss over all the (unmasked) completion tokens.
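To put numbers on the asymmetry, the sketch below compares the per-token loss in three situations. The probabilities are illustrative choices, `loss_from_prob` is a hypothetical helper, and the natural logarithm is assumed.

```python
import math

def loss_from_prob(p_correct):
    """Per-token cross-entropy given the probability of the correct token."""
    return -math.log(p_correct)

# Illustrative probabilities the model might give the correct token.
cases = {
    "confident and right": 0.99,
    "unsure": 0.50,
    "confident and wrong": 0.01,  # most of the mass went to some other token
}

losses = {name: loss_from_prob(p) for name, p in cases.items()}
for name, value in losses.items():
    print(f"{name:>20}: loss = {value:.3f}")
```

The three losses come out at roughly 0.01, 0.69, and 4.61. Dropping from 0.99 to 0.50 costs about 0.68, while dropping from 0.50 to 0.01 costs almost 4 more, which is the "confident mistakes are punished harshly" effect.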

Key idea

Cross-entropy = "how surprised was the model by the correct next token?" Low loss means it found the true continuation unsurprising. Training drives the model to assign high probability to the tokens that actually occur in your examples.

Why this is exactly the SFT objective

SFT applies this loss to your completion tokens (via the loss mask). Minimizing it makes the model assign high probability to your demonstrated answers — which is precisely "learn to produce these outputs." Pretraining used the same loss over raw text; SFT just points it at curated examples.
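A rough sketch of how the loss mask enters the average, assuming the per-position probabilities of the correct token have already been computed. The function name `masked_mean_loss`, the probabilities, and the 0/1 mask layout are hypothetical; real training code does this on tensors, but the arithmetic is the same.

```python
import math

def masked_mean_loss(correct_token_probs, loss_mask):
    """Average -log(p) over positions where the mask is 1 (completion tokens)."""
    losses = [
        -math.log(p)
        for p, keep in zip(correct_token_probs, loss_mask)
        if keep
    ]
    if not losses:
        raise ValueError("loss mask selects no tokens")
    return sum(losses) / len(losses)

# Hypothetical sequence: 3 prompt tokens (mask 0), then 3 completion tokens (mask 1).
probs = [0.2, 0.1, 0.3, 0.9, 0.6, 0.8]
mask = [0, 0, 0, 1, 1, 1]

sft_loss = masked_mean_loss(probs, mask)
print(f"SFT loss over completion tokens: {sft_loss:.3f}")
```

Because the prompt positions are masked out, how well the model predicts the prompt has no effect on this number; only the demonstrated answer is scored.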

Perplexity: cross-entropy in friendlier units

You'll often see perplexity reported alongside loss. Perplexity is just the exponential of the cross-entropy, and it has an intuitive reading: roughly "how many tokens is the model effectively choosing among?" A perplexity of 1 means perfect certainty about the next token; higher means more confusion. It moves in lockstep with the loss (lower is better) but is sometimes easier to reason about. Don't over-index on it, though — for an SFT task what ultimately matters is the gold-set metric, not the loss or perplexity per se.
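The "choosing among N tokens" reading can be checked with a small sketch. It assumes the loss is a mean cross-entropy in natural-log units (if a tool reports loss in bits, the base would be 2 instead of e); the helper name `perplexity` and the value of `n` are illustrative.

```python
import math

def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy."""
    return math.exp(mean_cross_entropy)

# A model that spreads probability evenly over n tokens gives the correct
# token probability 1/n, so its loss is -log(1/n) = log(n).
n = 8
uniform_loss = -math.log(1 / n)
uniform_ppl = perplexity(uniform_loss)  # approximately n

# A model that is always certain and right has loss 0.
certain_ppl = perplexity(0.0)

print(f"uniform over {n} tokens: loss = {uniform_loss:.3f}, perplexity = {uniform_ppl:.3f}")
print(f"perfect certainty: perplexity = {certain_ppl:.1f}")
```

So a perplexity of about 8 means the model is, on average, as uncertain as if it were guessing uniformly among 8 tokens.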

Now you can read a training curve meaningfully. Next we tune the single knob that most controls whether that curve descends nicely: the learning rate.

Key terms

Cross-entropy
The next-token loss: the negative log of the probability the model gave the correct token.
Negative log likelihood
Another name for the same quantity; small when correct-token probability is high.
Confidence penalty
Cross-entropy punishes confident wrong answers far more than uncertain ones.
Perplexity
The exponential of cross-entropy; roughly how many tokens the model is choosing among (1 = certain).
