Why does SFT compute loss only on the completion tokens?

So the model learns to produce the answer rather than predict the user's prompt

What happens if the EOS token is left out of the loss?

The model may produce the answer and then never stop

A 'silent' SFT data bug is one that…

Still lets the loss go down but quietly hurts the result

Track 1 · SFT fundamentals · Lesson 3

Anatomy of an SFT example: prompt, completion, and the loss mask

After this lesson you can dissect an SFT training example, explain what the loss mask does and why it matters, and avoid the two classic data bugs (training on the prompt, missing the stop token).

Level: beginner · Read time: ~9 min · Prerequisites: Choosing the training objective: SFT, DPO, ORPO, RLHF

SFT trains on examples, but what exactly is one example, and how does the model know which part to learn to produce? The answer is a small idea with big consequences: the loss mask.

One example = prompt + completion

An SFT example has two parts: the prompt (the input — an instruction, maybe with context) and the completion (the target output you want). For training, they are concatenated into one token sequence — usually wrapped by a chat template (next lesson) — and fed through the model as a single stream.
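To make the concatenation concrete, here is a minimal sketch. It assumes a toy whitespace "tokenizer" (`toy_tokenize`) and a hypothetical helper `build_example`; a real pipeline would use the model's own tokenizer and chat template instead. The point is only that the two parts become one stream and that the boundary between them is worth recording.

```python
def toy_tokenize(text: str) -> list[str]:
    # Stand-in for a real tokenizer: split on whitespace.
    return text.split()


def build_example(prompt: str, completion: str) -> dict:
    # Tokenize the two parts separately so we know where the prompt ends.
    prompt_tokens = toy_tokenize(prompt)
    completion_tokens = toy_tokenize(completion)
    return {
        "tokens": prompt_tokens + completion_tokens,  # one single stream
        "prompt_len": len(prompt_tokens),             # boundary for later masking
    }


example = build_example("Translate to French: cat", "chat")
print(example["tokens"])      # ['Translate', 'to', 'French:', 'cat', 'chat']
print(example["prompt_len"])  # 4
```

Keeping `prompt_len` alongside the tokens is a design choice for this sketch: once the parts are joined, the boundary cannot be recovered from the stream alone.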

The loss mask: learn the answer, not the question

Recall that a causal language model predicts the next token at every position. If we computed the training loss over all positions, the model would spend part of its effort learning to predict the user's prompt, and for long prompts that can be most of the tokens in the example. That effort is pointless: at inference the user supplies the prompt. So SFT applies a loss mask: the loss is computed only on the completion tokens. The prompt tokens are still present as context (the model reads them via attention), but they contribute zero to the loss. This is also called label masking: masked positions get a "label" that the loss ignores.

Loss is computed only on the completion (and its end-of-sequence token). The prompt provides context but isn't a learning target.
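The masking idea can be sketched in plain Python. The sketch assumes the sentinel value -100 for masked labels (the default `ignore_index` of PyTorch's cross-entropy loss), made-up token ids, and hypothetical helpers `make_labels` and `masked_nll`. It also assumes the per-position log-probabilities are already aligned with the tokens they score, so the usual one-position shift is not shown.

```python
import math

IGNORE_INDEX = -100  # assumption: same sentinel PyTorch's CrossEntropyLoss ignores by default


def make_labels(input_ids, prompt_len):
    # Copy the ids, but replace every prompt position with the sentinel.
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


def masked_nll(token_log_probs, labels):
    # Mean negative log-likelihood over the unmasked positions only.
    kept = [-lp for lp, y in zip(token_log_probs, labels) if y != IGNORE_INDEX]
    if not kept:
        raise ValueError("every position is masked; nothing to learn from")
    return sum(kept) / len(kept)


input_ids = [11, 12, 13, 21, 22, 2]  # 3 prompt ids, 2 completion ids, EOS (id 2)
labels = make_labels(input_ids, prompt_len=3)
print(labels)  # [-100, -100, -100, 21, 22, 2]

# Pretend the model gives prompt tokens probability 0.1 and completion tokens 0.5.
log_probs = [math.log(0.1)] * 3 + [math.log(0.5)] * 3
loss = masked_nll(log_probs, labels)
print(round(loss, 3))  # 0.693, i.e. ln 2: only the completion positions count
```

Without the mask, the same example would average to about 1.50, with most of that number coming from the prompt positions the model never has to produce.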

Don't forget the stop token

The completion should end with the end-of-sequence (EOS) token, and the loss must include it. That is how the model learns when to stop. Omit it and your fine-tuned model may generate the right answer and then keep rambling until it hits the generation length limit, because it never learned to emit a stop token.
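A minimal sketch of that rule, assuming a hypothetical EOS id of 2, the -100 ignore sentinel, and an illustrative helper `build_with_eos`: the EOS is appended to the sequence and left unmasked in the labels, so it is a learning target like any other completion token.

```python
EOS_ID = 2           # hypothetical id; the real value comes from the tokenizer
IGNORE_INDEX = -100  # assumption: sentinel the loss function skips


def build_with_eos(prompt_ids, completion_ids):
    # Append EOS to the sequence and keep it unmasked in the labels.
    input_ids = prompt_ids + completion_ids + [EOS_ID]
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids + [EOS_ID]
    return input_ids, labels


input_ids, labels = build_with_eos([11, 12, 13], [21, 22])
print(input_ids)  # [11, 12, 13, 21, 22, 2]
print(labels)     # [-100, -100, -100, 21, 22, 2]
```

Note that the EOS must be present in both lists: an EOS that appears in `input_ids` but is masked in `labels` is still missing from the loss.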

Multi-turn conversations

For a multi-turn chat example, the same rule generalizes: mask everything except the assistant turns. The system prompt and user turns are context; only the assistant's responses are learning targets. Frameworks and chat templates handle this for you when configured correctly — which is exactly why the next lesson on chat templates matters.
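The multi-turn rule can be sketched the same way. This example assumes each turn arrives as a `(role, token_ids)` pair with made-up ids, and that each assistant turn already ends with a hypothetical EOS id of 2. The helper `mask_conversation` is illustrative; real chat templates also insert role markers, which are left out here.

```python
IGNORE_INDEX = -100  # assumption: sentinel the loss function skips


def mask_conversation(turns):
    # turns: list of (role, token_ids); only assistant tokens become targets.
    input_ids, labels = [], []
    for role, ids in turns:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)
        else:
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels


turns = [
    ("system", [1, 5]),
    ("user", [11, 12]),
    ("assistant", [21, 2]),      # 2 is the hypothetical EOS id
    ("user", [13]),
    ("assistant", [22, 23, 2]),
]
input_ids, labels = mask_conversation(turns)
print(labels)  # [-100, -100, -100, -100, 21, 2, -100, 22, 23, 2]
```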

Two classic data bugs

Training on the prompt (no mask) dilutes the signal and wastes capacity. Missing the EOS in the loss produces a model that won't stop. Both pass silently — the loss still goes down — and both quietly hurt the result.
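Because neither bug shows up in the loss curve, it can help to check the data directly before training. The sketch below is one possible check for single-turn examples, not a standard tool. The function `check_example`, the EOS id of 2, and the -100 sentinel are all assumptions for illustration.

```python
IGNORE_INDEX = -100  # assumption: sentinel the loss function skips
EOS_ID = 2           # hypothetical id; the real value comes from the tokenizer


def check_example(input_ids, labels, prompt_len):
    # Return a list of human-readable problems; an empty list means no issue found.
    problems = []
    if len(input_ids) != len(labels):
        problems.append("length mismatch")
    if any(y != IGNORE_INDEX for y in labels[:prompt_len]):
        problems.append("prompt tokens are unmasked")
    if not labels or labels[-1] != EOS_ID:
        problems.append("EOS missing from the loss")
    return problems


good = check_example([11, 12, 21, 2], [-100, -100, 21, 2], prompt_len=2)
no_mask = check_example([11, 12, 21, 2], [11, 12, 21, 2], prompt_len=2)
no_eos = check_example([11, 12, 21], [-100, -100, 21], prompt_len=2)
print(good)     # []
print(no_mask)  # ['prompt tokens are unmasked']
print(no_eos)   # ['EOS missing from the loss']
```

Running a check like this over a sample of the dataset costs little and catches both problems before any compute is spent.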

Key terms

Prompt: The input part of an SFT example (instruction, optionally with context).
Completion: The target output the model should learn to produce.
Loss mask / label masking: Computing the loss only on completion tokens, so the model learns to produce the answer, not echo the prompt.
EOS token: End-of-sequence marker; including it in the loss teaches the model when to stop.
Multi-turn masking: In a dialogue, masking everything except the assistant turns.

Check yourself
