Track 1 · SFT fundamentals · Lesson 3

Anatomy of an SFT example: prompt, completion, and the loss mask

After this lesson you can dissect an SFT training example, explain what the loss mask does and why it matters, and avoid the two classic data bugs (training on the prompt, missing the stop token).

Level: beginner · Read time: ~9 min · Prerequisites: Choosing the training objective: SFT, DPO, ORPO, RLHF

SFT trains on examples, but what exactly is one example, and how does the model know which part to learn to produce? The answer is a small idea with big consequences: the loss mask.

One example = prompt + completion

An SFT example has two parts: the prompt (the input — an instruction, maybe with context) and the completion (the target output you want). For training, they are concatenated into one token sequence — usually wrapped by a chat template (next lesson) — and fed through the model as a single stream.
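The concatenation step can be sketched in a few lines. This is a minimal illustration, not a real pipeline: `toy_tokenize` is a hypothetical whitespace splitter standing in for the model's own tokenizer, and no chat template is applied.

```python
def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: splits on whitespace.
    A real pipeline would use the tokenizer that ships with the model."""
    return text.split()


def build_sequence(prompt, completion):
    """Concatenate prompt and completion tokens into one stream.

    Returns the token list and the prompt length, which marks the
    boundary between context and learning target.
    """
    prompt_tokens = toy_tokenize(prompt)
    completion_tokens = toy_tokenize(completion)
    return prompt_tokens + completion_tokens, len(prompt_tokens)


tokens, prompt_len = build_sequence("Classify: I loved it →", "positive")
print(tokens)
print(prompt_len)
```

The model sees `tokens` as a single stream; `prompt_len` is the only extra bookkeeping needed to tell the two parts apart afterwards.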

The loss mask: learn the answer, not the question

Recall that a causal language model predicts the next token at every position. If we computed the training loss over all positions, part of the training signal (often a large part, when prompts are longer than completions) would go toward predicting the user's prompt, which is pointless: at inference the user supplies the prompt. So SFT applies a loss mask: the loss is computed only on the completion tokens. The prompt tokens are still present as context (the model reads them via attention), but they contribute zero to the loss. This is also called label masking: masked positions get a "label" that the loss ignores.

Figure: the sequence "Classify: I loved it → positive <eos>". The prompt ("Classify: I loved it →") is masked, with no loss; the completion ("positive <eos>") is where the loss is computed.
Loss is computed only on the completion (and its end-of-sequence token). The prompt provides context but isn't a learning target.
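To make the mask concrete, here is a minimal sketch in plain Python. It assumes the common convention of marking ignored positions with the label value -100 (the default `ignore_index` of PyTorch's `CrossEntropyLoss`); the token ids and per-token probabilities are made up for illustration, and `token_log_probs[i]` is assumed to already be the log-probability the model gave to the correct token at position `i`.

```python
import math

IGNORE_INDEX = -100  # label value meaning "skip this position in the loss"


def make_labels(input_ids, prompt_len):
    """Copy input_ids, replacing the prompt positions with IGNORE_INDEX."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])


def mean_nll(token_log_probs, labels):
    """Mean negative log-likelihood over positions whose label is not ignored."""
    kept = [-lp for lp, label in zip(token_log_probs, labels)
            if label != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Hypothetical ids: five prompt tokens, one answer token (21), then EOS (2).
input_ids = [11, 12, 13, 14, 15, 21, 2]
prompt_len = 5
labels = make_labels(input_ids, prompt_len)

# Made-up probabilities the model assigned to each correct token.
token_log_probs = [math.log(0.1)] * 5 + [math.log(0.5), math.log(0.25)]

masked_loss = mean_nll(token_log_probs, labels)       # completion only
unmasked_loss = mean_nll(token_log_probs, input_ids)  # every position counts
print(labels)
print(masked_loss, unmasked_loss)
```

In this toy case five of the seven positions are prompt tokens, so without the mask most of the averaged loss would come from predicting the prompt rather than the answer.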

Don't forget the stop token

The completion should end with the end-of-sequence (EOS) token, and the loss must include it. That is how the model learns when to stop. Omit it and your fine-tuned model may generate the right answer and then ramble forever, because it never learned to emit a stop.
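One way to guard against this is to append the EOS when building each example and keep it among the loss targets. The sketch below assumes a hypothetical `EOS_ID` of 2 and the -100 ignore-label convention; real ids come from the model's tokenizer.

```python
EOS_ID = 2           # hypothetical EOS token id; take the real one from the tokenizer
IGNORE_INDEX = -100  # label value the loss skips


def build_example(prompt_ids, completion_ids):
    """Return (input_ids, labels) with EOS appended and included in the loss."""
    completion_ids = list(completion_ids)
    # Append EOS only if the completion does not already end with it.
    if not completion_ids or completion_ids[-1] != EOS_ID:
        completion_ids.append(EOS_ID)
    input_ids = list(prompt_ids) + completion_ids
    # Prompt positions are masked; completion positions, EOS included, are targets.
    labels = [IGNORE_INDEX] * len(prompt_ids) + completion_ids
    return input_ids, labels


input_ids, labels = build_example([11, 12, 13], [21])
print(input_ids)
print(labels)
```

Because the last label is the EOS id rather than the ignore value, the model is trained to predict "stop" right after the answer.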

Multi-turn conversations

For a multi-turn chat example, the same rule generalizes: mask everything except the assistant turns. The system prompt and user turns are context; only the assistant's responses are learning targets. Frameworks and chat templates handle this for you when configured correctly — which is exactly why the next lesson on chat templates matters.
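The multi-turn rule can be sketched as follows. This is a simplified illustration under stated assumptions: turns arrive as already-tokenized `(role, token_ids)` pairs, the ids are made up, and the role markers a real chat template inserts are left out.

```python
IGNORE_INDEX = -100  # label value the loss skips


def mask_conversation(turns):
    """Build (input_ids, labels) where only assistant tokens are loss targets.

    `turns` is a list of (role, token_ids) pairs in conversation order.
    """
    input_ids, labels = [], []
    for role, ids in turns:
        input_ids.extend(ids)
        if role == "assistant":
            labels.extend(ids)                        # learning target
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # context only
    return input_ids, labels


# Hypothetical token ids; 2 plays the role of EOS at the end of each reply.
conversation = [
    ("system", [10, 11]),
    ("user", [12, 13]),
    ("assistant", [20, 2]),
    ("user", [14]),
    ("assistant", [21, 22, 2]),
]
input_ids, labels = mask_conversation(conversation)
print(labels)
```

Every assistant reply, including its end marker, stays in the labels, while the system and user turns are reduced to context.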

Two classic data bugs

Training on the prompt (no mask) dilutes the signal and wastes capacity. Missing the EOS in the loss produces a model that won't stop. Both pass silently — the loss still goes down — and both quietly hurt the result.
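Since neither bug shows up in the loss curve, a cheap check on the prepared data is one way to catch them before training. The helper below is a hypothetical sketch (`check_example` is not part of any library) and assumes the -100 ignore-label convention and a known prompt length per example.

```python
IGNORE_INDEX = -100  # label value the loss skips


def check_example(input_ids, labels, prompt_len, eos_id):
    """Return a list of problems found in one prepared SFT example."""
    problems = []
    if len(labels) != len(input_ids):
        problems.append("labels and input_ids differ in length")
    # Bug 1: some prompt position still carries a real label.
    if any(label != IGNORE_INDEX for label in labels[:prompt_len]):
        problems.append("prompt tokens are not masked")
    # Bug 2: the final loss target is not the EOS token.
    if not labels or labels[-1] != eos_id:
        problems.append("EOS is missing from the loss targets")
    return problems


good = check_example([11, 12, 21, 2], [-100, -100, 21, 2], prompt_len=2, eos_id=2)
no_mask = check_example([11, 12, 21, 2], [11, 12, 21, 2], prompt_len=2, eos_id=2)
no_eos = check_example([11, 12, 21], [-100, -100, 21], prompt_len=2, eos_id=2)
print(good, no_mask, no_eos)
```

Running a check like this over a sample of the dataset, and decoding a few examples by eye, is usually enough to confirm the mask and the stop token are where they should be.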

Key terms

Prompt
The input part of an SFT example (instruction, optionally with context).
Completion
The target output the model should learn to produce.
Loss mask / label masking
Computing the loss only on completion tokens, so the model learns to produce the answer, not echo the prompt.
EOS token
End-of-sequence marker; including it in the loss teaches the model when to stop.
Multi-turn masking
In a dialogue, masking everything except the assistant turns.
