Why must training and inference use the same chat template?

Because a format mismatch pushes the model off its training distribution and silently wrecks output quality.

What is the right way to format chat data?

Use the tokenizer's own apply_chat_template method rather than building the string by hand.

What does add_generation_prompt=True do?

It appends the assistant-turn opener so the model knows to start responding.

Track 1 · SFT fundamentals · Lesson 4

Chat templates & special tokens

After this lesson you can explain why chat formatting is a correctness issue, use the tokenizer's own chat template, and recognize the silent failure that a template mismatch causes.

Level: beginner · Read time: ~9 min · Prerequisites: Anatomy of an SFT example: prompt, completion, and the loss mask

An instruct model wasn't trained on bare text — it was trained on conversations formatted in a very specific way, with special tokens marking who is speaking. If you feed it a different format, you push it off the data it understands, and quality collapses. This is one of the most common silent failures in fine-tuning, and it's entirely avoidable.

Roles and special tokens

A chat is a list of messages, each with a role: system (standing instructions), user (the human), and assistant (the model). To turn that list into a single token stream, the model's training used special tokens — control tokens in the vocabulary that aren't ordinary text — to delimit each turn. Different model families use different markers; one common style looks like:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Classify: I loved it<|im_end|>
<|im_start|>assistant
positive<|im_end|>
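To make the layout concrete, here is a minimal Python sketch that renders a messages list into the style shown above. The function name `render_chatml_style` is hypothetical, and the format is only the illustrative one from this lesson. In a real tokenizer each marker is a single control token, and the tokenizer's own template, not a hand-written function like this, is the source of truth.

```python
# Illustration only: a hand-written renderer for the <|im_start|>/<|im_end|>
# style shown above. Real model families may use different markers.

def render_chatml_style(messages):
    """Wrap each message in turn delimiters and join them into one string."""
    parts = []
    for message in messages:
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Classify: I loved it"},
    {"role": "assistant", "content": "positive"},
]

print(render_chatml_style(messages))
```

The point of the sketch is the data shape: a list of role/content pairs goes in, one delimited string comes out.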

The chat template does this for you

An instruct model's tokenizer normally ships with a chat template: a rule that turns a messages list into exactly the formatted string that model expects. In Hugging Face transformers you call tokenizer.apply_chat_template(messages, tokenize=False) and get the correct string with the right special tokens, every time (with the default tokenize=True you get token IDs instead of a string). The cardinal rule:

Key idea

Use the model's own chat template for both training and inference, and never hand-format. The format you train on must match the format you serve on, and both must match the format the model you are starting from was trained with.
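A minimal sketch of the rule on the training side, assuming the Hugging Face `transformers` library (the lesson does not name a library, so that is an assumption). The helper names and the `model_id` argument are hypothetical; `AutoTokenizer.from_pretrained`, the `chat_template` attribute, and `apply_chat_template(..., tokenize=False)` are existing parts of that library.

```python
def load_tokenizer(model_id):
    """Load the tokenizer that ships with the model being fine-tuned."""
    # Imported lazily so the helper below works with any object that
    # exposes apply_chat_template. Assumes `pip install transformers`.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id)


def render_training_text(tokenizer, messages):
    """Render a full conversation (assistant reply included) for SFT."""
    # Tokenizers without a template (for example, many base models) have
    # chat_template set to None; fail loudly instead of guessing a format.
    if getattr(tokenizer, "chat_template", None) is None:
        raise ValueError("tokenizer has no chat template; do not hand-format")
    # tokenize=False returns the formatted string instead of token IDs.
    return tokenizer.apply_chat_template(messages, tokenize=False)


# Usage (hypothetical model id; substitute your own):
# tokenizer = load_tokenizer("your-org/your-instruct-model")
# text = render_training_text(tokenizer, messages)
```

Raising an error when no template is present is a design choice for this sketch: a loud failure is easier to debug than a quietly invented format.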

The generation prompt

At inference you pass the system + user messages and set add_generation_prompt=True. That appends the opening of the assistant turn (e.g. <|im_start|>assistant) so the model knows it's now its turn to speak and starts generating the response. During training you don't add it — the assistant turn is already present as the completion.
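The inference side can be sketched the same way, again assuming a Hugging Face style tokenizer object. The function names are hypothetical. The second helper checks a property that holds for many templates but not all of them: the served prompt should be a prefix of the string the model was trained on. Some templates insert extra content into the assistant turn, so treat a `False` result as a cue to inspect the two strings, not as proof of a bug.

```python
def render_inference_prompt(tokenizer, prompt_messages):
    """Render the system + user turns and open the assistant turn."""
    return tokenizer.apply_chat_template(
        prompt_messages,
        tokenize=False,
        add_generation_prompt=True,  # append the assistant-turn opener
    )


def prompt_matches_training(tokenizer, prompt_messages, assistant_message):
    """Return True if the served prompt is a prefix of the training string."""
    prompt = render_inference_prompt(tokenizer, prompt_messages)
    full = tokenizer.apply_chat_template(
        prompt_messages + [assistant_message],
        tokenize=False,  # no generation prompt: the reply is already present
    )
    return full.startswith(prompt)
```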

Where templates meet the loss mask

Combine this with the previous lesson: the chat template lays out the full conversation, and the loss mask is set so that only the assistant content (plus its end token) is a learning target. Good tooling derives the mask from the template automatically — another reason to let the template do the work rather than building strings by hand.
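One way to derive the mask from the template, sketched under stated assumptions: a Hugging Face style tokenizer, a template where the prompt tokens form a prefix of the full sequence, and `-100` as the ignored label (the default `ignore_index` of PyTorch's cross-entropy loss). The function name is hypothetical, and training libraries typically do this for you.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy by default


def build_masked_example(tokenizer, prompt_messages, assistant_message):
    """Tokenize a conversation and mask everything except the assistant turn."""
    prompt_text = tokenizer.apply_chat_template(
        prompt_messages, tokenize=False, add_generation_prompt=True
    )
    full_text = tokenizer.apply_chat_template(
        prompt_messages + [assistant_message], tokenize=False
    )
    # The template already inserted the special tokens, so do not add more.
    prompt_ids = tokenizer(prompt_text, add_special_tokens=False)["input_ids"]
    input_ids = tokenizer(full_text, add_special_tokens=False)["input_ids"]

    # Guard the prefix assumption instead of silently mis-masking.
    if input_ids[: len(prompt_ids)] != prompt_ids:
        raise ValueError("prompt tokens are not a prefix of the full sequence")

    # Prompt positions are ignored; the remaining positions are targets.
    labels = [IGNORE_INDEX] * len(prompt_ids) + input_ids[len(prompt_ids):]
    return {"input_ids": input_ids, "labels": labels}
```

Note that this keeps everything the template emits after the prompt as a target, which for some templates includes a trailing newline after the end token; whether to mask that is a judgment call.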

The silent failure

A template mismatch rarely errors. You'll see a plausible training loss and then garbage or oddly formatted outputs at inference. If a freshly fine-tuned model behaves strangely, check the chat template first — it's the usual culprit.
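Because nothing raises an error, it helps to compare the prompt your serving code actually sends against what the template produces. The sketch below is one hypothetical way to do that; the function names are made up for illustration, and `served_prompt` stands for whatever string your inference path builds.

```python
def first_mismatch(expected, actual):
    """Return the index of the first differing character, or None if equal."""
    for i, (e, a) in enumerate(zip(expected, actual)):
        if e != a:
            return i
    if len(expected) != len(actual):
        return min(len(expected), len(actual))
    return None


def check_served_prompt(tokenizer, prompt_messages, served_prompt):
    """Compare the prompt your serving code builds against the template."""
    expected = tokenizer.apply_chat_template(
        prompt_messages, tokenize=False, add_generation_prompt=True
    )
    index = first_mismatch(expected, served_prompt)
    if index is None:
        return "match"
    return (
        f"mismatch at char {index}: "
        f"expected {expected[index:index + 30]!r}, "
        f"got {served_prompt[index:index + 30]!r}"
    )
```

Running a check like this on a handful of real requests is a cheap first step when a fine-tuned model behaves strangely.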

Key terms

Chat template: A tokenizer rule that converts a messages list into the exact formatted string the model expects.
Role: Who is speaking: system, user, or assistant.
Special/control tokens: Non-text tokens (e.g. turn delimiters) the model was trained with.
apply_chat_template: The tokenizer method that renders a messages list into the correctly formatted string (or its token IDs).
add_generation_prompt: At inference, appends the assistant-turn opener so the model starts responding.
