Tokenize and collate: model-ready batches with a loss mask
After this lesson you can tokenize SFT examples into input_ids and labels, mask the prompt tokens with -100 so the loss is computed only on the completion, and collate them into padded batches.
This is the lesson where Track 1's theory becomes code. We'll convert each (prompt, completion) pair into two aligned token sequences — input_ids (what the model reads) and labels (what it's scored against) — with the prompt portion of the labels set to -100, the special value that tells the loss to ignore those positions. That is the loss mask, made real.
Build input_ids and masked labels
We render the full conversation (prompt + completion) for input_ids, and separately render the prompt only (with the generation prompt) to know how many leading tokens to mask.
MAX_LEN = 256

def tokenize(row):
    full_msgs = [
        {"role": "user", "content": row["prompt"]},
        {"role": "assistant", "content": row["completion"]},
    ]
    prompt_msgs = [{"role": "user", "content": row["prompt"]}]

    full = tok.apply_chat_template(full_msgs, tokenize=False)
    prompt = tok.apply_chat_template(prompt_msgs, tokenize=False, add_generation_prompt=True)

    input_ids = tok(full, truncation=True, max_length=MAX_LEN, add_special_tokens=False)["input_ids"]
    prompt_ids = tok(prompt, add_special_tokens=False)["input_ids"]

    labels = list(input_ids)
    n_prompt = min(len(prompt_ids), len(labels))
    for i in range(n_prompt):
        labels[i] = -100  # mask the prompt: no loss here

    return {"input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            "labels": labels}

train_tok = train_ds.map(tokenize, remove_columns=train_ds.column_names)
val_tok = val_ds.map(tokenize, remove_columns=val_ds.column_names)
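Masking by count assumes the prompt-only tokens are an exact prefix of the full sequence's tokens. That usually holds with chat templates, but it is worth checking rather than trusting. Here is a minimal sketch of the same masking step with that check added, on toy token IDs; `mask_prompt` and its `strict` flag are hypothetical names for illustration, not part of transformers.

```python
IGNORE_INDEX = -100  # label value the loss skips


def mask_prompt(input_ids, prompt_ids, strict=True):
    """Return labels: a copy of input_ids with the prompt prefix set to -100.

    Hypothetical helper for illustration. With strict=True it raises if
    prompt_ids is not a prefix of input_ids (up to the truncated length),
    which would mean the mask boundary is in the wrong place.
    """
    n_prompt = min(len(prompt_ids), len(input_ids))
    if strict and list(input_ids[:n_prompt]) != list(prompt_ids[:n_prompt]):
        raise ValueError("prompt tokens are not a prefix of the full sequence")
    return [IGNORE_INDEX] * n_prompt + list(input_ids[n_prompt:])


# Toy IDs: the first three tokens are the prompt, the last two the completion.
labels = mask_prompt([11, 12, 13, 21, 22], [11, 12, 13])
print(labels)  # [-100, -100, -100, 21, 22]
```

Inside `tokenize`, the same comparison would be `input_ids[:n_prompt] == prompt_ids[:n_prompt]`; if it ever fails on your data, the number of masked tokens cannot be trusted for that row.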
-100 is the loss mask
PyTorch's cross-entropy loss skips every position whose label equals its ignore_index, which defaults to -100. By copying input_ids into labels and then overwriting the prompt positions with -100, we compute loss only on the completion: exactly the masking you learned in Track 1, now in a few lines of code.
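To make "ignored" concrete, here is a pure-Python sketch of a masked average negative log-likelihood. It is an illustration of the idea, not PyTorch's implementation; `masked_mean_nll` is a hypothetical name and the probabilities are made-up toy numbers.

```python
import math

IGNORE_INDEX = -100


def masked_mean_nll(token_probs, labels, ignore_index=IGNORE_INDEX):
    """Average -log(p) over positions whose label is not ignore_index.

    token_probs[i] is the probability assigned to the correct token at
    position i (toy values here, not real model output).
    """
    kept = [-math.log(p) for p, y in zip(token_probs, labels) if y != ignore_index]
    if not kept:
        return float("nan")  # nothing supervised: this sketch returns NaN
    return sum(kept) / len(kept)


probs = [0.01, 0.02, 0.5, 0.25]  # very low probability on the two prompt positions
labels = [-100, -100, 7, 8]      # ...but those positions are masked

loss = masked_mean_nll(probs, labels)
print(round(loss, 4))  # 1.0397
```

The two prompt positions would have contributed large penalties, but they never enter the average: the result depends only on the two completion positions.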
Inspect one example
Always look at a tokenized row before training. Confirm the completion tokens are the ones not masked.
ex = train_tok[0]
print("input_ids:", ex["input_ids"][:12], "...")
print("labels :", ex["labels"][:12], "...") # leading entries should be -100
n_supervised = sum(1 for x in ex["labels"] if x != -100)
print("supervised (completion) tokens:", n_supervised)
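Counting supervised tokens tells you how many positions are scored, but not which ones. A small sketch that splits a row by its mask so each half can be inspected separately; `split_by_mask` is a hypothetical helper and the row is toy data standing in for `train_tok[0]`.

```python
def split_by_mask(input_ids, labels, ignore_index=-100):
    """Split input_ids into (masked_ids, supervised_ids) using the labels."""
    masked = [t for t, y in zip(input_ids, labels) if y == ignore_index]
    supervised = [t for t, y in zip(input_ids, labels) if y != ignore_index]
    return masked, supervised


# Toy row: three prompt tokens followed by two completion tokens.
row = {"input_ids": [11, 12, 13, 21, 22], "labels": [-100, -100, -100, 21, 22]}
masked_ids, supervised_ids = split_by_mask(row["input_ids"], row["labels"])
print(masked_ids)      # [11, 12, 13]
print(supervised_ids)  # [21, 22]
```

With the real tokenizer, `tok.decode(supervised_ids)` should read as the completion and nothing from the prompt.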
Collate into padded batches
Examples have different lengths; a batch must be rectangular. DataCollatorForSeq2Seq pads input_ids with the pad token, the attention_mask with 0, and — crucially — labels with -100, so padding never contributes to the loss either.
import torch
from transformers import DataCollatorForSeq2Seq

collator = DataCollatorForSeq2Seq(tok, model=model, padding=True)

# peek at one batch (the collator returns PyTorch tensors by default)
batch = collator([train_tok[0], train_tok[1]])
print({k: tuple(v.shape) for k, v in batch.items()})
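To see what the padding step amounts to, here is a simplified pure-Python sketch. It assumes right-padding and plain lists; the real collator follows the tokenizer's padding side and returns tensors. `pad_batch` is a hypothetical name and `pad_token_id=0` is an arbitrary toy value.

```python
def pad_batch(rows, pad_token_id, label_pad_id=-100):
    """Right-pad every row to the longest length in the batch."""
    max_len = max(len(r["input_ids"]) for r in rows)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for r in rows:
        n_pad = max_len - len(r["input_ids"])
        batch["input_ids"].append(r["input_ids"] + [pad_token_id] * n_pad)
        batch["attention_mask"].append(r["attention_mask"] + [0] * n_pad)
        batch["labels"].append(r["labels"] + [label_pad_id] * n_pad)
    return batch


rows = [
    {"input_ids": [11, 12, 21], "attention_mask": [1, 1, 1],
     "labels": [-100, -100, 21]},
    {"input_ids": [11, 12, 13, 21, 22], "attention_mask": [1] * 5,
     "labels": [-100, -100, -100, 21, 22]},
]
batch = pad_batch(rows, pad_token_id=0)
print(batch["labels"][0])  # [-100, -100, 21, -100, -100]
```

In the shorter row, -100 now appears on both sides of the single supervised token: the leading entries mask the prompt and the trailing ones mask the padding.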
You'll see input_ids, attention_mask, and labels all padded to the same length, ready for the model. Two safety checks before moving on: confirm completions aren't being truncated (raise MAX_LEN if your data is longer), and confirm the leading label positions are -100. With model-ready batches, we can finally train — by attaching LoRA and handing everything to the Trainer.
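The truncation check can be run over the whole tokenized set rather than by eye. A rough sketch, assuming rows shaped like the output of `tokenize`; `find_risky_rows` is a hypothetical helper, and treating "length reached max_len" as a sign of truncation is a heuristic, not proof.

```python
def find_risky_rows(rows, max_len, ignore_index=-100):
    """Return indices of rows that may have lost completion tokens.

    A row is flagged if it has no supervised labels at all, or if its
    length reaches max_len (it was probably cut off by truncation).
    """
    risky = []
    for i, r in enumerate(rows):
        n_supervised = sum(1 for y in r["labels"] if y != ignore_index)
        if n_supervised == 0 or len(r["input_ids"]) >= max_len:
            risky.append(i)
    return risky


rows = [
    # fits comfortably
    {"input_ids": [11, 12, 21, 22], "labels": [-100, -100, 21, 22]},
    # the prompt filled the whole window, nothing left to supervise
    {"input_ids": [11, 12, 13, 14, 15], "labels": [-100] * 5},
    # reached max_len, so the completion may be cut short
    {"input_ids": [11, 12, 21, 22, 23], "labels": [-100, -100, 21, 22, 23]},
]
print(find_risky_rows(rows, max_len=5))  # [1, 2]
```

On the real data this would be called as `find_risky_rows(train_tok, MAX_LEN)`; a long list of flagged indices suggests raising MAX_LEN.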
Key terms
- input_ids: The token IDs the model reads, from the full prompt+completion.
- labels: Targets for the loss; a copy of input_ids with masked positions set to -100.
- -100 (ignore index): The label value PyTorch cross-entropy ignores, used to mask prompt and padding.
- attention_mask: 1 for real tokens, 0 for padding.
- .map(): Applies the tokenize function across the whole Dataset.
- DataCollatorForSeq2Seq: Pads input_ids, attention_mask, and labels (labels with -100) into rectangular batches.