Track 2 · Hands-on · Lesson 4

Tokenize and collate: model-ready batches with a loss mask

After this lesson you can tokenize SFT examples into input_ids and labels, mask the prompt tokens with -100 so the loss is computed only on the completion, and collate them into padded batches.

Level: intermediate · Read time: ~11 min · Prerequisites: Build a tiny SFT dataset

This is the lesson where Track 1's theory becomes code. We'll convert each (prompt, completion) pair into two aligned token sequences — input_ids (what the model reads) and labels (what it's scored against) — with the prompt portion of the labels set to -100, the special value that tells the loss to ignore those positions. That is the loss mask, made real.

Build input_ids and masked labels

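We render the full conversation (prompt + completion) to get input_ids, and separately render the prompt alone, with the generation prompt appended, to count how many leading tokens to mask. Including the generation prompt means the assistant-turn header is masked along with the user message.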

MAX_LEN = 256

def tokenize(row):
    full_msgs = [
        {"role": "user", "content": row["prompt"]},
        {"role": "assistant", "content": row["completion"]},
    ]
    prompt_msgs = [{"role": "user", "content": row["prompt"]}]

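    # Render both versions as text; tok is the tokenizer set up earlier in the track.
    full = tok.apply_chat_template(full_msgs, tokenize=False)
    # add_generation_prompt=True appends the assistant header, so it is masked too.
    prompt = tok.apply_chat_template(prompt_msgs, tokenize=False, add_generation_prompt=True)

    # The chat template already inserted special tokens, so don't add them again.
    input_ids = tok(full, truncation=True, max_length=MAX_LEN, add_special_tokens=False)["input_ids"]
    prompt_ids = tok(prompt, add_special_tokens=False)["input_ids"]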

    labels = list(input_ids)
    n_prompt = min(len(prompt_ids), len(labels))
    for i in range(n_prompt):
        labels[i] = -100                      # mask the prompt: no loss here

    return {"input_ids": input_ids,
            "attention_mask": [1] * len(input_ids),
            "labels": labels}

train_tok = train_ds.map(tokenize, remove_columns=train_ds.column_names)
val_tok = val_ds.map(tokenize, remove_columns=val_ds.column_names)
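Masking by count assumes that the prompt-only rendering tokenizes to exactly the same leading IDs as the full rendering. That is expected to hold for typical chat templates, but it is cheap to verify. The sketch below uses a hypothetical helper, is_token_prefix, and made-up token IDs in place of real tokenizer output.

```python
def is_token_prefix(prompt_ids, input_ids):
    """Return True if prompt_ids matches the start of input_ids exactly."""
    n = len(prompt_ids)
    return n <= len(input_ids) and list(input_ids[:n]) == list(prompt_ids)

# Toy IDs standing in for real tokenizer output (values are made up).
prompt_ids = [1, 15, 27, 9]
input_ids = [1, 15, 27, 9, 40, 41, 2]

print(is_token_prefix(prompt_ids, input_ids))   # True
print(is_token_prefix([1, 15, 28], input_ids))  # False: third ID differs
```

A False result on real data would mean the first len(prompt_ids) labels do not correspond to the prompt, either because the two renderings tokenize differently or because truncation cut the row short of the full prompt.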

-100 is the loss mask

PyTorch's cross-entropy loss skips any position whose label equals its ignore_index, which defaults to -100. By copying input_ids into labels and then overwriting the prompt positions with -100, we compute loss only on the completion: exactly the masking you learned in Track 1, now in three lines.
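To see what "ignoring" a position means numerically, here is a plain-Python sketch of a mean cross-entropy that skips -100 labels. It is an illustration only, not PyTorch's implementation: masked_cross_entropy is a hypothetical helper, the logits are made up, and the one-position shift that causal language models apply between logits and labels is left out.

```python
import math

IGNORE_INDEX = -100

def masked_cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    logits: list of per-position score lists (one score per vocabulary entry)
    labels: list of target IDs, with ignore_index at masked positions
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, labels):
        if target == ignore_index:
            continue  # masked position: contributes nothing to the loss
        log_norm = math.log(sum(math.exp(s) for s in scores))
        total += log_norm - scores[target]  # equals -log softmax(scores)[target]
        count += 1
    return total / count if count else float("nan")

# Three positions, vocabulary of size 3; the first position is "prompt".
logits = [[2.0, 0.5, 0.1], [0.2, 1.5, 0.3], [0.1, 0.2, 3.0]]
masked = masked_cross_entropy(logits, [-100, 1, 2])
completion_only = masked_cross_entropy(logits[1:], [1, 2])
print(math.isclose(masked, completion_only))  # True
```

The masked loss over all three positions matches the loss computed on the two completion positions alone, which is the point of the mask: the prompt is still read by the model, but it is never scored.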

Inspect one example

Always look at a tokenized row before training. Confirm that the leading labels are -100 and that the unmasked labels are the completion tokens.

ex = train_tok[0]
print("input_ids:", ex["input_ids"][:12], "...")
print("labels   :", ex["labels"][:12], "...")     # leading entries should be -100
n_supervised = sum(1 for x in ex["labels"] if x != -100)
print("supervised (completion) tokens:", n_supervised)
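Beyond counting supervised tokens, it can help to confirm that the mask has the expected shape: one leading run of -100 followed only by real labels. The following is a minimal sketch with a hypothetical helper, describe_mask, run on made-up labels.

```python
IGNORE_INDEX = -100

def describe_mask(labels, ignore_index=IGNORE_INDEX):
    """Summarize one row's labels: leading masked run, supervised count, layout check."""
    n_leading = 0
    for x in labels:
        if x != ignore_index:
            break
        n_leading += 1
    n_supervised = sum(1 for x in labels if x != ignore_index)
    # With prompt-only masking, nothing after the leading run should be masked.
    prefix_only = all(x != ignore_index for x in labels[n_leading:])
    return {"masked_prefix": n_leading, "supervised": n_supervised, "prefix_only": prefix_only}

print(describe_mask([-100, -100, -100, 40, 41, 2]))
# {'masked_prefix': 3, 'supervised': 3, 'prefix_only': True}
```

On real data you would pass ex["labels"]. A row with supervised equal to 0 is fully masked and teaches the model nothing.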

Collate into padded batches

Examples have different lengths, but a batch must be rectangular. DataCollatorForSeq2Seq pads input_ids with the pad token, attention_mask with 0, and, crucially, labels with -100 (its default label_pad_token_id), so padding never contributes to the loss either.

from transformers import DataCollatorForSeq2Seq

collator = DataCollatorForSeq2Seq(tok, model=model, padding=True)

# peek at one batch; the collator returns PyTorch tensors by default, so v.shape works
batch = collator([train_tok[0], train_tok[1]])
print({k: tuple(v.shape) for k, v in batch.items()})
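To make the padding behaviour concrete, here is a plain-Python sketch of what collation does to a list of rows. It assumes right padding and a made-up pad ID of 0; pad_batch is a hypothetical helper, whereas the real collator takes the pad token and padding side from the tokenizer and returns tensors rather than lists.

```python
IGNORE_INDEX = -100

def pad_batch(rows, pad_id, ignore_index=IGNORE_INDEX):
    """Right-pad each field of every row to the longest row in the batch."""
    width = max(len(r["input_ids"]) for r in rows)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for r in rows:
        gap = width - len(r["input_ids"])
        batch["input_ids"].append(r["input_ids"] + [pad_id] * gap)
        batch["attention_mask"].append(r["attention_mask"] + [0] * gap)
        batch["labels"].append(r["labels"] + [ignore_index] * gap)  # padding is never scored
    return batch

# Two toy rows of different lengths (token IDs are made up).
rows = [
    {"input_ids": [1, 15, 27, 40, 2], "attention_mask": [1] * 5, "labels": [-100, -100, -100, 40, 2]},
    {"input_ids": [1, 15, 41], "attention_mask": [1] * 3, "labels": [-100, -100, 41]},
]
batch = pad_batch(rows, pad_id=0)
for key, value in batch.items():
    print(key, value)
```

The shorter row gains two pad IDs, two zeros in its attention mask, and two -100 labels, so all three fields end up with the same rectangular shape.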

You'll see input_ids, attention_mask, and labels all padded to the same length, ready for the model. Two safety checks before moving on: confirm completions aren't being truncated (raise MAX_LEN if your data is longer), and confirm the leading label positions are -100. With model-ready batches, we can finally train — by attaching LoRA and handing everything to the Trainer.
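The two safety checks above can be run over a whole split rather than by eye. The sketch below uses a hypothetical helper, audit_rows, on made-up rows; it flags rows whose length reaches the limit (a sign of possible truncation, not proof of it) and rows whose labels are entirely masked.

```python
IGNORE_INDEX = -100

def audit_rows(rows, max_len, ignore_index=IGNORE_INDEX):
    """Return indices of rows that reach max_len and rows with nothing to learn from."""
    at_limit, unsupervised = [], []
    for i, row in enumerate(rows):
        if len(row["input_ids"]) >= max_len:
            at_limit.append(i)      # possibly truncated: the completion may be cut short
        if all(x == ignore_index for x in row["labels"]):
            unsupervised.append(i)  # fully masked: contributes no loss
    return {"at_limit": at_limit, "unsupervised": unsupervised}

# Toy rows: the second reaches the limit and has every label masked.
toy = [
    {"input_ids": [1, 2, 3, 4], "labels": [-100, -100, 3, 4]},
    {"input_ids": [1, 2, 3, 4, 5, 6], "labels": [-100] * 6},
]
print(audit_rows(toy, max_len=6))
# {'at_limit': [1], 'unsupervised': [1]}
```

On the lesson's data this would be called as audit_rows(train_tok, MAX_LEN), assuming that iterating the tokenized dataset yields one dict per row. Any index reported under at_limit is a candidate for raising MAX_LEN.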

Key terms

input_ids
The token IDs the model reads, from the full prompt+completion.
labels
Targets for the loss; copy of input_ids with masked positions set to -100.
-100 (ignore index)
The label value PyTorch cross-entropy ignores — used to mask prompt and padding.
attention_mask
1 for real tokens, 0 for padding.
.map()
Applies the tokenize function across the whole Dataset.
DataCollatorForSeq2Seq
Pads input_ids, attention_mask, and labels (labels with -100) into rectangular batches.
