What is the attention mask for?

Telling the model to ignore padded positions

Why can naive truncation be a silent bug?

It can clip the completion you train on, teaching partial/empty targets

Sequence packing improves training by…

Concatenating short examples to avoid wasting compute on padding

How should you choose max_seq_length?

From your data's length distribution — fit completions, avoid paying for empty context

Track 1 · SFT fundamentals · Lesson 6

Tokenization in practice: padding, truncation, packing

After this lesson you can configure max sequence length sensibly, explain padding and attention masks, use sequence packing, and avoid truncation silently dropping your completions.

Level: beginner
Read time: ~10 min
Prerequisites: Task shapes: classification, QA, extraction, summarization, chat

You met tokenization in Track 0: text → subword tokens → IDs → embeddings. Training adds a handful of practical mechanics that, done wrong, quietly corrupt your data. None are hard; all are worth getting right.

Padding and the attention mask

The GPU processes a batch of sequences at once, and a batch must be rectangular: every row the same length. Real examples vary in length, so shorter ones are padded with a special pad token up to the length of the longest sequence in the batch. To stop the model from treating padding as real content, each example carries an attention mask: 1 for real tokens, 0 for padding, telling attention to ignore the padded positions. With a correct mask, padding affects only batching efficiency, never the meaning.
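The padding step can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's API: `pad_batch` and the pad ID of 0 are hypothetical, and real tokenizers define their own pad token.

```python
from typing import List, Tuple

PAD_ID = 0  # assumed pad token ID for illustration; real tokenizers define their own


def pad_batch(batch: List[List[int]], pad_id: int = PAD_ID) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every sequence to the longest one and build the matching mask."""
    longest = max(len(seq) for seq in batch)
    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return input_ids, attention_mask


ids, mask = pad_batch([[11, 12, 13, 14], [21, 22]])
print(ids)   # [[11, 12, 13, 14], [21, 22, 0, 0]]
print(mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

The batch is now rectangular, and the mask records exactly which positions are filler.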

Truncation and max_seq_length

max_seq_length caps how many tokens an example may have. Anything longer is truncated, that is, cut to fit. This is where a real bug hides: in a prompt-plus-completion example the completion comes last, so truncating from the end chops off the completion (the part you actually train on) first, leaving the model little or nothing to learn from. Choose max_seq_length from your data's actual length distribution, and when you must truncate, truncate the prompt/context side so the completion survives.
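To make the difference concrete, here is a small sketch comparing naive end truncation with prompt-side truncation. The helper `truncate_keep_completion` is hypothetical, and dropping the oldest prompt tokens first is an assumption: depending on your data you may prefer to trim the middle of the context so that instructions at the start survive.

```python
from typing import List


def truncate_keep_completion(prompt_ids: List[int], completion_ids: List[int],
                             max_seq_length: int) -> List[int]:
    """Drop tokens from the start of the prompt so the whole completion survives."""
    if len(completion_ids) > max_seq_length:
        raise ValueError("completion alone exceeds max_seq_length; raise the cap or drop the example")
    budget = max_seq_length - len(completion_ids)  # tokens left over for the prompt
    kept_prompt = prompt_ids[-budget:] if budget > 0 else []
    return kept_prompt + completion_ids


prompt = [1, 2, 3, 4, 5, 6]
completion = [90, 91]

naive = (prompt + completion)[:5]                    # truncate from the end
safe = truncate_keep_completion(prompt, completion, 5)

print(naive)  # [1, 2, 3, 4, 5]    (completion is gone)
print(safe)   # [4, 5, 6, 90, 91]  (completion intact)
```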

Silent data bug

Truncation that clips completions lets training run normally while teaching the model partial or empty targets. Always inspect your token-length distribution and confirm completions fit before a long run.
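One way to run that inspection is to count, before training, how many examples would lose completion tokens under end truncation. The sketch below assumes you already have per-example token counts as `(prompt_len, completion_len)` pairs; the function name and the sample numbers are made up for illustration.

```python
from typing import Dict, List, Tuple


def audit_truncation(lengths: List[Tuple[int, int]], max_seq_length: int) -> Dict[str, int]:
    """Classify examples by how much of the completion survives end truncation."""
    report = {"intact": 0, "partial": 0, "lost": 0}
    for prompt_len, completion_len in lengths:
        # Completion tokens that still fit after the prompt has used its share.
        surviving = max(0, min(completion_len, max_seq_length - prompt_len))
        if surviving == completion_len:
            report["intact"] += 1
        elif surviving == 0:
            report["lost"] += 1
        else:
            report["partial"] += 1
    return report


# Hypothetical (prompt_len, completion_len) pairs.
sample = [(100, 20), (500, 30), (505, 30), (600, 10)]
print(audit_truncation(sample, max_seq_length=512))
# {'intact': 1, 'partial': 2, 'lost': 1}
```

Anything other than zero in the `partial` and `lost` buckets is a signal to raise the cap or switch to prompt-side truncation before starting a long run.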

Sequence packing

If your examples are short and max_seq_length is large, most of each padded sequence is wasted compute. Sequence packing concatenates several short examples into one full-length sequence, so the GPU does useful work instead of multiplying padding. It needs care: tokens from one example shouldn't attend to tokens from another (attention is confined within each example's boundaries), and position IDs restart at each boundary. Good trainers handle this; the payoff is a sizable throughput win on short-example datasets.
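The bookkeeping behind packing can be sketched as follows. This is a simplified greedy packer, not how any particular trainer implements it; `pack_examples`, `may_attend`, and the `segment_ids` field are hypothetical names used only to show the two rules: positions restart per example, and attention stays inside an example.

```python
from typing import Dict, List


def _empty_pack() -> Dict[str, List[int]]:
    return {"input_ids": [], "position_ids": [], "segment_ids": []}


def pack_examples(examples: List[List[int]], max_seq_length: int) -> List[Dict[str, List[int]]]:
    """Greedily concatenate examples into sequences of at most max_seq_length tokens."""
    packs: List[Dict[str, List[int]]] = []
    current = _empty_pack()
    segment = 0
    for ex in examples:
        if len(ex) > max_seq_length:
            raise ValueError("example longer than max_seq_length; truncate it first")
        if current["input_ids"] and len(current["input_ids"]) + len(ex) > max_seq_length:
            packs.append(current)  # current pack is full; start a new one
            current = _empty_pack()
            segment = 0
        current["input_ids"].extend(ex)
        current["position_ids"].extend(range(len(ex)))      # positions restart at 0
        current["segment_ids"].extend([segment] * len(ex))  # which example owns each token
        segment += 1
    if current["input_ids"]:
        packs.append(current)
    return packs


def may_attend(segment_ids: List[int], query: int, key: int) -> bool:
    """Causal attention restricted to tokens of the query's own example."""
    return key <= query and segment_ids[key] == segment_ids[query]


packs = pack_examples([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_length=6)
first = packs[0]
print(first["input_ids"])     # [1, 2, 3, 4, 5]
print(first["position_ids"])  # [0, 1, 2, 0, 1]
print(first["segment_ids"])   # [0, 0, 0, 1, 1]
print(may_attend(first["segment_ids"], query=3, key=2))  # False (different examples)
print(may_attend(first["segment_ids"], query=4, key=3))  # True  (same example, earlier token)
```

Three examples that would have needed three padded rows of length 6 now occupy two rows, with only one padding slot left in the first.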

Padding side and cost

For causal (decoder-only) models, the padding side matters at inference: typically you left-pad so that every row's real tokens end at the end of the sequence, where generation continues. And remember the cost lever: attention work grows roughly with the square of sequence length, so doubling max_seq_length roughly quadruples the attention work per step. Pick the smallest length that comfortably fits your data.
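Both points can be illustrated with a short sketch. The helper names and the pad ID of 0 are assumptions for illustration, and the cost ratio covers only the quadratic attention term; the rest of the model scales closer to linearly with length, so the real slowdown is usually smaller.

```python
from typing import List, Tuple


def left_pad_batch(batch: List[List[int]], pad_id: int = 0) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad on the left so every row's real tokens end at the last position."""
    longest = max(len(seq) for seq in batch)
    input_ids = [[pad_id] * (longest - len(seq)) + seq for seq in batch]
    attention_mask = [[0] * (longest - len(seq)) + [1] * len(seq) for seq in batch]
    return input_ids, attention_mask


def attention_cost_ratio(new_len: int, old_len: int) -> float:
    """Rough relative cost of the attention term alone: (new / old) squared."""
    return (new_len / old_len) ** 2


ids, mask = left_pad_batch([[11, 12, 13, 14], [21, 22]])
print(ids)   # [[11, 12, 13, 14], [0, 0, 21, 22]]
print(mask)  # [[1, 1, 1, 1], [0, 0, 1, 1]]

print(attention_cost_ratio(2048, 512))  # 16.0
```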

Key idea

Set max_seq_length from your data, not a round number: large enough that completions are never clipped, small enough that you're not paying for empty context.
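As a rough sketch of setting the cap from data, the function below picks the smallest length that covers a chosen fraction of your examples. Everything here is an assumption for illustration: the function name, the made-up lengths, and rounding up to a multiple of 8. A coverage of 1.0 means nothing is clipped; choosing less than 1.0 means the longest examples must be truncated on the prompt side or dropped.

```python
import math
from typing import List


def suggest_max_seq_length(total_lengths: List[int], coverage: float = 1.0, multiple: int = 8) -> int:
    """Smallest cap (rounded up to `multiple`) that fits `coverage` of the examples."""
    if not total_lengths:
        raise ValueError("need at least one length")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(total_lengths)
    index = math.ceil(coverage * len(ordered)) - 1  # last example that must still fit
    return math.ceil(ordered[index] / multiple) * multiple


# Hypothetical prompt + completion token counts, with one long outlier.
lengths = [120, 135, 150, 180, 210, 240, 260, 300, 410, 1900]

print(suggest_max_seq_length(lengths))                # 1904
print(suggest_max_seq_length(lengths, coverage=0.9))  # 416
```

Here a single outlier forces the cap from 416 to 1904. Whether it is worth paying for that extra context on every step, or handling the outlier separately, is the trade-off the key idea describes.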

Key terms

Padding: Filling shorter sequences with a pad token so a batch is rectangular.
Attention mask: A 0/1 marker telling the model to ignore padded positions.
Truncation: Cutting sequences longer than max_seq_length; truncate the prompt side to protect the completion.
max_seq_length: The cap on tokens per example; set it from the data's length distribution.
Sequence packing: Concatenating short examples into one full-length sequence to avoid wasted padding compute.
Padding side: Which end padding is added; causal models often left-pad at inference.

Check yourself
