Tokenization in practice: padding, truncation, packing
After this lesson you can configure max sequence length sensibly, explain padding and attention masks, use sequence packing, and avoid truncation silently dropping your completions.
You met tokenization in Track 0: text → subword tokens → IDs → embeddings. Training adds a handful of practical mechanics that, done wrong, quietly corrupt your data. None are hard; all are worth getting right.
Padding and the attention mask
The GPU processes a batch of sequences at once, and a batch must be rectangular: every row the same length. Real examples vary in length, so shorter ones are padded with a special pad token up to the length of the longest example in the batch. To stop the model from treating padding as real content, each example carries an attention mask: 1 for real tokens, 0 for padding, telling attention to ignore the padded positions. During training, padded positions are typically excluded from the loss as well. With a correct mask, padding affects only batching efficiency, never the meaning.
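The mechanics above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: `pad_batch` is a hypothetical helper, and the pad ID of 0 is an assumption (real tokenizers define their own pad token).

```python
PAD_ID = 0  # assumed pad token ID; a real tokenizer supplies its own


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad token-ID lists to the longest one and build 0/1 masks."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding to be ignored.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


batch_ids, batch_mask = pad_batch([[5, 6, 7, 8], [9, 10]])
print(batch_ids)   # [[5, 6, 7, 8], [9, 10, 0, 0]]
print(batch_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Every row now has the same length, and the mask records which positions carry real content.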
Truncation and max_seq_length
max_seq_length caps how many tokens an example may have. Anything longer is truncated, that is, cut to fit. This is where a real bug hides: the completion sits at the end of the prompt-plus-completion sequence, so truncating from the end can chop off the completion (the part you actually train on), leaving the model nothing to learn. Choose max_seq_length from your data's actual length distribution, and when you must truncate, truncate the prompt/context side so the completion survives.
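A minimal sketch of prompt-side truncation, assuming the prompt and completion are already tokenized into separate lists. `truncate_keep_completion` is a hypothetical helper; dropping the oldest prompt tokens is one possible policy, and your data may call for a different one (for example, preserving a system instruction at the start).

```python
def truncate_keep_completion(prompt_ids, completion_ids, max_seq_length):
    """Drop the oldest prompt tokens so prompt + completion fits the cap."""
    if len(completion_ids) > max_seq_length:
        raise ValueError("completion alone exceeds max_seq_length")
    prompt_budget = max_seq_length - len(completion_ids)
    # Slice by start index: a negative slice such as prompt_ids[-0:]
    # would keep the whole prompt when the budget is zero.
    start = max(0, len(prompt_ids) - prompt_budget)
    return prompt_ids[start:], completion_ids


prompt, completion = truncate_keep_completion(
    prompt_ids=[1, 2, 3, 4, 5, 6],
    completion_ids=[7, 8, 9],
    max_seq_length=5,
)
print(prompt)      # [5, 6]
print(completion)  # [7, 8, 9]
```

The completion is returned untouched; only the prompt shrinks. Raising an error when the completion alone is too long makes the problem visible instead of silently clipping the target.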
Silent data bug
Truncation that clips completions lets training run normally while teaching the model partial or empty targets. Always inspect your token-length distribution and confirm completions fit before a long run.
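One way to run that check before training is to count how many completions would be clipped at a given cap. The sketch below assumes you have already measured prompt and completion lengths in tokens, and that truncation cuts from the end; `audit_completions` is a hypothetical helper, not part of any trainer.

```python
def audit_completions(examples, max_seq_length):
    """Count completions clipped or lost entirely under end-truncation.

    examples: list of (prompt_len, completion_len) token counts.
    """
    clipped = 0
    empty = 0
    for prompt_len, completion_len in examples:
        room = max(0, max_seq_length - prompt_len)  # tokens left after prompt
        kept = min(completion_len, room)
        if kept == 0 and completion_len > 0:
            empty += 1      # nothing of the target survives
        elif kept < completion_len:
            clipped += 1    # only part of the target survives
    return {"total": len(examples), "clipped": clipped, "empty": empty}


report = audit_completions([(100, 50), (480, 60), (600, 40)], max_seq_length=512)
print(report)  # {'total': 3, 'clipped': 1, 'empty': 1}
```

Any nonzero `clipped` or `empty` count is a signal to raise the cap, shorten prompts, or switch to prompt-side truncation before starting a long run.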
Sequence packing
If your examples are short and max_seq_length is large, most of each padded sequence is wasted compute. Sequence packing concatenates several short examples into one full-length sequence, so the GPU does useful work instead of processing padding. It needs care: tokens in one example must not attend to tokens in another (attention is masked at example boundaries), and position counting restarts at each boundary. Good trainers handle this; the payoff is a sizable throughput win on short-example datasets.
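A toy sketch of the bookkeeping packing needs, assuming a simple greedy strategy (real trainers may use smarter bin-packing and fused attention kernels). `pack_sequences` and `may_attend` are hypothetical names: the first restarts position IDs per example and tags each token with a segment ID, and the second shows how a segment ID turns into a boundary-respecting causal mask.

```python
def _new_pack():
    return {"input_ids": [], "position_ids": [], "segment_ids": []}


def pack_sequences(examples, max_seq_length):
    """Greedy packing: fill the current pack until the next example won't fit."""
    packs = []
    current = _new_pack()
    segment = 0
    for tokens in examples:
        if len(tokens) > max_seq_length:
            raise ValueError("example longer than max_seq_length")
        if len(current["input_ids"]) + len(tokens) > max_seq_length:
            packs.append(current)
            current = _new_pack()
            segment = 0
        current["input_ids"].extend(tokens)
        # Positions restart at 0 for every packed example.
        current["position_ids"].extend(range(len(tokens)))
        # The segment ID records which example each token came from.
        current["segment_ids"].extend([segment] * len(tokens))
        segment += 1
    if current["input_ids"]:
        packs.append(current)
    return packs


def may_attend(segment_ids, query, key):
    """Causal attention restricted to tokens of the same packed example."""
    return key <= query and segment_ids[query] == segment_ids[key]


packs = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_seq_length=6)
print(packs[0]["input_ids"])     # [1, 2, 3, 4, 5]
print(packs[0]["position_ids"])  # [0, 1, 2, 0, 1]
print(packs[0]["segment_ids"])   # [0, 0, 0, 1, 1]
print(may_attend(packs[0]["segment_ids"], query=3, key=2))  # False
```

In the first pack, token 3 belongs to the second example, so it may not attend to token 2 from the first example even though token 2 comes earlier in the sequence.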
Padding side and cost
For causal (decoder-only) models, the padding side matters at inference: batches are typically left-padded so that every row's real tokens end at the last position, where generation continues. And remember the cost lever: attention work grows roughly with the square of sequence length, so a larger max_seq_length is more expensive per step. Pick the smallest length that comfortably fits your data.
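Both points can be illustrated with a small sketch. `left_pad` and `relative_attention_cost` are hypothetical helpers, the pad ID of 0 is an assumption, and the cost function models only the roughly quadratic attention term, not the rest of the model.

```python
def left_pad(sequences, pad_id=0):
    """Pad on the left so every row's real tokens end at the last position."""
    longest = max(len(seq) for seq in sequences)
    input_ids = [[pad_id] * (longest - len(seq)) + list(seq) for seq in sequences]
    attention_mask = [[0] * (longest - len(seq)) + [1] * len(seq) for seq in sequences]
    return input_ids, attention_mask


def relative_attention_cost(new_len, old_len):
    """Approximate ratio of attention work, assuming cost ~ length squared."""
    return (new_len / old_len) ** 2


ids, mask = left_pad([[5, 6, 7], [8]])
print(ids)   # [[5, 6, 7], [0, 0, 8]]
print(mask)  # [[1, 1, 1], [0, 0, 1]]
print(relative_attention_cost(4096, 2048))  # 4.0
```

With left-padding, the next generated token is appended directly after real content in every row. Under the quadratic assumption, doubling the sequence length roughly quadruples the attention work.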
Key idea
Set max_seq_length from your data, not a round number: large enough that completions are never clipped, small enough that you're not paying for empty context.
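As a rough sketch of "from your data", one option is to take the longest tokenized example and round up to a convenient multiple. `suggest_max_seq_length` is a hypothetical helper, and rounding to a multiple of 64 is an assumption made for illustration, not a requirement of any trainer.

```python
import math


def suggest_max_seq_length(total_lengths, multiple=64):
    """Smallest multiple of `multiple` that fits the longest example."""
    longest = max(total_lengths)
    return math.ceil(longest / multiple) * multiple


# Token counts (prompt + completion) measured on a toy dataset.
lengths = [130, 412, 97, 700]
print(suggest_max_seq_length(lengths))  # 704
```

Here 704 fits every example, while a round number such as 4096 would spend most of each sequence on empty context. If a handful of outliers dominate the maximum, shortening their prompts may be cheaper than raising the cap for the whole dataset.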
Key terms
- Padding: Filling shorter sequences with a pad token so a batch is rectangular.
- Attention mask: A 0/1 marker telling the model to ignore padded positions.
- Truncation: Cutting sequences longer than max_seq_length; truncate the prompt side to protect the completion.
- max_seq_length: The cap on tokens per example; set it from the data's length distribution.
- Sequence packing: Concatenating short examples into one full-length sequence to avoid wasted padding compute.
- Padding side: Which end padding is added; causal models often left-pad at inference.