Track 1 · SFT fundamentals · Lesson 19

Decoding controls: temperature, top-p, stop tokens

After this lesson you can name what each decoding knob does (temperature, top-p, top-k, max_new_tokens, stop tokens, repetition penalty), tune them for a use case, and recognise when a bad output is a decoding problem rather than a training problem.

Level: beginner · Read time: ~9 min · Prerequisites: Next-token prediction

The same trained model can produce wildly different outputs on the same prompt depending on how you decode. Decoding is the bridge from "a probability distribution over the next token" to "actual text" — and it is governed by knobs that have nothing to do with training. Every developer eventually mistakes a decoding problem for a training problem, retrains, and discovers nothing changed. This lesson is so you skip that step.

Decoding is not training

The model produces a probability for every token in the vocabulary at every position (Lesson 0.6). Decoding is the rule that picks a token from that distribution. The training process determines the distribution; the decoding rule determines which token in it you actually emit. Tuning decoding does not change a single weight — it changes which path you take through the model's predictions.

Greedy vs sampling

Two families of decoding rule. Greedy always takes the most likely token. It is deterministic — same prompt, same output — and tends toward the average; great for evaluation, can be repetitive for generation. Sampling draws a token from the distribution. It is non-deterministic, adds variety, and is the right default for chat / writing tasks where blandness is itself a failure.

In the HF Transformers API: do_sample=False is greedy; do_sample=True is sampling. For classification or structured extraction you almost always want greedy. For chat / FAQ-style generation you almost always want sampling.
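To make the difference concrete, here is a minimal sketch of the two rules applied to a toy distribution. The token names and probabilities are invented for illustration, and `greedy_pick` and `sample_pick` are hypothetical helpers, not part of any library.

```python
import random

# Toy next-token distribution (hypothetical values, not from a real model).
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}

def greedy_pick(probs):
    """Return the single most likely token; same input, same output."""
    return max(probs, key=probs.get)

def sample_pick(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

rng = random.Random(0)  # seeded only so the demo is repeatable
greedy_runs = [greedy_pick(probs) for _ in range(5)]
sampled_runs = [sample_pick(probs, rng) for _ in range(200)]

print(greedy_runs)        # always the same token
print(set(sampled_runs))  # several different tokens show up
```

Greedy returns "cat" every time because it has the highest probability. Sampling mostly returns "cat" too, but the other tokens appear roughly in proportion to their probabilities.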

Temperature: sharpen or flatten the distribution

Temperature (T) divides the logits before softmax. T = 1 is the trained distribution. T < 1 sharpens — high-probability tokens become more likely, low-probability tokens become less. T > 1 flattens — more variety, more risk. Extremes are easy to reason about: T → 0 approaches greedy; T → ∞ approaches uniform random.
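A small sketch of that arithmetic, using only the standard library. The logits are made-up numbers for a three-token vocabulary, and `softmax_with_temperature` is a hypothetical helper written for this example.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use greedy decoding instead of T = 0")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Running it shows the top token's share growing as T drops below 1 and shrinking as T rises above 1, while the ranking of the tokens never changes.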

Top-p (nucleus) and top-k: chop off the tail

Temperature reshapes the distribution but still allows any token. Top-p and top-k truncate the tail before sampling, so the model never picks an obviously bad token.

The standard combo for chat is temperature=0.7, top_p=0.9, top_k=0 (top-k off, nucleus on). For deterministic tasks all of these are off (greedy).
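The two truncation rules can be sketched on a toy distribution as follows. The helper names and the probabilities are assumptions made for this example. Treating `k <= 0` as "off" matches the top_k=0 convention used above.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise. k <= 0 disables the filter."""
    if k <= 0:
        return dict(probs)
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}

def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability >= p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}

# Hypothetical next-token probabilities.
probs = {"cat": 0.5, "dog": 0.3, "fish": 0.15, "xylophone": 0.05}
nucleus = top_p_filter(probs, 0.9)  # drops only the unlikely tail token
top2 = top_k_filter(probs, 2)       # keeps exactly two tokens

print(sorted(nucleus))
print(sorted(top2))
```

Top-k always keeps a fixed number of tokens. Top-p keeps more tokens when the distribution is flat and fewer when it is peaked, which is why it is usually the one left on for chat.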

Stop tokens and max_new_tokens: don't let it ramble

Stop tokens tell the decoding loop to halt as soon as the model emits a particular token (often the chat template's end-of-turn marker, e.g. <|im_end|>). With no stop token, or with the wrong one, generation runs past your answer: the model emits the next user turn it imagines, or just trails off. Instruct models bake their end tokens into the tokenizer; chat templates set them automatically.

max_new_tokens is a hard cap. Always set it. A chat answer with a forgotten stop token and no cap will fill its context window with hallucinated conversation. For a classification task that emits one label, max_new_tokens = 4 is fine; for a chat answer, 256–512.
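Here is a minimal sketch of how the two limits interact inside a generation loop. The "model" is a scripted stand-in that replays a fixed list of tokens, and `generate`, `scripted_model`, and `SCRIPT` are hypothetical names used only for this demo.

```python
def generate(next_token_fn, prompt_tokens, stop_tokens, max_new_tokens):
    """Call next_token_fn until a stop token appears or the cap is hit."""
    output = []
    for _ in range(max_new_tokens):
        token = next_token_fn(prompt_tokens + output)
        if token in stop_tokens:
            return output, "stop_token"
        output.append(token)
    return output, "max_new_tokens"

# Scripted stand-in for a model: it answers, ends its turn, then would keep going.
SCRIPT = ["Paris", ".", "<|im_end|>", "<|im_start|>", "user", "And", "Rome", "?"]

def scripted_model(tokens_so_far):
    n_generated = len(tokens_so_far) - 1  # the prompt is one token in this demo
    return SCRIPT[n_generated % len(SCRIPT)]

prompt = ["Capital of France?"]
good, why_good = generate(scripted_model, prompt, {"<|im_end|>"}, 16)
bad, why_bad = generate(scripted_model, prompt, {"</s>"}, 16)  # wrong stop token

print(good, why_good)
print(len(bad), why_bad)
```

With the right stop token the loop returns the two-token answer. With the wrong one it runs into an imagined user turn and only ends because the cap is there.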

Repetition penalty: small fix for a common bug

Sometimes a model loops: it generates a phrase, then re-generates it, then re-generates it. The repetition penalty reduces the probability of any token that has already appeared in the output. A small value (1.05–1.15) usually breaks the loop with no quality cost. Don't go above ~1.3: you'll start suppressing legitimately repeated words like "the."
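One common way to apply such a penalty is to divide positive logits by it and multiply negative ones, so the score always moves down. The sketch below assumes that scheme. Real libraries may differ in details such as whether prompt tokens count, and `apply_repetition_penalty` is a hypothetical helper.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.1):
    """Make already-generated tokens less likely.

    Positive logits are divided by the penalty and negative ones are
    multiplied by it, so the adjustment always pushes the score down.
    """
    adjusted = list(logits)
    for token_id in set(generated_ids):
        score = adjusted[token_id]
        adjusted[token_id] = score / penalty if score > 0 else score * penalty
    return adjusted

logits = [3.0, 1.0, -2.0, 0.5]  # hypothetical scores for a 4-token vocabulary
already_generated = [0, 2, 0]   # token 0 appeared twice, token 2 once
penalised = apply_repetition_penalty(logits, already_generated, penalty=1.1)

print(penalised)
```

Note that in this version the penalty is applied once per token, not once per occurrence: token 0 is penalised the same amount whether it appeared twice or ten times.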

Honest beat — repetitive output is two different bugs

Repetitive output can mean a decoding problem or a training problem. First, check decoding: turn on repetition_penalty=1.1 or switch from greedy to top_p=0.9 sampling. If the variety returns, it was the decoding rule trapping you in a locally-likely loop — no retrain needed. If the model still loops with healthy decoding, it has overfit on repetitive training data and a retrain on more diverse examples is the right answer. Knowing which costs you a five-minute experiment instead of a five-hour rerun.
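For the five-minute experiment it helps to put a number on "loops". One simple option, offered here as an assumption and not a standard metric from this course, is the fraction of n-grams that repeat an earlier n-gram. Compare it before and after changing the decoding settings.

```python
def repeated_ngram_fraction(tokens, n=3):
    """Fraction of n-grams that duplicate an earlier n-gram (0.0 = no repeats)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

# Hypothetical outputs, split on whitespace as a stand-in for real tokenisation.
looping = "thank you for asking thank you for asking thank you for asking".split()
varied = "thank you for asking about our refund policy and delivery times".split()

print(repeated_ngram_fraction(looping))
print(repeated_ngram_fraction(varied))
```

If the score drops sharply once sampling or a small repetition penalty is on, the loop was a decoding artefact. If it stays high, suspect the training data.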

Time to first token (TTFT) and streaming

Decoding settings affect throughput (tokens per second) but the user-visible latency starts before any token is produced: time to first token (TTFT) is the gap between sending the prompt and seeing the first character of reply. For a chat UX, TTFT under ~200 ms feels instant; over ~1 s feels broken. Streaming — emitting tokens as they're produced rather than buffering the whole answer — masks the rest of the latency by giving the user something to read immediately. Most serving stacks support it; turn it on for any interactive use.
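TTFT is easy to measure once tokens arrive as a stream. The sketch below times a fake stream that sleeps to imitate prompt processing and per-token generation. `fake_token_stream` and its delays are invented for the demo, while `measure_stream` would work on any iterator of tokens.

```python
import time

def fake_token_stream(tokens, first_delay=0.05, per_token_delay=0.01):
    """Stand-in for a streaming server: a pause for prompt processing, then tokens."""
    time.sleep(first_delay)
    for i, token in enumerate(tokens):
        if i > 0:
            time.sleep(per_token_delay)
        yield token

def measure_stream(stream):
    """Return (time to first token, total time, tokens) for any token iterator."""
    start = time.perf_counter()
    ttft = None
    received = []
    for token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first token has just arrived
        received.append(token)
    total = time.perf_counter() - start
    return ttft, total, received

ttft, total, received = measure_stream(fake_token_stream(["Hel", "lo", "!"]))
print(f"TTFT {ttft * 1000:.0f} ms, total {total * 1000:.0f} ms, tokens {received}")
```

Without streaming the user waits for the total time before seeing anything. With streaming they wait only for the TTFT.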

Key idea

Decoding is not training. Before you retrain, try greedy vs sampling, temperature ≈ 0.7, top_p ≈ 0.9, a stop token, max_new_tokens, and a small repetition_penalty. Most "bad output" complaints have a decoding fix that costs five minutes. Save the retrain for when the model genuinely doesn't know.
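The checklist can be written down as two starting presets. The keys follow the keyword arguments named in this lesson for an HF Transformers-style generate call. The preset names are hypothetical and the values are this lesson's starting points, not tuned settings. Stop tokens are left out because they depend on the model's chat template.

```python
# Starting presets built from the values in this lesson (illustrative only).
GREEDY_CLASSIFICATION = {
    "do_sample": False,     # greedy: deterministic, one label out
    "max_new_tokens": 4,    # a single label needs very few tokens
}

CHAT_SAMPLING = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.9,
    "top_k": 0,                 # top-k off, nucleus on
    "max_new_tokens": 256,      # hard cap; raise toward 512 for longer answers
    "repetition_penalty": 1.1,  # small value to break loops
}

for name, preset in (("classification", GREEDY_CLASSIFICATION), ("chat", CHAT_SAMPLING)):
    print(name, preset)
```

Assuming a loaded model and tokenised inputs, a preset like this would typically be passed as `model.generate(**inputs, **CHAT_SAMPLING)`.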

Decoding lives at inference time, but it interacts with everything you've trained. Track 2 will pick the actual values in code; Lesson 1.20 covers a concern that comes before training: the dataset format you've been handed.

Key terms

Greedy decoding
Always take the most-likely next token. Deterministic; the default for evaluation and classification.
Sampling
Draw the next token from the distribution; non-deterministic, adds variety, the default for chat / writing.
Temperature (T)
Divides logits before softmax. T < 1 sharpens the distribution; T > 1 flattens it.
Top-p (nucleus sampling)
Keep the smallest set of tokens whose cumulative probability ≥ p and sample from that set. Adapts to distribution shape; p ≈ 0.9 is a standard default.
Top-k
Keep only the k most likely tokens and sample from that subset.
max_new_tokens
Hard cap on how many tokens the model may emit per call. Always set it.
Stop tokens
Tokens that signal end-of-generation. Instruct models bake one into the chat template; without it the model rambles.
Repetition penalty
A scalar that reduces probability of tokens already in the output. Small values (1.05–1.15) break loops without quality loss.
Time to first token (TTFT)
Latency from request send to first generated character. Dominates perceived chat UX.
