Speculative decoding: draft, verify, accept
After this lesson you can explain how speculative decoding turns sequential token generation into a propose-and-verify loop, do the back-of-envelope math for the expected speed-up from acceptance rate and draft cost, and decide whether it is worth wiring on for your target model — including the honest answer for SLM-sized targets.
Speculative decoding is one of the cleanest ideas in inference acceleration: instead of generating one token per forward pass, you have a small draft model guess the next few tokens and use the target model to verify the guess in a single pass. When the draft is right, you commit multiple tokens for the price of one target forward. When it is wrong, you fall back. This lesson is the mechanics, the math, and the honest verdict on whether you should bother for an SLM target.
The problem speculative decoding solves
Autoregressive generation is sequential by construction. To produce token t, the model needs the keys, values, and logits conditioned on tokens 1 through t-1. Causal masking makes parallel generation across future positions ill-defined — you cannot just predict five tokens in parallel from the same context, because token t+1's probability depends on the token actually chosen at position t.
A 7B target generating 100 tokens needs 100 sequential forward passes. On a GPU this is a memory-bandwidth-bound workload: each pass moves the model weights from HBM into compute, does a tiny amount of work for a single token, and writes the result back. The compute units are idle most of the wall clock. The math wants more tokens per forward pass, but the math of causal attention forbids it. Speculative decoding sneaks around that constraint by introducing a second, cheaper model whose guesses can be verified in parallel.
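The bandwidth argument above can be put into numbers with a rough sketch. It assumes batch size 1, that every decoded token streams all weights from memory once, and an illustrative bandwidth of 2 TB/s (an assumed figure, not a measurement of any particular GPU). KV-cache traffic is ignored. The function name is hypothetical.

```python
def decode_time_lower_bound_s(n_params: float, bytes_per_param: float,
                              bandwidth_bytes_per_s: float, n_tokens: int) -> float:
    """Lower bound on batch-1 decode time when each token streams all weights once."""
    weight_bytes = n_params * bytes_per_param
    seconds_per_token = weight_bytes / bandwidth_bytes_per_s
    return n_tokens * seconds_per_token


# 7B parameters in bf16 (2 bytes each), assumed 2 TB/s of memory bandwidth.
seconds_for_100 = decode_time_lower_bound_s(7e9, 2, 2e12, 100)
print(f"{seconds_for_100:.2f} s for 100 tokens")
```

Under these assumptions the floor is set by bytes moved, not by arithmetic, which is why verifying several positions in one pass costs little more than verifying one.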
How speculative decoding works
One round of speculative decoding has three steps:
- Draft. A small draft model takes the current context and autoregressively generates K candidate tokens — typically K is 4 to 8. This is cheap because the draft is much smaller than the target.
- Verify. The target model runs one forward pass over the context plus all K draft tokens. Because all K positions are now fully specified, the target can compute logits at every draft position in parallel — that is just a normal batched-position forward, not a parallel generation across futures.
- Accept. Walk the draft tokens left to right. At each position, compare the draft's token to the target's distribution at that position. Under the standard rejection-sampling rule (Leviathan et al., 2022), the target accepts the draft token with probability min(1, p_target / p_draft); on first rejection, the target samples a corrected token from a normalised residual distribution and discards the rest of the draft. For greedy decoding, the rule degenerates to "accept while the draft matches the target's argmax."
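The output distribution is provably identical to running the target on its own: speculative decoding is exact, not approximate. The total cost of a round is one target forward pass plus K cheap draft forwards. Every round commits at least one token (the correction on a rejection, or a bonus token sampled from the target when all K drafts pass), so you win when the average number of tokens committed per round exceeds the round's cost measured in target forward passes.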
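The accept step can be sketched in a few lines of plain Python. This is a toy illustration of the rejection-sampling rule on dictionary-valued distributions, not any library's implementation; the names verify_draft and sample_from are hypothetical, and the bonus token drawn when all K drafts are accepted follows the standard formulation.

```python
import random


def sample_from(weights, rng):
    """Draw one token from unnormalised non-negative weights."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


def verify_draft(draft_tokens, p_draft, p_target, rng=random):
    """Return the tokens committed in one round.

    p_draft[i] and p_target[i] map token -> probability at draft position i.
    p_target needs K + 1 entries; the last one supplies the bonus token.
    """
    committed = []
    for i, tok in enumerate(draft_tokens):
        q = p_draft[i].get(tok, 0.0)   # draft probability of its own proposal
        p = p_target[i].get(tok, 0.0)  # target probability of the same token
        if q > 0 and rng.random() < min(1.0, p / q):
            committed.append(tok)
            continue
        # First rejection: sample from max(0, p_target - p_draft), then stop.
        residual = {t: max(0.0, pt - p_draft[i].get(t, 0.0))
                    for t, pt in p_target[i].items()}
        committed.append(sample_from(residual, rng))
        return committed
    # All K accepted: one extra token from the target's next distribution.
    committed.append(sample_from(p_target[len(draft_tokens)], rng))
    return committed


# Draft and target agree everywhere: all three drafts plus a bonus token.
all_accepted = verify_draft([1, 1, 1], [{1: 1.0}] * 3, [{1: 1.0}] * 4)
# Target gives the drafted token zero mass: rejected, corrected to token 2.
corrected = verify_draft([1], [{1: 1.0}], [{2: 1.0}, {2: 1.0}])
print(all_accepted, corrected)
```

A production implementation does the same arithmetic on logit tensors for the whole batch at once; the dictionaries here only keep the control flow visible.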
The acceptance-rate equation
Let α be the per-token acceptance probability (how often the draft's next token agrees with the target), K the draft length, and c the draft-to-target cost ratio (draft forward time divided by target forward time). The expected speed-up over plain target decoding is approximately
speedup ≈ (1 − α^(K+1)) / ((1 − α) · (1 + c·K))
Concretely: with a target 10x more expensive than the draft (c = 0.1), K = 4, and α = 0.7, you get roughly 2.0x throughput. With α = 0.5 the same configuration gives only about 1.4x, because half the draft proposals are rejected and the work spent on them is wasted. With α = 0.9 you approach 3x (about 2.9x). Real-world large-target/small-draft pairs typically land between 1.5x and 3x; published claims above 3x are usually for very high-α drafts (EAGLE) or specific benchmarks. Treat anything over 4x with suspicion until you have measured it on your traffic.
Two failure modes lurk in this equation. If α is low, the expected-tokens factor (1 − α^(K+1)) / (1 − α) collapses toward 1 while you still pay the full (1 + c·K) cost of drafting, so the wasted draft work eats your gains. If c is high (the draft is not actually cheap relative to the target), the (1 + c·K) term itself grows. Both happen together when you try to use speculative decoding with a target that is already small.
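The equation is easy to tabulate, and a short script makes both failure modes visible. This is a minimal sketch of the formula above, assuming independent per-token acceptances; the function names are hypothetical.

```python
def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens committed per round: (1 - alpha^(K+1)) / (1 - alpha)."""
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def speedup(alpha: float, k: int, c: float) -> float:
    """Expected speed-up over plain decoding, per the equation above."""
    return expected_tokens_per_round(alpha, k) / (1 + c * k)


def best_k(alpha: float, c: float, max_k: int = 16) -> int:
    """Draft length in 1..max_k that maximises the expected speed-up."""
    return max(range(1, max_k + 1), key=lambda k: speedup(alpha, k, c))


for alpha in (0.5, 0.7, 0.9):
    print(f"alpha={alpha}: {speedup(alpha, 4, 0.1):.2f}x at K=4, "
          f"best K={best_k(alpha, 0.1)}")
```

Sweeping K this way shows why longer drafts are not free: past the optimum, each extra draft token adds c to the cost but is rarely reached before a rejection.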
Same-tokenizer constraint and draft-model alignment
Speculative decoding requires the draft and target to share a tokenizer. The verification step compares draft tokens to target logits at the same positions; if the two models tokenise differently, those positions do not line up and the comparison is meaningless. There are research variants that bridge tokenizer mismatches by re-tokenising on the fly, but they pay a substantial overhead and remain niche.
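A cheap pre-flight check for this constraint is to compare the two vocabularies before loading any weights. The sketch below works on plain token-to-id dictionaries, such as those returned by get_vocab() on HuggingFace tokenizers; the function name is hypothetical. Matching vocabularies are necessary but not sufficient, since normalisation or merge rules could still differ.

```python
def vocab_mismatches(target_vocab: dict, draft_vocab: dict) -> list:
    """Return tokens that are missing from one vocabulary or mapped to different ids."""
    mismatched = []
    for token in target_vocab.keys() | draft_vocab.keys():
        if target_vocab.get(token) != draft_vocab.get(token):
            mismatched.append(token)
    return sorted(mismatched)


# Toy vocabularies for illustration only.
target_vocab = {"<s>": 0, "hello": 1, "world": 2}
draft_vocab = {"<s>": 0, "hello": 1, "world": 3, "extra": 2}
problems = vocab_mismatches(target_vocab, draft_vocab)
print(problems)
```

An empty result is the signal to proceed; anything else means draft token ids would be scored against the wrong target logits.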
Within the same-tokenizer family, α depends on how well the draft mimics the target. A small model from the same training pipeline as the target (think Llama-3-8B as a draft for Llama-3-70B) is the natural pick: shared pretraining data, shared instruction-tuning conventions, shared safety post-training all push α up. A small model from a different family — even with the same tokenizer — typically gives much lower α because its distribution diverges in idiosyncratic ways. If you are picking a draft, prefer a same-family sibling over a generic small model.
EAGLE, Medusa, and the draft-model family
The base recipe uses a separately-trained small model as the draft. The frontier has moved on:
- EAGLE trains a tiny auxiliary head (a single decoder layer on top of the target's hidden states) to predict the target's next-token embedding directly. Because it is conditioned on the target's actual hidden states, EAGLE's α is much higher than a standalone small model's — published numbers reach 0.8+ on common benchmarks — at the cost of training a custom head per target.
- Medusa attaches several parallel decoding heads to the target itself, each predicting one extra position. A tree-attention verification step lets the target validate many candidate continuations in a single forward pass. There is no separate draft model in memory — useful when VRAM is tight — but the bookkeeping of tree attention is real engineering work.
Both raise the practical α-ceiling and both ship in production stacks today. If you are operating a large-target serving rig, EAGLE or Medusa is usually the right starting point over a standalone-draft recipe.
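The bookkeeping behind tree attention comes down to an attention mask in which each candidate token sees only itself and its ancestors in the candidate tree. The following is a minimal sketch of that masking idea, not Medusa's actual code; the parent-pointer encoding and the function name are assumptions, and attention to the shared context prefix is omitted.

```python
def tree_attention_mask(parents: list) -> list:
    """mask[i][j] is True when candidate i may attend to candidate j.

    parents[i] is the index of candidate i's parent, or -1 when the
    candidate hangs directly off the committed context.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:       # walk up to the root, marking self and ancestors
            mask[i][j] = True
            j = parents[j]
    return mask


# Two candidates for the next position (0 and 1); candidates 2 and 3 continue
# candidate 0, and candidate 4 continues candidate 1.
mask = tree_attention_mask([-1, -1, 0, 0, 4 - 3])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each root-to-leaf path is then one candidate continuation, and all of them are scored in the same forward pass because siblings never attend to each other.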
Why it matters less for SLMs than for big models
Notice that the speed-up equation contains c (draft-to-target cost) but not the absolute target cost. The wall-clock benefit of speculative decoding scales with how expensive the target's forward pass is. For a 70B target on a single GPU, one forward pass takes tens of milliseconds and the memory-bandwidth bottleneck is severe — there is plenty of room for a 7B draft (c ≈ 0.1) to add value. For a 135M SLM, one forward pass already runs in a millisecond or two, and a model 10x smaller is essentially a tokenizer wrapper plus a few embeddings: you cannot find a meaningfully cheaper draft.
Concretely, for SmolLM2-135M or a 1.7B fine-tune, the per-token cost is already small enough that the fixed overhead of running two models — draft warm-up, draft KV-cache, the verification accounting — eats most of the theoretical win. Practical experiments on SLM targets show speed-ups in the 1.1x to 1.5x range with a well-chosen draft, and frequently regressions when the draft is poorly matched. Speculative decoding is worth wiring on for an SLM only when (a) you have a much smaller specialised draft such as an n-gram model or a 10M-parameter shallow head, or (b) the SLM is being served alongside a larger fallback model where speculative decoding bridges the two.
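To see how fixed overhead changes the picture, the speed-up formula can be extended with one extra term. The overhead term below is an assumption added for illustration (per-round fixed cost as a fraction of one target forward pass), and the numbers are made up to contrast a large target with a small one; they are not measurements.

```python
def speedup_with_overhead(alpha: float, k: int, c: float, overhead: float) -> float:
    """Expected speed-up when every round also pays a fixed cost.

    overhead is an assumed per-round cost (scheduling, cache bookkeeping),
    expressed as a fraction of one target forward pass.
    """
    if alpha >= 1.0:
        expected_tokens = float(k + 1)
    else:
        expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    return expected_tokens / (1 + c * k + overhead)


# Illustrative settings: a cheap draft with small overhead versus a draft
# that costs half a target forward with overhead that is no longer negligible.
large_target = speedup_with_overhead(alpha=0.7, k=4, c=0.1, overhead=0.05)
small_target = speedup_with_overhead(alpha=0.7, k=4, c=0.5, overhead=0.3)
print(f"large target: {large_target:.2f}x, small target: {small_target:.2f}x")
```

With the same α and K, the second configuration falls below 1x, which is the regression described above.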
For the typical SLM serving setup, the bigger wall-clock wins come from KV-cache management, continuous batching, and quantization (covered in Lesson 4.8 — serving and inference). Those move the latency needle reliably; speculative decoding on an SLM target moves it marginally and only with care.
Tooling pointers
Three production stacks expose speculative decoding today:
- vLLM has first-class support for speculative decoding with both standalone drafts and Medusa / EAGLE heads. The configuration is one block in the LLM constructor.
- llama.cpp supports draft-model speculative decoding via the --draft family of flags on the server and CLI. Same-tokenizer constraint applies; bring a matched GGUF for the draft.
- HuggingFace Transformers exposes the simplest version: an assistant_model argument on generate() that turns on draft-and-verify with whatever model you pass.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-3.1-8B-Instruct"
draft_id = "meta-llama/Llama-3.2-1B-Instruct"  # same tokenizer family

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype="bfloat16", device_map="auto")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype="bfloat16", device_map="auto")

# K = draft length per round. Transformers reads this from the assistant's
# generation config; the schedule may adapt K between rounds unless pinned.
draft.generation_config.num_assistant_tokens = 5
draft.generation_config.num_assistant_tokens_schedule = "constant"

inputs = tok("Explain speculative decoding in two sentences.", return_tensors="pt").to(target.device)
out = target.generate(
    **inputs,
    assistant_model=draft,  # turn on speculative decoding
    max_new_tokens=200,
    do_sample=False,        # greedy verification
)
print(tok.decode(out[0], skip_special_tokens=True))
```
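The same recipe in vLLM looks like LLM(model=target_id, speculative_model=draft_id, num_speculative_tokens=5); in llama.cpp it is ./llama-server -m target.gguf --draft 5 -md draft.gguf. Argument and flag names have shifted between releases of both projects (newer vLLM versions group these settings under a speculative_config argument), so check the documentation for the version you run. Measure latency and tokens-per-second before and after; if α is poor you will see a regression, not a gain.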
Honest beat — modest gains on SLM targets
For an SLM target (135M to 1.7B), speculative decoding's realistic margin is usually 1.2x to 1.8x at best — and that assumes you have a draft that meaningfully outpaces the target. With a poorly-matched draft the wall-clock can go the wrong way. KV-cache reuse, continuous batching, and quantization will move the latency needle more than speculative decoding on a 135M–1.7B target. Treat speculative decoding as the last optimisation you reach for on an SLM, not the first.
What to measure if you turn it on
Three numbers matter and one of them is the one most teams forget to record:
- Acceptance rate α. Log it per request and per workload. If α collapses on a new prompt distribution, the draft no longer matches the target there.
- Mean accepted tokens per round. Equal to (1 − α^(K+1)) / (1 − α) in the limit; in practice you want it above 2 to justify the bookkeeping.
- End-to-end tokens-per-second on your real traffic, with and without speculative decoding. Synthetic prompts overstate the win; conversational traffic with short turns understates it. Measure both and report the median, not the best run.
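The first two numbers can be recovered from a simple per-round log. The sketch below assumes you record how many draft tokens were accepted in each round (a hypothetical log format) and that verification stops at the first rejection; it counts the target's own correction or bonus token in the per-round total, matching the formula above.

```python
def acceptance_stats(accepted_per_round: list, k: int) -> tuple:
    """Estimate per-token acceptance rate and mean tokens committed per round.

    accepted_per_round[i] is the number of draft tokens accepted in round i (0..k).
    """
    total_accepted = sum(accepted_per_round)
    # A round that accepts a < k tokens tested a + 1 positions (the last failed);
    # a round that accepts all k tested exactly k positions.
    total_tested = sum(min(a + 1, k) for a in accepted_per_round)
    alpha_hat = total_accepted / total_tested
    # Each round also commits one token from the target (correction or bonus).
    mean_tokens = total_accepted / len(accepted_per_round) + 1
    return alpha_hat, mean_tokens


alpha_hat, mean_tokens = acceptance_stats([4, 0, 2, 4, 1], k=4)
print(f"alpha ~ {alpha_hat:.2f}, tokens per round ~ {mean_tokens:.1f}")
```

Computing these per workload rather than globally is what reveals a draft that matches the target on one prompt distribution and not on another.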
Key idea
Speculative decoding is exact, not approximate: a small draft proposes K tokens and the target verifies them in a single forward pass, so the output distribution is unchanged. The win is the math (1 − α^(K+1)) / ((1 − α) · (1 + c·K)), governed by acceptance rate, draft length, and draft-to-target cost ratio. It scales with target cost — which is why a 70B target sees 2–3x and an SLM target sees 1.2–1.8x at best. For SLM serving, KV-cache, batching, and quantization usually pay better.
Key terms
- Speculative decoding
- An inference-time acceleration technique where a small draft model proposes K tokens and the target model verifies them in a single forward pass, accepting matching tokens and falling back to a target correction on first disagreement.
- Draft model
- The small, fast model that proposes candidate tokens during speculative decoding; must share a tokenizer with the target and ideally come from the same model family for high α.
- Target model
- The model whose output distribution you actually want; verifies the draft's K candidates in parallel and provides the corrected token on rejection.
- Acceptance rate (α)
- The probability that the target accepts a draft token at a given position; the dominant driver of speculative decoding's speed-up.
- Draft length (K)
- The number of tokens the draft proposes per round before the target verifies; typical values are 4 to 8.
- EAGLE
- A speculative-decoding variant that trains a tiny auxiliary head on top of the target's hidden states to predict the target's next embedding, achieving very high α at the cost of training a custom head per target.
- Medusa
- A variant that attaches several parallel decoding heads to the target itself and uses tree attention to verify many candidate continuations in one pass, removing the separate draft model.
- Tree attention
- A verification scheme that lets a model score a tree of candidate continuations in a single forward pass; used by Medusa and other speculative-decoding variants.
- Same-tokenizer constraint
- The hard requirement that draft and target share a tokenizer so that draft positions line up with target logit positions during verification.