Learning rate, schedules, warmup, epochs vs steps
After this lesson you can choose a sensible learning rate for fine-tuning, explain warmup and a decay schedule, and reason in both epochs and steps.
The learning rate set how big a step gradient descent takes. In fine-tuning it is the knob you'll touch most and the one most able to make or break a run. This lesson gives you defaults that work and the intuition to adjust them.
Sensible ranges
Fine-tuning uses small learning rates, because you're nudging an already-capable model, not training from scratch. Typical ranges: around 1e-5 to 5e-5 for full fine-tuning, and a bit higher — roughly 1e-4 to 3e-4 — for LoRA (which trains far fewer parameters and tolerates larger steps). These are starting points, not laws; the right value depends on your data and model.
Reading a wrong LR
Too high: loss spikes, oscillates wildly, or goes to NaN. Too low: loss descends painfully slowly or barely at all. When in doubt, change the LR by factors of 3–10 (not 10%) to find the right order of magnitude first.
Warmup
Starting at the full learning rate on step one can destabilize a model whose optimizer statistics are still cold. Warmup ramps the LR from 0 up to its target over the first small fraction of training (often a few percent of steps), then hands off to the main schedule. It's a cheap insurance policy against early divergence; a warmup ratio like 0.03 is common.
The decay schedule
After warmup, the LR typically decays over training so the model takes big steps early (to move fast) and small steps late (to settle precisely). The cosine schedule — a smooth decay following a cosine curve down toward zero — is the common default. Linear decay is also fine. The shape matters less than having some decay: ending at a high LR tends to leave the model bouncing instead of converging.
Epochs vs steps
Two ways to count training length. An epoch is one full pass over the training data; a step is one parameter update (one minibatch). They relate by steps_per_epoch = examples / effective_batch_size. Fine-tuning usually runs a small number of epochs (often 1–3) — too many and the model starts memorizing. Schedules are usually defined over total steps, so the LR decays to near-zero exactly as training ends. Knowing both lets you translate "train for 3 epochs" into the step count your scheduler needs.
Closely tied to the learning rate is how many examples each step sees — the batch size and gradient accumulation — which is next.
Key terms
- Learning rate
- The step-size multiplier; the most important fine-tuning knob (small values: ~1e-5–5e-5 full, ~1e-4–3e-4 LoRA).
- Warmup
- Ramping the LR from 0 to target over the first few percent of steps to avoid early divergence.
- Cosine schedule
- A smooth LR decay toward zero over training; big steps early, small steps late.
- Epoch
- One full pass over the training data.
- Step
- One parameter update (one minibatch); schedules are usually defined over total steps.
Check yourself
Answers are saved to this browser.