Track 1 · SFT fundamentals · Lesson 11

Learning rate, schedules, warmup, epochs vs steps

After this lesson you can choose a sensible learning rate for fine-tuning, explain warmup and a decay schedule, and reason in both epochs and steps.

Level: beginner · Read time: ~9 min · Prerequisites: Cross-entropy loss for token prediction

The learning rate sets how big a step gradient descent takes. In fine-tuning it is the knob you'll touch most and the one most able to make or break a run. This lesson gives you defaults that work and the intuition to adjust them.

Sensible ranges

Fine-tuning uses small learning rates, because you're nudging an already-capable model, not training from scratch. Typical ranges: around 1e-5 to 5e-5 for full fine-tuning, and a bit higher — roughly 1e-4 to 3e-4 — for LoRA (which trains far fewer parameters and tolerates larger steps). These are starting points, not laws; the right value depends on your data and model.

Reading a wrong LR

Too high: loss spikes, oscillates wildly, or goes to NaN. Too low: loss descends painfully slowly or barely at all. When in doubt, change the LR by factors of 3–10 (not 10%) to find the right order of magnitude first.
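To make "factors of 3–10" concrete, here is a minimal sketch that builds a small geometric grid of candidate learning rates around a starting guess. The function name `lr_sweep` and its parameters are hypothetical, chosen for illustration; the center of 2e-5 is just one value from the full fine-tuning range above.

```python
def lr_sweep(center: float, factor: float = 3.0, n_each_side: int = 2) -> list[float]:
    """Return candidate learning rates spaced geometrically around `center`.

    Each neighbor differs from the next by `factor`, so the grid spans
    orders of magnitude instead of small percentage changes.
    """
    return [center * factor ** k for k in range(-n_each_side, n_each_side + 1)]


# Hypothetical starting guess for full fine-tuning.
candidates = lr_sweep(2e-5)
for lr in candidates:
    print(f"{lr:.1e}")
```

With a factor of 3 and two values on each side, the grid runs from roughly 2e-6 to 2e-4. A short run at each value is usually enough to see which ones spike and which ones crawl.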

Warmup

Starting at the full learning rate on step one can destabilize a model whose optimizer statistics (for example, Adam's running gradient averages) have not yet had time to accumulate. Warmup ramps the LR from 0 up to its target over the first small fraction of training (often a few percent of steps), then hands off to the main schedule. It's cheap insurance against early divergence; a warmup ratio like 0.03 is common.
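As a rough sketch of the ramp, assuming a linear warmup (the usual shape, though the text above doesn't fix one), the LR at a given step can be computed like this. The name `warmup_lr` and the run length of 1000 steps are illustrative assumptions.

```python
def warmup_lr(step: int, target_lr: float, warmup_steps: int) -> float:
    """Linear warmup: 0 at step 0, rising to target_lr at warmup_steps."""
    if warmup_steps <= 0 or step >= warmup_steps:
        return target_lr  # warmup finished (or disabled)
    return target_lr * step / warmup_steps


total_steps = 1000                        # assumed run length
warmup_steps = round(0.03 * total_steps)  # warmup ratio 0.03 -> 30 steps

for step in (0, 15, 30, 500):
    print(step, warmup_lr(step, target_lr=2e-5, warmup_steps=warmup_steps))
```

After `warmup_steps` this sketch simply holds the target value; in a real run the decay schedule takes over from there.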

The decay schedule

After warmup, the LR typically decays over training so the model takes big steps early (to move fast) and small steps late (to settle precisely). The cosine schedule — a smooth decay following a cosine curve down toward zero — is the common default. Linear decay is also fine. The shape matters less than having some decay: ending at a high LR tends to leave the model bouncing instead of converging.

Warmup ramps the LR up over the first few percent of steps; a cosine schedule then decays it toward zero.
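The whole curve, warmup followed by cosine decay, fits in a few lines. This is a minimal sketch in plain Python, not any particular library's scheduler; `lr_at`, `peak_lr`, and `min_lr` are names assumed here for illustration, and the warmup is assumed to be linear.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_ratio: float = 0.03, min_lr: float = 0.0) -> float:
    """Learning rate at `step`: linear warmup, then cosine decay to min_lr."""
    warmup_steps = round(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Warmup phase: ramp linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: progress goes from 0 (end of warmup) to 1 (end of training).
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + (peak_lr - min_lr) * cosine


for step in (0, 30, 515, 1000):
    print(step, f"{lr_at(step, total_steps=1000, peak_lr=2e-5):.2e}")
```

Because the curve is defined over `total_steps`, the LR reaches `min_lr` exactly at the last step. Get the step count wrong and the decay either finishes early or never finishes.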

Epochs vs steps

There are two ways to count training length. An epoch is one full pass over the training data; a step is one parameter update, made on one effective batch (with gradient accumulation, that is several minibatches). They relate by steps_per_epoch = examples / effective_batch_size, rounded to a whole number (up, if the final partial batch is kept). Fine-tuning usually runs a small number of epochs (often 1–3); too many and the model starts memorizing the training set. Schedules are usually defined over total steps, so that the LR reaches near-zero just as training ends. Knowing both lets you translate "train for 3 epochs" into the step count your scheduler needs.
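Here is the epoch-to-step translation as a small worked example. It is a sketch under stated assumptions: the effective batch size is taken to be per-device batch × gradient accumulation steps × number of devices, and the last partial batch of each epoch is kept (hence the ceiling). Frameworks differ on that last point, so treat the exact count as approximate. All names and numbers are illustrative.

```python
import math


def total_training_steps(num_examples: int, per_device_batch: int,
                         grad_accum_steps: int = 1, num_devices: int = 1,
                         epochs: int = 3) -> int:
    """Translate 'train for N epochs' into a count of parameter updates."""
    # Assumed definition of the effective batch size (examples per update).
    effective_batch = per_device_batch * grad_accum_steps * num_devices
    # Ceiling: a final partial batch still counts as one update here.
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return steps_per_epoch * epochs


# 10,000 examples, batch of 8 with 4 accumulation steps -> effective batch 32.
steps = total_training_steps(10_000, per_device_batch=8, grad_accum_steps=4, epochs=3)
print(steps)  # 313 steps per epoch * 3 epochs = 939
```

The resulting number is the total step count a schedule is defined over, and a warmup ratio of 0.03 would cover about the first 28 of those steps.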

Closely tied to the learning rate is how many examples each step sees — the batch size and gradient accumulation — which is next.

Key terms

Learning rate: The step-size multiplier; the most important fine-tuning knob (small values: ~1e-5–5e-5 full, ~1e-4–3e-4 LoRA).
Warmup: Ramping the LR from 0 to target over the first few percent of steps to avoid early divergence.
Cosine schedule: A smooth LR decay toward zero over training; big steps early, small steps late.
Epoch: One full pass over the training data.
Step: One parameter update (one effective batch); schedules are usually defined over total steps.

