How models learn: loss & gradient descent
After this lesson you can describe, precisely, the loop that turns "nudge the knobs" into an algorithm — loss, gradient, learning rate — and explain what an epoch and a minibatch are.
In the last lesson we said training is a loop: predict, measure how wrong you are, nudge the parameters to be less wrong, repeat. That word "nudge" was doing a lot of work. This lesson replaces it with an actual algorithm — gradient descent — which is, with surprisingly few changes, how every model in this course is trained.
Step one: turn "wrong" into a number
You can't minimize something you can't measure. So the first ingredient is a loss function: it takes the model's prediction and the true answer and returns a single number that is large when the prediction is bad and small when it's good.
For the house-price model, a natural loss is the squared error: take the gap between predicted and actual price, and square it. Predict 250k when the truth is 240k and the loss for that example is (250k − 240k)² = 100,000,000. Squaring makes over- and under-shooting both count as positive error, and it penalizes bigger mistakes harder: an error twice as large costs four times as much. The loss over a dataset is just the average over all examples, which is why this loss is called mean squared error (MSE).
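As a minimal sketch of the arithmetic above, the squared error and its dataset average can be written in a few lines of Python. The function names and the three-house dataset are hypothetical, chosen only for illustration.

```python
def squared_error(predicted, actual):
    """Loss for one example: the squared gap between prediction and truth."""
    return (predicted - actual) ** 2


def mean_squared_error(predictions, actuals):
    """Loss over a dataset: the average of the per-example squared errors."""
    errors = [squared_error(p, a) for p, a in zip(predictions, actuals)]
    return sum(errors) / len(errors)


# The example from the text: predict 250k when the truth is 240k.
single_loss = squared_error(250_000, 240_000)
print(single_loss)  # 100000000

# Under-shooting by the same 10k costs exactly the same.
undershoot_loss = squared_error(230_000, 240_000)

# A mistake twice as large (20k) costs four times as much.
double_loss = squared_error(260_000, 240_000)

# Hypothetical three-house dataset (made-up numbers).
predictions = [250_000, 310_000, 180_000]
actuals = [240_000, 310_000, 200_000]
dataset_loss = mean_squared_error(predictions, actuals)
```

Whatever the dataset size, the result is a single number, which is all the training loop needs.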
Different tasks use different loss functions — language models use cross-entropy, which we'll meet in Track 1 — but the role is always the same: one number, smaller is better.
The loss is a landscape over the parameters
Here is the key mental shift. Hold the data fixed and think of the loss as a function of the parameters. For the two-knob house model, every choice of (w, b) produces some average loss. Picture that as a landscape: the two knobs are your east–west and north–south position, and the loss is the altitude. Training is the search for the lowest valley.
Key idea
Training does not search over predictions; it searches over parameters. The data defines the shape of the landscape; the parameters are where you're standing on it.
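One way to make the landscape concrete is to hold a tiny dataset fixed and compute the average loss at many (w, b) positions. The sketch below assumes a made-up dataset in which price is exactly 2 × size + 1, and a coarse grid of knob settings. All names and numbers are hypothetical.

```python
# Hypothetical data: house size (in 100 m²) and price (in $100k).
# Generated by price = 2 * size + 1, so the valley floor sits at w=2, b=1.
sizes = [0.5, 1.0, 1.5, 2.0]
prices = [2.0, 3.0, 4.0, 5.0]


def average_loss(w, b):
    """Altitude of the landscape at position (w, b): mean squared error."""
    total = 0.0
    for x, y in zip(sizes, prices):
        total += (w * x + b - y) ** 2
    return total / len(sizes)


# Survey the landscape on a coarse grid: 0.0, 0.5, ..., 4.0 for each knob.
grid = [i * 0.5 for i in range(9)]
landscape = {(w, b): average_loss(w, b) for w in grid for b in grid}

# The lowest point on the grid.
best_w, best_b = min(landscape, key=landscape.get)
print(best_w, best_b, landscape[(best_w, best_b)])
```

Surveying a grid works for two knobs but becomes impossible with millions of parameters, which is why training needs a way to move downhill using only local information.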
Gradient descent: roll downhill
How do you get to the valley when you can only feel the ground right under your feet? You measure the slope and step downhill. In more than one dimension the slope is called the gradient: it points in the direction the loss increases fastest. So to decrease the loss, you step in the opposite direction. That's the whole idea — gradient descent:
for each step:
    prediction = model(inputs) # forward pass
    loss = loss_fn(prediction, truth)
    gradient = how loss changes w.r.t. each parameter
    parameter = parameter − learning_rate × gradient # the nudge
That last line is "the nudge," now precise: move every parameter a little, against its gradient. Repeat thousands of times and you descend into a valley of low loss.
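The loop above can be made runnable for the two-knob house model. This is a sketch under stated assumptions: the data is made up (price is exactly 2 × size + 1), the gradient of the mean squared error is written out by hand, and the learning rate of 0.1 and the 2,000 steps are arbitrary choices that happen to work on this toy problem.

```python
# Hypothetical data: size in 100 m², price in $100k, with price = 2 * size + 1.
sizes = [0.5, 1.0, 1.5, 2.0]
prices = [2.0, 3.0, 4.0, 5.0]


def model(w, b, x):
    """Forward pass of the two-knob model."""
    return w * x + b


def loss_fn(w, b):
    """Mean squared error over the whole dataset."""
    return sum((model(w, b, x) - y) ** 2 for x, y in zip(sizes, prices)) / len(sizes)


def gradient(w, b):
    """How the loss changes with respect to each parameter (derived by hand)."""
    n = len(sizes)
    grad_w = sum(2 * (model(w, b, x) - y) * x for x, y in zip(sizes, prices)) / n
    grad_b = sum(2 * (model(w, b, x) - y) for x, y in zip(sizes, prices)) / n
    return grad_w, grad_b


w, b = 0.0, 0.0        # arbitrary starting position on the landscape
learning_rate = 0.1    # assumed value; works for this toy problem
initial_loss = loss_fn(w, b)

for step in range(2000):
    grad_w, grad_b = gradient(w, b)
    w = w - learning_rate * grad_w   # the nudge, for each parameter
    b = b - learning_rate * grad_b

final_loss = loss_fn(w, b)
print(w, b, final_loss)
```

In real training the gradient is not derived by hand. Libraries compute it automatically, but the update line is the same.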
The learning rate: how big a step
The learning rate is the multiplier on the step. It is the single most important number you'll tune. Too small and training crawls, taking forever to reach the valley. Too large and you leap clear over the valley and bounce around — or diverge entirely, with the loss shooting to infinity. Typical values for fine-tuning are small, like 2e-4 (0.0002); we'll discuss why in Track 1.
Common failure
A loss that explodes to NaN after a few steps almost always means the learning rate is too high. A loss that barely moves often means it's too low. Reading the loss curve is a skill you'll use constantly.
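The three regimes are easy to reproduce on the simplest possible landscape. The sketch below assumes a one-parameter loss, loss(p) = p², whose gradient is 2p. The three learning rates are picked only to show each behavior on this toy problem and are not recommendations for real models.

```python
def run(learning_rate, steps=50, start=1.0):
    """Gradient descent on loss(p) = p**2; returns the loss after every step."""
    p = start
    losses = [p ** 2]
    for _ in range(steps):
        grad = 2 * p                  # derivative of p**2
        p = p - learning_rate * grad
        losses.append(p ** 2)
    return losses


too_small = run(0.001)   # crawls: the loss barely moves in 50 steps
reasonable = run(0.1)    # converges: the loss shrinks toward zero
too_large = run(1.1)     # diverges: every step overshoots further

print(too_small[-1], reasonable[-1], too_large[-1])
```

With the largest rate each update lands farther from the valley than where it started, so the loss grows instead of shrinking. Run long enough, that growth is what eventually shows up as infinity or NaN.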
Minibatches and epochs
Computing the gradient over the entire dataset for every step is accurate but slow. Instead we use a minibatch: a small random handful of examples (say 8 or 32). Each step computes the gradient on one minibatch and updates. Because the batch is a random sample, the gradient is a noisy estimate of the true one — hence the name stochastic gradient descent (SGD). The noise is mostly harmless and often helpful.
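To see what "noisy estimate" means, the sketch below compares the full-dataset gradient with the gradient computed on each minibatch of two examples. The eight data points, the starting parameters, and the helper name `gradient_w` are all hypothetical, and only the gradient for w is computed to keep the example short.

```python
import random

# Hypothetical data: 8 houses, size in 100 m² and price in $100k (made-up numbers).
sizes = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
prices = [2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9, 9.2]


def gradient_w(w, b, indices):
    """Gradient of the mean squared error w.r.t. w, using only the given examples."""
    total = 0.0
    for i in indices:
        error = w * sizes[i] + b - prices[i]
        total += 2 * error * sizes[i]
    return total / len(indices)


w, b = 1.0, 0.0
all_indices = list(range(len(sizes)))
full_gradient = gradient_w(w, b, all_indices)

# Shuffle, then cut the data into minibatches of 2.
rng = random.Random(0)
shuffled = all_indices[:]
rng.shuffle(shuffled)
batch_size = 2
batches = [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

batch_gradients = [gradient_w(w, b, batch) for batch in batches]
average_of_batches = sum(batch_gradients) / len(batch_gradients)

print("full gradient:   ", full_gradient)
print("batch gradients: ", batch_gradients)
print("their average:   ", average_of_batches)
```

Each minibatch gradient differs from the full one, but with equal-sized batches their average matches it. The individual steps wobble while the overall direction stays downhill.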
One full pass through all the training data is an epoch. You typically train for a few epochs. "Train for 3 epochs with batch size 8" means: shuffle the data, walk through it in groups of 8, updating the parameters after each group, and do that whole walk 3 times. With 1,000 training examples, that is 125 updates per epoch and 375 updates in total.
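The bookkeeping of epochs and minibatches can be sketched without any model at all. The example below assumes 20 stand-in training examples and reshuffles at the start of every epoch, which is a common choice but an assumption here. The helper `minibatches` is a hypothetical name.

```python
import math
import random


def minibatches(examples, batch_size, rng):
    """Shuffle a copy of the data, then yield it in groups of batch_size."""
    shuffled = examples[:]
    rng.shuffle(shuffled)
    for start in range(0, len(shuffled), batch_size):
        yield shuffled[start:start + batch_size]


examples = list(range(20))   # stand-ins for 20 training examples
epochs = 3
batch_size = 8
rng = random.Random(0)

updates = 0
batch_sizes_seen = []
for epoch in range(epochs):
    for batch in minibatches(examples, batch_size, rng):
        # A real loop would compute the gradient on `batch` and update here.
        updates += 1
        batch_sizes_seen.append(len(batch))

updates_per_epoch = math.ceil(len(examples) / batch_size)
print(updates, updates_per_epoch, batch_sizes_seen)
```

Because 20 is not a multiple of 8, the last batch of each epoch holds only 4 examples. Some training setups drop that short batch instead.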
Does it actually find the bottom?
In theory the landscape can have dips that aren't the deepest valley (local minima) and flat plateaus. In practice, for the large models we train, gradient descent reliably finds parameter settings that are good enough. That is the empirical miracle deep learning runs on. You won't be proving convergence; you'll be watching a loss curve go down and stopping training when it stops improving.
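"Stop when it stops improving" can be turned into a simple rule: stop once the loss has failed to improve by some minimum amount for a few checks in a row. The sketch below is one possible rule, not a standard one. The function name, the `patience` and `min_improvement` parameters, and the loss values are all made up for illustration.

```python
def stopping_step(losses, patience=2, min_improvement=1e-3):
    """Return the index at which training would stop, or None if it never stalls."""
    best = float("inf")
    checks_without_improvement = 0
    for step, loss in enumerate(losses):
        if best - loss > min_improvement:
            best = loss                      # a real improvement: reset the counter
            checks_without_improvement = 0
        else:
            checks_without_improvement += 1
            if checks_without_improvement >= patience:
                return step
    return None


# Made-up loss curve that drops and then flattens out.
curve = [1.00, 0.60, 0.45, 0.40, 0.40, 0.40, 0.40]
stop_at = stopping_step(curve)

# A curve that is still improving never triggers the rule.
still_improving = stopping_step([1.0, 0.5, 0.25])
print(stop_at, still_improving)
```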
That's the engine. Everything else — neural networks, Transformers, LoRA — changes what the model computes and which parameters exist, but the training loop stays this loop. Next we look at the kind of model that makes language possible: the neural network.
Key terms
- Loss function: A function returning one number measuring how wrong the model's predictions are; smaller is better.
- Gradient: The direction (over all parameters) in which the loss increases fastest; we step the opposite way.
- Gradient descent: Repeatedly nudging parameters against the gradient to reduce loss.
- Learning rate: The step-size multiplier; too high diverges, too low crawls.
- Minibatch / SGD: Estimating the gradient on a small random batch each step (stochastic gradient descent).
- Epoch: One complete pass through the training dataset.