Track 1 · SFT fundamentals · Lesson 12

Batch size, gradient accumulation, and the effective batch

After this lesson you can explain what batch size does, use gradient accumulation to fit a large effective batch on a small GPU, and reason about the effective batch as the number that really matters.

Level: beginner · Read time: ~8 min · Prerequisites: Learning rate, schedules, warmup, epochs vs steps

How many examples should each training step see at once? That's the batch size, and it trades off gradient quality against memory. When memory runs short — which it will on a single GPU — gradient accumulation is the trick that rescues you.

What batch size does

Each step estimates the gradient from a minibatch (Track 0). A larger batch averages over more examples, so the gradient is smoother and less noisy — training is more stable, though sometimes a little noise actually helps generalization. A smaller batch is noisier but uses far less memory. The catch: every example in the batch must fit in GPU memory simultaneously (activations for all of them), so batch size is often capped not by what you want but by what fits.
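To make "smoother and less noisy" concrete, here is a minimal sketch that measures how much the minibatch gradient varies from draw to draw at different batch sizes. Everything in it is an assumption made for illustration: the toy one-parameter model, the synthetic dataset, and the helper names `minibatch_gradient` and `gradient_noise` are not from any library.

```python
import random
import statistics

def minibatch_gradient(data, w, batch_size, rng):
    """Gradient w.r.t. w of the mean loss 0.5 * (w*x - y)**2 on a random minibatch."""
    batch = rng.sample(data, batch_size)
    return sum((w * x - y) * x for x, y in batch) / batch_size

def gradient_noise(data, w, batch_size, trials=2000, seed=0):
    """Standard deviation of the minibatch gradient across many random draws."""
    rng = random.Random(seed)
    grads = [minibatch_gradient(data, w, batch_size, rng) for _ in range(trials)]
    return statistics.stdev(grads)

# Toy dataset: y = 3x plus Gaussian noise (made-up numbers, illustration only).
data_rng = random.Random(42)
xs = [data_rng.uniform(-1, 1) for _ in range(4096)]
data = [(x, 3.0 * x + data_rng.gauss(0, 1)) for x in xs]

# Measure the spread of the gradient estimate at three batch sizes.
noise_by_batch = {b: gradient_noise(data, w=0.0, batch_size=b) for b in (1, 8, 64)}
for b, noise in noise_by_batch.items():
    print(f"batch_size={b:3d}  gradient std ~ {noise:.3f}")
```

On this toy problem the spread shrinks as the batch grows, roughly in proportion to one over the square root of the batch size. Real models will not follow that rule exactly, but the direction of the effect is the same.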

Gradient accumulation

Suppose you want the stability of a batch of 32 but only 8 fit in memory. Gradient accumulation solves this: run 4 micro-batches of 8, accumulating their gradients without updating the weights, then take one optimizer step on the accumulated gradient. Dividing each micro-batch loss by the number of accumulation steps makes that sum equal the average over all 32 examples. The model sees 32 examples' worth of gradient per update while only ever holding 8 in memory at once.

accum_steps = 4  # micro-batches per optimizer step

for i, micro_batch in enumerate(batches):
    loss = forward(micro_batch) / accum_steps  # scale so the sum averages correctly
    loss.backward()                            # accumulate gradients
    if (i + 1) % accum_steps == 0:
        optimizer.step()                       # update once per accum_steps
        optimizer.zero_grad()                  # clear gradients for the next group
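The scaling in the loop above can be checked by hand. The following sketch uses a toy one-parameter model with an analytic gradient (an assumption chosen so the example runs without a deep learning framework) and compares the gradient over all 32 examples with the gradient accumulated over 4 micro-batches of 8. The helper `mean_gradient` and the data values are hypothetical.

```python
def mean_gradient(examples, w):
    """Gradient w.r.t. w of the mean loss 0.5 * (w*x - y)**2 over `examples`."""
    return sum((w * x - y) * x for x, y in examples) / len(examples)

# 32 toy (x, y) pairs; the values are arbitrary.
examples = [(0.1 * i, 0.3 * i + 1.0) for i in range(32)]
w = 0.5
micro_batch_size = 8
accum_steps = 4

# Reference: one gradient computed over all 32 examples at once.
full_batch_grad = mean_gradient(examples, w)

# Accumulation: four micro-batches of 8, each contribution scaled by 1/accum_steps.
accumulated_grad = 0.0
for step in range(accum_steps):
    start = step * micro_batch_size
    micro_batch = examples[start:start + micro_batch_size]
    accumulated_grad += mean_gradient(micro_batch, w) / accum_steps

print(full_batch_grad, accumulated_grad)
```

The two values agree up to floating-point rounding. Note that this relies on every micro-batch having the same size; if micro-batches differ in size (or the loss is averaged over a varying number of tokens), dividing by `accum_steps` gives only an approximation of the full-batch average.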

The number that matters: effective batch size

What actually governs training dynamics is the effective batch size:

effective_batch = per_device_batch_size × gradient_accumulation_steps × num_devices

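A per-device batch of 8 with 4 accumulation steps on one GPU gives an effective batch of 32, with essentially the same dynamics as a single GPU that could hold 32 at once (small differences can arise from floating-point rounding or from layers that depend on batch statistics, such as batch normalization). So you tune the effective batch for the result you want, then split it into whatever per-device size your memory allows via accumulation.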
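As a small sketch of that workflow, the helpers below compute the effective batch from its three factors and go the other way: given a target effective batch and the largest per-device batch that fits in memory, they pick a split. Both function names are hypothetical, and the "largest divisor that fits" rule is just one reasonable choice, not a standard.

```python
def effective_batch(per_device_batch_size, gradient_accumulation_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch_size * gradient_accumulation_steps * num_devices

def plan_accumulation(target_effective_batch, max_per_device_batch, num_devices=1):
    """Return (per_device_batch_size, gradient_accumulation_steps) hitting the target exactly."""
    if max_per_device_batch < 1 or num_devices < 1:
        raise ValueError("max_per_device_batch and num_devices must be at least 1")
    if target_effective_batch % num_devices != 0:
        raise ValueError("target_effective_batch must be divisible by num_devices")
    per_device_total = target_effective_batch // num_devices
    # Largest per-device batch that fits in memory and divides the per-device total.
    # The loop always terminates with a result because 1 divides everything.
    for per_device in range(min(max_per_device_batch, per_device_total), 0, -1):
        if per_device_total % per_device == 0:
            return per_device, per_device_total // per_device

per_device, accum = plan_accumulation(target_effective_batch=32, max_per_device_batch=8)
print(per_device, accum, effective_batch(per_device, accum))
```

For the lesson's example (target 32, at most 8 per device, one GPU) this yields a per-device batch of 8 with 4 accumulation steps. Preferring the largest per-device batch that fits keeps the number of forward/backward passes per update as low as memory allows.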

Key idea

Think in effective batch size, not per-device batch size. Gradient accumulation lets a modest GPU reproduce the training dynamics of a much bigger one — trading wall-clock time (more forward/backward passes per update) for memory you don't have.

Interaction with the learning rate

Batch size and learning rate are linked: larger effective batches give cleaner gradients that can tolerate (and often want) a somewhat larger learning rate, and vice versa. You don't need a precise formula at this stage; just know that if you substantially change the effective batch, you may need to revisit the LR.

With the loop, loss, LR, and batch understood, the next question is which parameters you update: full fine-tuning or LoRA.

Key terms

Batch size
How many examples a step processes at once; larger = smoother gradient but more memory.
Gradient accumulation
Summing gradients over several micro-batches before one optimizer step, to simulate a larger batch.
Effective batch size
per_device_batch × accumulation_steps × devices — the number that governs training dynamics.
Gradient noise
Variance in the minibatch gradient estimate; smaller batches are noisier.
