GPU memory math
After this lesson you can estimate the GPU memory a fine-tune will use, explain why full fine-tuning needs several times the model size, and see exactly where LoRA and mixed precision save memory.
"Will this fit on my GPU?" is a question you'll ask constantly, and you can answer it with arithmetic. There are four consumers of training memory; knowing them turns out-of-memory errors from mysteries into predictions.
The four consumers
- Weights — the parameters themselves. ≈ params × bytes-per-param. At bf16 (2 bytes), a 1B model ≈ 2 GB.
- Gradients — one per trainable parameter, same size as the weights you're training. Full fine-tuning: another ≈ 2 GB for a 1B model.
- Optimizer state — AdamW keeps two extra values per trainable parameter (and often in 32-bit), so roughly 8 bytes per trainable parameter — frequently the largest single consumer. For a 1B model fully fine-tuned, ≈ 8 GB.
- Activations — intermediate values from the forward pass, kept for the backward pass. These scale with batch size and sequence length (not parameter count), and can be large.
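The first three consumers depend only on parameter counts and byte widths, so they can be sketched in a few lines of Python. The function name, its arguments, and the choice of decimal gigabytes (1 GB = 1e9 bytes) are illustrative assumptions, not part of any library; activations are left out because they depend on batch size and sequence length.

```python
GB = 1e9  # assumption: decimal gigabytes, so 1B params x 2 bytes = 2 GB


def static_memory_gb(total_params, trainable_params,
                     bytes_per_param=2, optimizer_bytes_per_param=8):
    """Estimate weights, gradients, and AdamW state in GB (activations excluded)."""
    return {
        "weights": total_params * bytes_per_param / GB,
        "gradients": trainable_params * bytes_per_param / GB,
        "optimizer": trainable_params * optimizer_bytes_per_param / GB,
    }


# Full fine-tuning of a 1B-parameter model at bf16: every parameter is trainable.
full = static_memory_gb(total_params=1e9, trainable_params=1e9)
print(full)  # {'weights': 2.0, 'gradients': 2.0, 'optimizer': 8.0}
```

Treat the result as a lower bound: real runs also hold activations, temporary buffers, and framework overhead.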
Why full fine-tuning needs several times the model size
Add the first three for a fully fine-tuned 1B model at bf16: ~2 (weights) + ~2 (gradients) + ~8 (AdamW state) = ~12 GB before activations. That is about six times the 2 GB the weights alone occupy, which is where the rule of thumb "full fine-tuning needs several times the model's size in memory" comes from. It's why a model whose weights fit easily can still run out of memory the moment you try to train it.
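The same arithmetic can be run in reverse to ask how large a model a given GPU could fully fine-tune. This is a rough sketch under the lesson's numbers (2 bytes for weights, 2 for gradients, 8 for AdamW state per parameter); the function name, the 24 GB example card, and the `activation_reserve_gb` parameter are hypothetical.

```python
def max_full_finetune_params(gpu_gb, bytes_per_param=2,
                             optimizer_bytes_per_param=8,
                             activation_reserve_gb=0.0):
    """Largest parameter count whose weights + gradients + AdamW state fit in gpu_gb."""
    # Bytes needed per trainable parameter: weight + gradient + optimizer state.
    per_param = 2 * bytes_per_param + optimizer_bytes_per_param
    usable_bytes = (gpu_gb - activation_reserve_gb) * 1e9  # decimal GB assumed
    return usable_bytes / per_param


# A hypothetical 24 GB card, ignoring activations entirely.
print(max_full_finetune_params(24) / 1e9)  # 2.0 (billion parameters)

# Reserving 6 GB for activations leaves room for a smaller model.
print(max_full_finetune_params(24, activation_reserve_gb=6) / 1e9)  # 1.5
```

At 12 bytes per parameter, a card that can hold a 12B model's bf16 weights can fully fine-tune only about a 2B model, and less once activations are counted.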
Where LoRA saves
With LoRA the base is frozen, so gradients and optimizer state exist only for the tiny adapter, and the ~2 GB + ~8 GB above nearly vanish. You still pay for the full base weights (QLoRA's 4-bit base cuts that to roughly a quarter of the bf16 size), plus a sliver for the adapter's own weights, gradients, and optimizer state. This is the concrete reason LoRA fits where full fine-tuning doesn't.
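A side-by-side estimate makes the saving concrete. The sketch below assumes a hypothetical adapter of 5 million parameters (0.5% of a 1B base) and treats a 4-bit base as an idealized 0.5 bytes per parameter; real quantized formats carry some extra overhead, and the function names are illustrative.

```python
GB = 1e9  # assumption: decimal gigabytes

BASE_PARAMS = 1e9
ADAPTER_PARAMS = 5e6  # hypothetical adapter size, 0.5% of the base


def full_finetune_gb(params, bytes_per_param=2, optimizer_bytes=8):
    """Weights + gradients + AdamW state when every parameter is trained."""
    return params * (2 * bytes_per_param + optimizer_bytes) / GB


def lora_gb(base_params, adapter_params, base_bytes=2,
            adapter_bytes=2, optimizer_bytes=8):
    """Frozen base weights, plus weights + gradients + AdamW state for the adapter."""
    frozen = base_params * base_bytes
    adapter = adapter_params * (2 * adapter_bytes + optimizer_bytes)
    return (frozen + adapter) / GB


full = full_finetune_gb(BASE_PARAMS)
lora = lora_gb(BASE_PARAMS, ADAPTER_PARAMS)
qlora = lora_gb(BASE_PARAMS, ADAPTER_PARAMS, base_bytes=0.5)  # idealized 4-bit base

print(f"full: {full:.2f} GB, LoRA: {lora:.2f} GB, QLoRA: {qlora:.2f} GB")
# full: 12.00 GB, LoRA: 2.06 GB, QLoRA: 0.56 GB
```

All three figures exclude activations, which LoRA does not remove.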
Mixed precision
Mixed precision stores and computes most things in 16-bit (bf16/fp16) instead of 32-bit, roughly halving weight, gradient, and activation memory relative to full 32-bit training while keeping training stable. bf16 is preferred on modern GPUs because it keeps the same exponent range as fp32, so it overflows far less readily than fp16. The quality cost is usually negligible, and it is standard practice.
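The halving follows directly from the byte width of each format. A minimal sketch, with an illustrative lookup table and function name:

```python
# Bytes used to store one value in each floating-point format.
BYTES_PER_VALUE = {"fp32": 4, "bf16": 2, "fp16": 2}


def tensor_gb(num_values, dtype):
    """Size in decimal GB of num_values stored in the given format."""
    return num_values * BYTES_PER_VALUE[dtype] / 1e9


# Weights of a 1B-parameter model in 32-bit versus 16-bit.
for dtype in ("fp32", "bf16"):
    print(dtype, tensor_gb(1e9, dtype))
# fp32 4.0
# bf16 2.0
```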
Managing activations
Activations are the lever you control at run time. They grow with batch size × sequence length, so the two biggest knobs for fitting a run are exactly those. Gradient checkpointing (next lesson) trades compute to shrink activation memory dramatically. Combined with the earlier lessons: pick max_seq_length from your data, set a per-device batch that fits, and recover the effective batch with gradient accumulation.
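The batch-size trade can be sketched as two small helpers: one computes how many gradient-accumulation steps restore a target effective batch, the other compares activation cost between two settings under the lesson's assumption that it scales with batch size × sequence length. Both function names and the example numbers are hypothetical.

```python
import math


def accumulation_steps(effective_batch, per_device_batch, num_devices=1):
    """Gradient-accumulation steps needed to reach the target effective batch."""
    per_step = per_device_batch * num_devices
    return math.ceil(effective_batch / per_step)


def relative_activation_cost(batch, seq_len, ref_batch, ref_seq_len):
    """Activation memory relative to a reference run, assuming batch x seq_len scaling."""
    return (batch * seq_len) / (ref_batch * ref_seq_len)


# Target an effective batch of 32, but only 4 sequences fit per device.
steps = accumulation_steps(effective_batch=32, per_device_batch=4)
print(steps)  # 8

# Dropping from batch 32 at 2048 tokens to batch 4 at 1024 tokens.
ratio = relative_activation_cost(4, 1024, ref_batch=32, ref_seq_len=2048)
print(ratio)  # 0.0625
```

The linear scaling is an approximation; the point is that a smaller per-device batch and a shorter sequence length shrink activations while accumulation keeps the effective batch unchanged.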
You now have a mental calculator: weights + gradients + optimizer state + activations, scaled by precision. When the sum exceeds your GPU, you get an OOM — and the next lesson is the field guide to surviving exactly that.
Key terms
- Weights memory: ≈ params × bytes-per-param (2 bytes at bf16).
- Gradient memory: one value per trainable parameter (≈ weights size for the trained params).
- Optimizer state: AdamW's ~8 bytes per trainable parameter; often the biggest consumer in full fine-tuning.
- Activations: forward-pass intermediates kept for backprop; scale with batch size × sequence length.
- Mixed precision (bf16): computing/storing in 16-bit to roughly halve memory with stable training.