Track 1 · SFT fundamentals · Lesson 15

GPU memory math

After this lesson you can estimate the GPU memory a fine-tune will use, explain why full fine-tuning needs several times the model size, and see exactly where LoRA and mixed precision save memory.

Level: intermediate · Read time: ~10 min · Prerequisites: LoRA knobs: rank, alpha, dropout, target modules; QLoRA

"Will this fit on my GPU?" is a question you'll ask constantly, and you can answer it with arithmetic. There are four consumers of training memory; knowing them turns out-of-memory errors from mysteries into predictions.

The four consumers

In order: weights (parameters × bytes per parameter), gradients (one value per trainable parameter), optimizer state (about 8 bytes per trainable parameter for AdamW), and activations (forward-pass intermediates kept for backprop, which scale with batch size × sequence length).

Why full fine-tuning needs several times the model size

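Add the first three consumers (weights, gradients, optimizer state) for a fully fine-tuned 1B-parameter model in bf16: ~2 GB (weights, at 2 bytes per parameter) + ~2 GB (gradients) + ~8 GB (AdamW state, two fp32 moment estimates at 4 bytes each) = ~12 GB before activations, about six times the weights alone. That is the rule of thumb "full fine-tuning needs several times the model's size in memory," and it's why a model whose weights fit easily can still refuse to train.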
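
The arithmetic above can be sketched as a small calculator. This is a minimal sketch, assuming 2 bytes per bf16 weight and gradient and 8 bytes of AdamW state per parameter; the function name `full_finetune_gb` and the 1 GB = 10^9 bytes convention are illustrative choices, and activations are deliberately left out.

```python
GB = 1e9  # decimal gigabytes, matching the lesson's round numbers


def full_finetune_gb(n_params, weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Estimate weights, gradients, and AdamW state in GB (activations excluded)."""
    weights = n_params * weight_bytes / GB
    grads = n_params * grad_bytes / GB
    optim = n_params * optim_bytes / GB
    return {
        "weights": weights,
        "gradients": grads,
        "optimizer": optim,
        "total": weights + grads + optim,
    }


est = full_finetune_gb(1_000_000_000)  # the 1B-parameter example
print(est["total"])  # 12.0
```

Changing `n_params` to 7_000_000_000 gives 84 GB under the same assumptions, which is why full fine-tuning of larger models quickly outgrows a single GPU.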

Where LoRA saves

With LoRA the base model is frozen, so gradients and optimizer state exist only for the small adapter, and the ~2 GB + ~8 GB above nearly vanish. You still pay for the full base weights (QLoRA's 4-bit base cuts that to roughly a quarter of the bf16 size) plus a sliver for the adapter. This is the concrete reason LoRA fits where full fine-tuning doesn't.
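
To see how small that sliver is, here is a rough sketch. It relies on the fact that a LoRA adapter on a `d_in × d_out` linear layer adds `rank × (d_in + d_out)` parameters. The model shape (16 layers, hidden size 2048, 4 targeted square projections per layer, rank 16) is hypothetical, the adapter is assumed to be stored in bf16 with fp32 AdamW state, and the 4-bit figure ignores quantization metadata overhead.

```python
GB = 1e9  # decimal gigabytes


def lora_adapter_params(d_in, d_out, rank):
    """LoRA adds two low-rank matrices: (d_in x rank) and (rank x d_out)."""
    return rank * (d_in + d_out)


def lora_memory_gb(base_params, adapter_params, base_bytes=2.0,
                   adapter_bytes=2, grad_bytes=2, optim_bytes=8):
    """Frozen base weights plus weights, gradients, and AdamW state for the adapter only."""
    base = base_params * base_bytes / GB
    adapter = adapter_params * (adapter_bytes + grad_bytes + optim_bytes) / GB
    return {"base_weights": base, "adapter_total": adapter, "total": base + adapter}


# Hypothetical model shape, chosen only for illustration.
layers, hidden, targets_per_layer, rank = 16, 2048, 4, 16
adapter = layers * targets_per_layer * lora_adapter_params(hidden, hidden, rank)

bf16_base = lora_memory_gb(1_000_000_000, adapter)                      # LoRA on a bf16 base
four_bit_base = lora_memory_gb(1_000_000_000, adapter, base_bytes=0.5)  # QLoRA-style 4-bit base

print(adapter)        # 4194304 trainable parameters
print(bf16_base)
print(four_bit_base)
```

Under these assumptions the adapter's weights, gradients, and optimizer state together come to about 0.05 GB, against 2 GB for the bf16 base or 0.5 GB for a 4-bit base.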

Mixed precision

Mixed precision stores and computes most things in 16-bit (bf16/fp16) instead of 32-bit, roughly halving weight, gradient, and activation memory while keeping training stable. bf16 is preferred on modern GPUs because it keeps fp32's 8-bit exponent, so it covers nearly the same numerical range and overflows far less readily than fp16. The quality cost is usually negligible, and it is standard practice.
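
The range difference can be checked with a few lines of arithmetic. This sketch computes the largest finite value of each format from its exponent and mantissa bit counts (fp16: 5 and 10, bf16: 8 and 7, fp32: 8 and 23); the helper name `max_finite` is illustrative.

```python
def max_finite(exponent_bits, mantissa_bits):
    """Largest finite value of an IEEE-style binary floating-point format."""
    bias = 2 ** (exponent_bits - 1) - 1
    max_exponent = bias  # the all-ones exponent field is reserved for inf/NaN
    return (2 - 2 ** -mantissa_bits) * 2.0 ** max_exponent


formats = {"fp16": (5, 10), "bf16": (8, 7), "fp32": (8, 23)}
for name, (e_bits, m_bits) in formats.items():
    print(f"{name}: max ~ {max_finite(e_bits, m_bits):.4g}")
```

fp16 tops out at 65504, while bf16 reaches roughly 3.39e38, nearly the same as fp32. bf16 pays for that range with fewer mantissa bits, so it is coarser per value but far harder to overflow.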

Managing activations

Activations are the lever you control at run time. They grow with batch size × sequence length, so the two biggest knobs for fitting a run are exactly those. Gradient checkpointing (next lesson) trades compute to shrink activation memory dramatically. Combined with the earlier lessons: pick max_seq_length from your data, set a per-device batch that fits, and recover the effective batch with gradient accumulation.
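
The batch-size trade can be sketched as follows. This is a minimal sketch using the lesson's approximation that activation memory scales with batch size × sequence length; the function and parameter names are illustrative, not taken from any particular training library.

```python
import math


def accumulation_steps(target_effective_batch, per_device_batch, num_devices=1):
    """Smallest number of accumulation steps that reaches the target effective batch."""
    per_step = per_device_batch * num_devices
    return math.ceil(target_effective_batch / per_step)


def relative_activation_cost(batch, seq_len, base_batch, base_seq_len):
    """Activation memory relative to a baseline, assuming it scales with batch x sequence length."""
    return (batch * seq_len) / (base_batch * base_seq_len)


# Want an effective batch of 32, but only 4 sequences fit per device.
steps = accumulation_steps(target_effective_batch=32, per_device_batch=4)
print(steps)  # 8

# Halving both the batch (8 -> 4) and the sequence length (2048 -> 1024).
print(relative_activation_cost(4, 1024, 8, 2048))  # 0.25
```

Accumulation keeps the effective batch at 32 while activations are only ever held for 4 sequences at a time.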

You now have a mental calculator: weights + gradients + optimizer state + activations, scaled by precision. When the sum exceeds your GPU, you get an OOM — and the next lesson is the field guide to surviving exactly that.

Key terms

Weights memory
≈ params × bytes-per-param (2 bytes at bf16).
Gradient memory
One value per trainable parameter (≈ weights size for the trained params).
Optimizer state
AdamW's ~8 bytes per trainable parameter — often the biggest consumer in full fine-tuning.
Activations
Forward-pass intermediates kept for backprop; scale with batch size × sequence length.
Mixed precision (bf16)
Computing/storing in 16-bit to roughly halve memory with stable training.
