In full fine-tuning, which is often the LARGEST memory consumer?

AdamW optimizer state (~8 bytes/param)

Why does LoRA fit where full fine-tuning doesn't?

Gradients and optimizer state exist only for the tiny adapter, not the frozen base

What does mixed precision (bf16) do?

Roughly halves weight/gradient/activation memory with stable training

Track 1 · SFT fundamentals · Lesson 15

GPU memory math

After this lesson you can estimate the GPU memory a fine-tune will use, explain why full fine-tuning needs several times the model size, and see exactly where LoRA and mixed precision save memory.

Level: intermediate · Read time: ~10 min · Prerequisites: "LoRA knobs: rank, alpha, dropout, target modules", "QLoRA"

"Will this fit on my GPU?" is a question you'll ask constantly, and you can answer it with arithmetic. There are four consumers of training memory; knowing them turns out-of-memory errors from mysteries into predictions.

The four consumers

Weights — the parameters themselves. ≈ params × bytes-per-param. At bf16 (2 bytes), a 1B model ≈ 2 GB.
Gradients — one per trainable parameter, same size as the weights you're training. Full fine-tuning: another ≈ 2 GB for a 1B model.
Optimizer state — AdamW keeps two extra values per trainable parameter (and often in 32-bit), so roughly 8 bytes per trainable parameter — frequently the largest single consumer. For a 1B model fully fine-tuned, ≈ 8 GB.
Activations — intermediate values from the forward pass, kept for the backward pass. These scale with batch size and sequence length (not parameter count), and can be large.
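The three parameter-proportional consumers above can be turned into a small calculator. This is a minimal sketch under the lesson's own assumptions: 2 bytes per bf16 weight and gradient, 8 bytes of AdamW state per trainable parameter (two 32-bit values), and 1 GB taken as 10^9 bytes. The function name `param_memory_gb` and its arguments are illustrative, not from any library, and activations are deliberately left out because they do not depend on parameter count.

```python
GB = 1e9  # decimal gigabytes, so 1B params x 2 bytes = 2 GB as in the lesson


def param_memory_gb(total_params, trainable_params,
                    weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Estimate weights, gradients, and optimizer state in GB.

    Activations are excluded: they scale with batch size and sequence
    length, not with parameter count.
    """
    weights = total_params * weight_bytes / GB
    gradients = trainable_params * grad_bytes / GB
    optimizer = trainable_params * optim_bytes / GB
    return {
        "weights": weights,
        "gradients": gradients,
        "optimizer": optimizer,
        "total": weights + gradients + optimizer,
    }


# Full fine-tuning of a 1B-parameter model: every parameter is trainable.
estimate = param_memory_gb(total_params=1e9, trainable_params=1e9)
for name, gb in estimate.items():
    print(f"{name:>10}: {gb:5.1f} GB")
```

Treat the result as a lower bound: real runs also hold activations, temporary buffers, and framework overhead on top of these three terms.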

Why full fine-tuning needs several times the model size

Add the first three for a fully fine-tuned 1B model at bf16: ~2 GB (weights) + ~2 GB (gradients) + ~8 GB (AdamW state) = ~12 GB before activations, about six times the ~2 GB the weights occupy on their own. That is the rule of thumb "full fine-tuning needs several times the model's size in memory," and it is why a model whose weights fit easily can still refuse to train.
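The same sum can be applied to other model sizes. The sketch below keeps the lesson's byte counts (2 + 2 + 8 per parameter) and treats 1 GB as 10^9 bytes. The 3B and 7B sizes are illustrative choices, not figures from the lesson, and the helper names are hypothetical.

```python
BYTES_WEIGHTS = 2  # bf16 weights
BYTES_GRADS = 2    # bf16 gradients
BYTES_ADAMW = 8    # two 32-bit AdamW values per trainable parameter


def full_finetune_gb(params):
    """Weights + gradients + AdamW state in GB, before activations."""
    return params * (BYTES_WEIGHTS + BYTES_GRADS + BYTES_ADAMW) / 1e9


def multiple_of_weights():
    """How many times the weight memory a full fine-tune needs."""
    return (BYTES_WEIGHTS + BYTES_GRADS + BYTES_ADAMW) / BYTES_WEIGHTS


# Illustrative model sizes; only the 1B case appears in the lesson.
for label, params in [("1B", 1e9), ("3B", 3e9), ("7B", 7e9)]:
    weights_gb = params * BYTES_WEIGHTS / 1e9
    print(f"{label}: weights {weights_gb:.0f} GB, "
          f"full fine-tune {full_finetune_gb(params):.0f} GB before activations")

print(f"multiple of weight memory: {multiple_of_weights():.0f}x")
```

Because every term is proportional to parameter count, the multiple stays the same at any model size under these byte counts; only activations break that proportionality.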

Where LoRA saves

With LoRA the base is frozen, so gradients and optimizer state exist only for the tiny adapter, and the ~2 GB + ~8 GB above nearly vanish. You still pay full weight memory for the base (QLoRA's 4-bit base cuts that to roughly a quarter of the bf16 size) plus a sliver for the adapter's own weights, gradients, and optimizer state. This is the concrete reason LoRA fits where full fine-tuning doesn't.
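To put rough numbers on that comparison, here is a sketch that contrasts full fine-tuning, LoRA on a bf16 base, and LoRA on a 4-bit base. The adapter size of 10 million parameters is a made-up illustrative figure (the real count depends on rank and target modules), the function names are hypothetical, and the 4-bit case ignores the small overhead of quantization metadata.

```python
TRAINABLE_BYTES = 2 + 2 + 8  # bf16 weight + bf16 gradient + AdamW state


def full_finetune_gb(params):
    """Every parameter carries weights, gradients, and optimizer state."""
    return params * TRAINABLE_BYTES / 1e9


def lora_gb(base_params, adapter_params, base_bytes=2.0):
    """Frozen base holds weights only; the adapter carries all three."""
    frozen_base = base_params * base_bytes
    adapter = adapter_params * TRAINABLE_BYTES
    return (frozen_base + adapter) / 1e9


BASE_PARAMS = 1e9
ADAPTER_PARAMS = 10e6  # assumption for illustration only

full = full_finetune_gb(BASE_PARAMS)
lora_bf16 = lora_gb(BASE_PARAMS, ADAPTER_PARAMS, base_bytes=2.0)
lora_4bit = lora_gb(BASE_PARAMS, ADAPTER_PARAMS, base_bytes=0.5)

print(f"full fine-tune   : {full:.2f} GB")
print(f"LoRA, bf16 base  : {lora_bf16:.2f} GB")
print(f"LoRA, 4-bit base : {lora_4bit:.2f} GB")
```

All three figures exclude activations, which LoRA does not remove: the forward and backward passes still run through the full base model.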

Mixed precision

Mixed precision stores and computes most things in 16-bit (bf16/fp16) instead of 32-bit, roughly halving weight, gradient, and activation memory relative to fp32 while keeping training stable. bf16 is preferred on modern GPUs because it keeps the same exponent range as fp32, so it is far less prone to overflow than fp16. It costs almost nothing in quality and is standard practice.
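The halving follows directly from bytes per value. A minimal sketch, assuming a 1B-parameter model and 1 GB = 10^9 bytes; the lookup table and the `tensor_gb` helper are illustrative, not part of any framework.

```python
# Storage size of one value in each floating-point format.
BYTES_PER_VALUE = {"fp32": 4, "bf16": 2, "fp16": 2}


def tensor_gb(num_values, dtype):
    """Memory in GB for num_values stored in the given format."""
    return num_values * BYTES_PER_VALUE[dtype] / 1e9


params = 1e9  # the lesson's 1B-parameter example
for dtype in ("fp32", "bf16"):
    print(f"weights at {dtype}: {tensor_gb(params, dtype):.0f} GB")
```

The same ratio applies to gradients and activations stored in 16-bit. Optimizer state is the usual exception, since the lesson notes it is often kept in 32-bit.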

Managing activations

Activations are the lever you control at run time. They grow with batch size × sequence length, so the two biggest knobs for fitting a run are exactly those. Gradient checkpointing (next lesson) trades compute to shrink activation memory dramatically. Combined with the earlier lessons: pick max_seq_length from your data, set a per-device batch that fits, and recover the effective batch with gradient accumulation.
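The batch-size bookkeeping in that recipe can be sketched as follows. The helper names are hypothetical, the batch and sequence-length numbers are made up for illustration, and the scaling function assumes activation memory is simply proportional to batch size × sequence length, as the lesson states; some attention implementations grow faster than that with sequence length.

```python
import math


def accumulation_steps(target_effective_batch, per_device_batch, num_devices=1):
    """Accumulation steps needed to reach the target effective batch."""
    return math.ceil(target_effective_batch / (per_device_batch * num_devices))


def activation_scale(batch, seq_len, ref_batch, ref_seq_len):
    """Activation memory relative to a reference run.

    Assumes memory is proportional to batch size x sequence length.
    """
    return (batch * seq_len) / (ref_batch * ref_seq_len)


# Illustrative numbers: a run at batch 8, length 2048 ran out of memory,
# so try batch 4 at length 1024 and keep the effective batch at 32.
scale = activation_scale(batch=4, seq_len=1024, ref_batch=8, ref_seq_len=2048)
steps = accumulation_steps(target_effective_batch=32, per_device_batch=4)

print(f"activation memory vs reference: {scale:.2f}x")
print(f"gradient accumulation steps: {steps}")
```

Accumulation keeps the effective batch, and so the optimization behavior, roughly unchanged while only one small batch of activations is resident at a time.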

You now have a mental calculator: weights + gradients + optimizer state + activations, scaled by precision. When the sum exceeds your GPU, you get an OOM — and the next lesson is the field guide to surviving exactly that.

Key terms

Weights memory: ≈ params × bytes-per-param (2 bytes at bf16).
Gradient memory: One value per trainable parameter (≈ weights size for the trained params).
Optimizer state: AdamW's ~8 bytes per trainable parameter — often the biggest consumer in full fine-tuning.
Activations: Forward-pass intermediates kept for backprop; scale with batch size × sequence length.
Mixed precision (bf16): Computing/storing in 16-bit to roughly halve memory with stable training.
