OOM and how to survive it
After this lesson you can diagnose a CUDA out-of-memory error and apply the right levers — in the right order — to make a run fit without needlessly sacrificing quality.
Sooner or later you'll see CUDA out of memory. It's the most common fine-tuning failure and, armed with the last lesson's memory model, an entirely solvable one. This is the field guide.
Read the error
An OOM means the four consumers (weights, gradients, optimizer state, activations) together need more memory than your GPU has. It usually strikes on the first training step, when the activations saved during the forward pass are still held while the backward pass allocates gradients, or later when an unusually long sequence arrives. The fix is to shrink one or more consumers. The levers below are ordered roughly from least costly to quality to "biggest hammer."
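To make the sum concrete, here is a rough back-of-the-envelope estimator. It is a sketch under stated assumptions, not the previous lesson's exact model: 16-bit weights and gradients, an Adam-style optimizer holding two fp32 values per parameter (8 bytes), and an activation figure you supply yourself. The function name and its defaults are illustrative.

```python
def estimate_training_memory_gb(
    n_params,
    bytes_per_weight=2,           # assumption: bf16/fp16 weights
    bytes_per_grad=2,             # assumption: gradients in the same 16-bit dtype
    optimizer_bytes_per_param=8,  # assumption: Adam-style, two fp32 moments
    activation_gb=0.0,            # measured or guessed; grows with batch and sequence length
):
    """Return a per-consumer breakdown in GB (1 GB = 1e9 bytes)."""
    gb = 1e9
    breakdown = {
        "weights": n_params * bytes_per_weight / gb,
        "gradients": n_params * bytes_per_grad / gb,
        "optimizer_state": n_params * optimizer_bytes_per_param / gb,
        "activations": activation_gb,
    }
    # The right-hand side is evaluated first, so this sums the four consumers only.
    breakdown["total"] = sum(breakdown.values())
    return breakdown


if __name__ == "__main__":
    # Full fine-tuning of a hypothetical 7B-parameter model, activations excluded.
    for name, value in estimate_training_memory_gb(7e9).items():
        print(f"{name:>16}: {value:6.1f} GB")
```

Under these assumptions the fixed costs alone come to about 84 GB before any activations are counted, which is why the heavier levers go after gradients and optimizer state.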
The levers, in order
- Lower the per-device batch size, and restore the effective batch with gradient accumulation. This directly cuts activation memory with no change to training dynamics — almost always the first move.
- Enable gradient checkpointing. Instead of storing all activations for the backward pass, it re-computes them on the fly — trading ~20–30% more compute for a large drop in activation memory. Often the single most effective fix.
- Reduce max_seq_length if your data allows (recall that activations scale with sequence length). Don't clip completions to do it.
- Use mixed precision (bf16) if you somehow aren't already. Most activations are then stored in 16 bits instead of 32, roughly halving their memory.
- Switch to LoRA if you were full fine-tuning — this removes the base model's gradients and optimizer state, usually the biggest win of all.
- Use QLoRA (4-bit quantized base) if even LoRA's weight memory won't fit. It cuts the base weights to roughly a quarter of their 16-bit footprint.
- Lower the LoRA rank / target fewer modules for a final trim.
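The first two levers and bf16 usually come down to a handful of trainer settings. The sketch below uses a plain dictionary whose keys mirror the Hugging Face `TrainingArguments` parameter names; whether your trainer accepts the same names is an assumption, and the helpers `effective_batch_size` and `halve_micro_batch` are hypothetical.

```python
def effective_batch_size(config, num_devices=1):
    """Examples contributing to each optimizer step."""
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


def halve_micro_batch(config):
    """Halve the per-device batch and double accumulation, keeping the effective batch fixed."""
    per_device = config["per_device_train_batch_size"]
    if per_device % 2 != 0:
        raise ValueError("per-device batch size must be even to halve it exactly")
    new_config = dict(config)  # leave the caller's config untouched
    new_config["per_device_train_batch_size"] = per_device // 2
    new_config["gradient_accumulation_steps"] = config["gradient_accumulation_steps"] * 2
    return new_config


base = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "gradient_checkpointing": True,  # recompute activations in the backward pass
    "bf16": True,                    # mixed precision
}

smaller = halve_micro_batch(base)
print(smaller["per_device_train_batch_size"], smaller["gradient_accumulation_steps"])
print(effective_batch_size(base), effective_batch_size(smaller))
```

Both configurations feed 16 examples into each optimizer step; only the number held in memory at once changes.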
Order of operations
Start with batch size + gradient accumulation and gradient checkpointing — they cost you almost nothing in quality. Only escalate to QLoRA or smaller rank when the cheaper levers run out. Change one lever at a time so you know what worked.
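One way to picture "one lever at a time" is as a ladder that stops at the first configuration that fits. This is a sketch only: the lever functions, the config keys, and the `fits` probe (in practice, a short trial run that watches for an OOM) are all hypothetical stand-ins, and the ladder is abbreviated to three rungs.

```python
def halve_batch(cfg):
    """Halve the per-device batch and double accumulation; no-op if it cannot halve exactly."""
    if cfg["per_device_batch"] < 2 or cfg["per_device_batch"] % 2:
        return cfg
    return {
        **cfg,
        "per_device_batch": cfg["per_device_batch"] // 2,
        "grad_accum": cfg["grad_accum"] * 2,
    }


def enable_checkpointing(cfg):
    return {**cfg, "gradient_checkpointing": True}


def use_qlora(cfg):
    return {**cfg, "quantize_base_4bit": True}


# Least costly to quality first, biggest hammer last.
LADDER = [
    ("halve per-device batch, double accumulation", halve_batch),
    ("enable gradient checkpointing", enable_checkpointing),
    ("switch to QLoRA", use_qlora),
]


def first_fitting_config(cfg, fits, ladder=LADDER):
    """Apply levers cumulatively, one per attempt, until fits(cfg) is true."""
    applied = []
    if fits(cfg):
        return cfg, applied
    for name, lever in ladder:
        cfg = lever(cfg)
        applied.append(name)
        if fits(cfg):
            return cfg, applied
    raise RuntimeError(f"still does not fit after: {applied}")


def toy_fits(cfg):
    # Toy rule standing in for a real trial run: needs a small batch AND checkpointing.
    return cfg["per_device_batch"] <= 4 and cfg["gradient_checkpointing"]


start = {
    "per_device_batch": 8,
    "grad_accum": 2,
    "gradient_checkpointing": False,
    "quantize_base_4bit": False,
}
final, applied = first_fitting_config(start, toy_fits)
print(applied)
```

Because the loop records each lever as it is applied, the returned list shows exactly which change made the run fit, and the QLoRA rung is never touched if a cheaper one suffices.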
Why "auto-retry on OOM" exists
Because OOM is so common, mature training systems (BrewSLM included) automate the response: on an OOM they retry with a smaller sequence length or batch before giving up, so a single oversized example doesn't kill an overnight run. Knowing the manual levers means you understand exactly what such an auto-retry is doing, and can configure it sensibly.
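Here is a minimal sketch of what such a retry loop might look like. It is not BrewSLM's implementation, which this lesson does not show; `StepOOM`, `run_step_with_retry`, and the halving policy are assumptions made for illustration. In PyTorch the exception to catch would be `torch.cuda.OutOfMemoryError` (available in recent releases); a stand-in class keeps the example runnable without a GPU.

```python
class StepOOM(RuntimeError):
    """Stand-in for the framework's out-of-memory exception."""


def run_step_with_retry(step_fn, batch_size, max_seq_length,
                        min_batch_size=1, min_seq_length=512):
    """Run one step, shrinking batch first and sequence length second on OOM.

    Returns (result, batch_size, max_seq_length) for the attempt that succeeded.
    """
    while True:
        try:
            return step_fn(batch_size, max_seq_length), batch_size, max_seq_length
        except StepOOM:
            if batch_size > min_batch_size:
                batch_size = max(min_batch_size, batch_size // 2)
            elif max_seq_length > min_seq_length:
                # Shorter sequences can truncate data, so report it rather than do it silently.
                max_seq_length = max(min_seq_length, max_seq_length // 2)
                print(f"warning: retrying with max_seq_length={max_seq_length}")
            else:
                raise  # nothing left to shrink; surface the OOM


def toy_step(batch_size, max_seq_length):
    # Toy memory rule: fail when the tokens per step exceed a fixed budget.
    if batch_size * max_seq_length > 4096:
        raise StepOOM("simulated CUDA out of memory")
    return "ok"


print(run_step_with_retry(toy_step, batch_size=8, max_seq_length=2048))
```

The policy shrinks the batch before the sequence length because a smaller batch leaves the data untouched, while a shorter sequence limit may cut into examples. The floors (`min_batch_size`, `min_seq_length`) are the kind of setting worth configuring deliberately.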
A caution
Don't fix OOM by silently truncating completions or dropping hard examples — that trades a crash for a quieter quality bug (recall the truncation warning from the tokenization lesson). The levers above shrink memory without corrupting your data; reach for those first. With memory under control, the last two lessons of the track turn to reading the result: spotting overfitting, and evaluating properly.
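Before lowering `max_seq_length`, it helps to count how many examples the new limit would cut into, so that any truncation is a decision rather than a surprise. A minimal sketch, assuming you already have per-example token counts for prompts and completions; the function name and the toy numbers are illustrative.

```python
def truncation_report(prompt_lens, completion_lens, max_seq_length):
    """Count examples whose prompt + completion would not fit in max_seq_length."""
    if len(prompt_lens) != len(completion_lens):
        raise ValueError("prompt_lens and completion_lens must have the same length")
    too_long = [
        i
        for i, (p, c) in enumerate(zip(prompt_lens, completion_lens))
        if p + c > max_seq_length
    ]
    total = len(prompt_lens)
    return {
        "affected_indices": too_long,
        "affected_count": len(too_long),
        "affected_fraction": len(too_long) / total if total else 0.0,
    }


# Toy token counts for four examples.
prompt_lens = [120, 900, 300, 1500]
completion_lens = [80, 400, 100, 700]

report = truncation_report(prompt_lens, completion_lens, max_seq_length=1024)
print(report)
```

If the affected fraction is not close to zero, a lever that leaves the data alone (smaller per-device batch, gradient checkpointing) is the safer choice.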
Key terms
- CUDA OOM: Out-of-memory error raised when the four memory consumers exceed GPU capacity.
- Gradient checkpointing: Re-computing activations during backprop instead of storing them; trades compute for big memory savings.
- Batch-size lever: Lowering the per-device batch (then restoring the effective batch via accumulation) to cut activation memory without changing dynamics.
- Auto-OOM retry: Automatically re-running a step with a smaller sequence length or batch after an OOM, so one big example doesn't kill a run.