What is usually the first, cheapest lever against OOM?

Lower per-device batch size + gradient accumulation

Gradient checkpointing saves memory by…

Re-computing activations during backprop instead of storing them

Switching from full fine-tuning to LoRA helps OOM because it…

Removes the base model's gradients and optimizer state

Which is the WRONG way to fix OOM?

Silently truncating completions / dropping hard examples

Track 1 · SFT fundamentals · Lesson 16

OOM and how to survive it

After this lesson you can diagnose a CUDA out-of-memory error and apply the right levers — in the right order — to make a run fit without needlessly sacrificing quality.

Level: intermediate · Read time: ~8 min · Prerequisites: GPU memory math

Sooner or later you'll see CUDA out of memory. It's the most common fine-tuning failure and, armed with the last lesson's memory model, an entirely solvable one. This is the field guide.

Read the error

An OOM means the four consumers (weights, gradients, optimizer state, activations) together need more memory than your GPU has. It usually strikes on the first training step, when stored activations peak between the end of the forward pass and the start of the backward pass, or later, when an unusually long sequence arrives. The fix is to shrink one or more consumers. The levers below are ordered roughly from "least cost to quality" to "biggest hammer."
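To make the "four consumers" sum concrete, here is a rough sketch of the arithmetic. The byte counts are assumptions, not measurements: bf16 weights and gradients (2 bytes each) and an AdamW-style optimizer holding two fp32 moments (8 bytes per parameter). Activation memory depends on batch size, sequence length, and architecture, so this sketch takes it as a number you supply. The function names are hypothetical.

```python
def estimate_training_memory_gb(
    n_params,
    activation_gb,
    bytes_per_weight=2,           # assumption: weights held in bf16
    bytes_per_grad=2,             # assumption: gradients held in bf16
    optimizer_bytes_per_param=8,  # assumption: two fp32 moments (AdamW-style)
):
    """Return the four consumers and their total, in GB (1 GB = 1e9 bytes)."""
    gb = 1e9
    consumers = {
        "weights": n_params * bytes_per_weight / gb,
        "gradients": n_params * bytes_per_grad / gb,
        "optimizer_state": n_params * optimizer_bytes_per_param / gb,
        "activations": activation_gb,  # supplied by you, e.g. from a profiler
    }
    # Sum the four entries before adding the "total" key.
    consumers["total"] = sum(consumers.values())
    return consumers


def fits(consumers, gpu_gb, headroom_gb=2.0):
    """True if the estimate fits on the GPU with some headroom to spare."""
    return consumers["total"] + headroom_gb <= gpu_gb


# Full fine-tuning of a 7B-parameter model with a guessed 10 GB of activations.
estimate = estimate_training_memory_gb(n_params=7e9, activation_gb=10.0)
for name, value in estimate.items():
    print(f"{name:>16}: {value:6.1f} GB")
print("fits on an 80 GB GPU:", fits(estimate, gpu_gb=80.0))
```

Under these assumptions the optimizer state is the largest single consumer, which is why levers that remove it matter so much.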

The levers, in order

Lower the per-device batch size, and restore the effective batch with gradient accumulation. This directly cuts activation memory with essentially no change to training dynamics (each optimizer step simply spans more micro-batches), so it is almost always the first move.
Enable gradient checkpointing. Instead of storing all activations for the backward pass, it re-computes them on the fly, trading roughly 20–30% more compute for a large drop in activation memory. Often the single most effective fix when activations dominate.
Reduce max_seq_length if your data allows (recall activations scale with sequence length). Don't clip completions to do it.
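Use mixed precision (bf16) if you somehow aren't already. Compared with fp32, 16-bit activations take about half the memory, and weight memory halves too if the model itself is loaded in bf16.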
Switch to LoRA if you were full fine-tuning — this removes the base model's gradients and optimizer state, usually the biggest win of all.
Use QLoRA (4-bit quantized base) if even LoRA's weight memory won't fit. Compared with 16-bit weights, it cuts the base model's weight footprint to roughly a quarter.
Lower the LoRA rank / target fewer modules for a final trim.
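The claim behind the first lever, that a smaller batch plus accumulation leaves the update unchanged, can be checked on a toy problem. The sketch below uses a one-parameter model and mean squared error, both chosen purely for illustration. It assumes equal-sized micro-batches and a per-example mean loss. With a token-level mean over variable-length sequences, the match is only approximate.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_grad(w, batch, micro_batch_size):
    """Gradient built up from micro-batches, as gradient accumulation does."""
    micro_batches = [
        batch[i:i + micro_batch_size]
        for i in range(0, len(batch), micro_batch_size)
    ]
    accum_steps = len(micro_batches)
    total = 0.0
    for micro_batch in micro_batches:
        # Scale each micro-batch gradient by 1/accum_steps so the sum
        # is an average over the whole effective batch.
        total += grad_mse(w, micro_batch) / accum_steps
    return total


data = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0), (4.0, 8.0),
        (5.0, 9.0), (6.0, 13.0), (7.0, 15.0), (8.0, 15.0)]

full = grad_mse(0.5, data)                               # one batch of 8
accum = accumulated_grad(0.5, data, micro_batch_size=2)  # 4 micro-batches of 2

print("full-batch gradient: ", full)
print("accumulated gradient:", accum)
```

Only one micro-batch's activations need to be held at a time, which is where the memory saving comes from.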

Order of operations

Start with batch size + gradient accumulation and gradient checkpointing — they cost you almost nothing in quality. Only escalate to QLoRA or smaller rank when the cheaper levers run out. Change one lever at a time so you know what worked.
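One way to keep yourself honest about "one lever at a time" is to write the escalation down as a list of configs, each differing from the previous one by a single lever. The sketch below is illustrative only: the config keys and lever functions are hypothetical names, not any particular trainer's API, and it covers just four of the levers above.

```python
BASELINE = {
    "per_device_batch_size": 8,        # hypothetical key names throughout
    "gradient_accumulation_steps": 1,
    "gradient_checkpointing": False,
    "method": "full",                  # "full", "lora", or "qlora"
}


def halve_batch(cfg):
    """Halve the batch and double accumulation (assumes a power-of-two batch)."""
    if cfg["per_device_batch_size"] < 2:
        return None  # lever exhausted
    return {
        **cfg,
        "per_device_batch_size": cfg["per_device_batch_size"] // 2,
        "gradient_accumulation_steps": cfg["gradient_accumulation_steps"] * 2,
    }


def enable_checkpointing(cfg):
    if cfg["gradient_checkpointing"]:
        return None
    return {**cfg, "gradient_checkpointing": True}


def switch_to_lora(cfg):
    if cfg["method"] != "full":
        return None
    return {**cfg, "method": "lora"}


def switch_to_qlora(cfg):
    if cfg["method"] != "lora":
        return None
    return {**cfg, "method": "qlora"}


# Cheapest-in-quality levers first, biggest hammers last.
LADDER = [halve_batch, enable_checkpointing, switch_to_lora, switch_to_qlora]


def escalation_plan(cfg):
    """Return (lever_name, config) pairs, each adding one lever to the last."""
    plan = []
    for lever in LADDER:
        candidate = lever(cfg)
        if candidate is not None:
            plan.append((lever.__name__, candidate))
            cfg = candidate
    return plan


def effective_batch(cfg):
    return cfg["per_device_batch_size"] * cfg["gradient_accumulation_steps"]


plan = escalation_plan(BASELINE)
for name, cfg in plan:
    print(name, cfg)
```

In practice you would try the configs in order and stop at the first one that fits, so you never pay for a bigger hammer than you need.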

Why "auto-retry on OOM" exists

Because OOM is so common, mature training systems (BrewSLM included) automate this: on an OOM they retry with a smaller sequence length or batch before giving up, so a single oversized example doesn't kill an overnight run. Knowing the manual levers means you understand exactly what such auto-retry is doing, and can configure it sensibly.
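To show the general shape of such a retry loop, here is a minimal sketch. It is not BrewSLM's implementation, which this lesson does not show. `SimulatedOOM` is a stand-in exception so the example runs without a GPU (in PyTorch you would typically catch `torch.cuda.OutOfMemoryError` and free cached memory before retrying), and `fake_train_step` is a hypothetical placeholder for a real training step.

```python
class SimulatedOOM(RuntimeError):
    """Stand-in for the framework's out-of-memory exception."""


def run_step_with_oom_retry(train_step, batch_size, accum_steps, min_batch_size=1):
    """Run one step; on OOM, halve the batch and double accumulation, then retry.

    Assumes a power-of-two batch size so the effective batch is preserved.
    Returns (result, batch_size_used, accum_steps_used).
    """
    while True:
        try:
            result = train_step(batch_size, accum_steps)
            return result, batch_size, accum_steps
        except SimulatedOOM:
            if batch_size <= min_batch_size:
                raise  # nothing left to shrink: give up and surface the error
            batch_size = max(min_batch_size, batch_size // 2)
            accum_steps *= 2
            print(f"OOM: retrying with batch_size={batch_size}, accum_steps={accum_steps}")


def fake_train_step(batch_size, accum_steps):
    """Hypothetical step that only 'fits' when the batch is 2 or smaller."""
    if batch_size > 2:
        raise SimulatedOOM(f"batch of {batch_size} does not fit")
    return "ok"


outcome = run_step_with_oom_retry(fake_train_step, batch_size=16, accum_steps=1)
print(outcome)
```

Note that the loop logs every retry. A retry that shrinks the sequence length instead of the batch should be at least as loud, since that changes what the model actually sees.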

A caution

Don't fix OOM by silently truncating completions or dropping hard examples — that trades a crash for a quieter quality bug (recall the truncation warning from the tokenization lesson). The levers above shrink memory without corrupting your data; reach for those first. With memory under control, the last two lessons of the track turn to reading the result: spotting overfitting, and evaluating properly.
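The opposite of silent truncation is a loud one: before lowering the sequence limit, count what it would cut. The sketch below assumes you already have the tokenized length of each example. The function name and the report fields are hypothetical.

```python
def truncation_report(token_lengths, max_seq_length):
    """Summarize how many examples a given max_seq_length would truncate."""
    too_long = [n for n in token_lengths if n > max_seq_length]
    num_examples = len(token_lengths)
    return {
        "num_examples": num_examples,
        "num_truncated": len(too_long),
        "fraction_truncated": len(too_long) / num_examples if num_examples else 0.0,
        # Total tokens that would be cut off across all over-length examples.
        "tokens_lost": sum(n - max_seq_length for n in too_long),
    }


# Illustrative lengths only; in practice these come from your tokenizer.
lengths = [120, 480, 900, 1500, 2300, 4100]
report = truncation_report(lengths, max_seq_length=2048)
print(report)
```

If the truncated fraction is not close to zero, prefer one of the memory levers over a shorter limit, or at least make the truncation an explicit, reviewed decision.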

Key terms

CUDA OOM: Out-of-memory error when the four memory consumers exceed GPU capacity.
Gradient checkpointing: Re-computing activations during backprop instead of storing them — trades compute for big memory savings.
Batch-size lever: Lowering per-device batch (then restoring effective batch via accumulation) to cut activation memory without changing dynamics.
Auto-OOM retry: Automatically re-running a step with smaller sequence/batch after an OOM, so one big example doesn't kill a run.

Check yourself
