Full fine-tuning vs LoRA
After this lesson you can explain the difference between full fine-tuning and LoRA, describe how LoRA's low-rank adapters work, and say why LoRA is the default for fine-tuning small models on modest hardware.
The training loop updates parameters — but which parameters? The classic answer is "all of them" (full fine-tuning). The modern default for small-model work is LoRA, which updates almost none of them and yet works remarkably well. Understanding why is one of the most useful things in this track.
Full fine-tuning
Full fine-tuning updates every parameter in the model. It's the most expressive option, since nothing is held back, and with large datasets it can squeeze out the best quality. But it's expensive: you must store a gradient and optimizer state for every parameter (with an Adam-style optimizer, that is several times the model's own size in memory), and each fine-tuned variant is a full copy of the model on disk. For a small language model (SLM) this is feasible on one GPU; it gets painful fast as models grow.
LoRA: train a tiny add-on instead
LoRA (Low-Rank Adaptation) starts from an insight: the change a fine-tune makes to a big weight matrix is usually "low-rank", meaning it can be well approximated by the product of two much smaller matrices. So LoRA freezes the original model entirely and, next to chosen weight matrices, inserts two small trainable matrices. The input passes through both, and the result is added to the frozen layer's output. Only those small matrices, together called the adapter, are trained.
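The mechanism can be sketched in a few lines of NumPy. This is a minimal illustration, not a training-ready implementation: the sizes, the names `W`, `A`, `B`, and the `alpha / r` scaling factor are assumptions chosen for the example. Starting `B` at zero is a common initialization, so the adapter contributes nothing until training moves it.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 64, 48, 4   # illustrative sizes; r is the adapter rank
alpha = 8                    # assumed scaling hyperparameter
scale = alpha / r

W = rng.normal(size=(d_out, d_in))           # frozen base weight, never updated
A = rng.normal(scale=0.01, size=(r, d_in))   # trainable, small random init
B = np.zeros((d_out, r))                     # trainable, zero init

def lora_forward(x, W, A, B, scale):
    """Frozen layer output plus the low-rank adapter's contribution."""
    base = x @ W.T                # what the original layer computes
    update = (x @ A.T) @ B.T      # down to r dimensions, then back up to d_out
    return base + scale * update

x = rng.normal(size=(2, d_in))
y = lora_forward(x, W, A, B, scale)
```

Because `B` starts at zero, `y` equals the base layer's output before any training step. Whatever values `A` and `B` later take, the change they encode, `B @ A`, has rank at most `r`.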
The numbers are striking: a LoRA adapter is often well under 1% of the model's parameters. That cascades into every cost that matters.
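To see where a figure like "under 1%" comes from, here is a quick count for a single adapted weight matrix. The 4096 x 4096 shape and rank 8 are illustrative assumptions, not values from any particular model.

```python
def lora_param_fraction(d_out, d_in, r):
    """Adapter vs. full parameter counts for one adapted weight matrix."""
    full = d_out * d_in           # parameters in the frozen matrix
    adapter = r * (d_out + d_in)  # B is d_out x r, A is r x d_in
    return adapter, full, adapter / full

adapter, full, frac = lora_param_fraction(4096, 4096, 8)
print(f"{adapter:,} adapter vs {full:,} full ({frac:.2%})")
# prints: 65,536 adapter vs 16,777,216 full (0.39%)
```

This is the ratio for one matrix. If only some of the model's matrices get adapters, the share of the whole model is smaller still.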
Why LoRA wins on small hardware
Because the base is frozen, you only store gradients and optimizer state for the tiny adapter, which slashes the memory that usually blocks fine-tuning. The frozen base weights still have to sit in memory, but they need no gradient or optimizer state. And each fine-tune is a small adapter file, typically megabytes rather than gigabytes, not a full model copy, so you can keep dozens of task-specific adapters over one shared base.
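A rough back-of-the-envelope comparison makes the saving concrete. The sketch below assumes 4-byte values and an Adam-style optimizer that keeps two extra values per trainable parameter; the 1B-parameter base and the 0.5% adapter size are hypothetical. It counts only gradients and optimizer state, not the weights themselves or activations.

```python
def training_state_bytes(trainable_params, bytes_per_value=4, optimizer_slots=2):
    """Gradients plus optimizer state for the trainable parameters only."""
    grads = trainable_params * bytes_per_value
    optim = trainable_params * bytes_per_value * optimizer_slots
    return grads + optim

base_params = 1_000_000_000          # hypothetical 1B-parameter model
adapter_params = base_params // 200  # hypothetical adapter at 0.5% of the base

full_gb = training_state_bytes(base_params) / 1e9     # full fine-tuning
lora_gb = training_state_bytes(adapter_params) / 1e9  # LoRA adapter only

print(f"full fine-tuning: {full_gb:.2f} GB, LoRA: {lora_gb:.2f} GB")
# prints: full fine-tuning: 12.00 GB, LoRA: 0.06 GB
```

Real numbers depend on precision, optimizer, and framework overhead, so treat this as an order-of-magnitude estimate.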
Using the result: keep or merge
At inference you can either load the frozen base plus the adapter (swap adapters to switch tasks), or merge the adapter's learned change back into the base weights to produce a standalone model with no runtime overhead. Merging is common when you'll deploy one task; keeping adapters separate is handy when you serve many.
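Merging works because adding the adapter's output is mathematically the same as adding its matrix product to the base weight. The NumPy sketch below checks that on random data; the shapes and the `scale` factor are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r = 32, 16, 4
scale = 2.0  # assumed adapter scaling factor

W = rng.normal(size=(d_out, d_in))  # frozen base weight
A = rng.normal(size=(r, d_in))      # stand-in for a trained adapter matrix
B = rng.normal(size=(d_out, r))     # stand-in for a trained adapter matrix
x = rng.normal(size=(5, d_in))

# Option 1: keep the adapter separate and add its output at runtime.
y_adapter = x @ W.T + scale * (x @ A.T) @ B.T

# Option 2: fold the learned change into the weight once, then run a plain layer.
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_adapter, y_merged))  # prints: True
```

`W_merged` has the same shape as `W`, which is why the merged model runs like the original, with no extra matrices to apply at inference.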
The trade-off
LoRA trains far fewer parameters, so in principle it is slightly less expressive than full fine-tuning: with very large datasets, or when you are reshaping the model deeply, full fine-tuning can edge ahead. In practice, for the narrow SFT tasks this Academy targets, LoRA matches full fine-tuning closely at a fraction of the cost, which is why it's the default. (You'll also hear of QLoRA, which is LoRA on top of a quantized base and cuts memory even further; that and LoRA's knobs are the next lesson.)
LoRA, full fine-tuning, and the optimizer all consume GPU memory in specific ways. To choose batch sizes and avoid crashes, you need to be able to estimate that memory — which is exactly where we go next.
Key terms
- Full fine-tuning: Updating every parameter; most expressive, but heavy on memory and storage.
- LoRA: Low-Rank Adaptation: freeze the base and train small added low-rank matrices (the adapter).
- Low-rank adapter: The small trainable matrices LoRA adds next to chosen weights; often <1% of parameters.
- Parameter-efficient fine-tuning (PEFT): The family of methods (incl. LoRA) that train a small subset of parameters.
- Adapter merging: Folding a LoRA adapter back into the base weights to produce a standalone model.