What does QLoRA actually train?

The LoRA adapter only — the base is frozen and quantized to 4-bit, so it doesn't take gradient memory at all.

Why call prepare_model_for_kbit_training?

It sets up gradient checkpointing, casts norm layers to a stable dtype, and makes the model compatible with adapter training on a quantized base.

Track 2 · Hands-on · Lesson 11

QLoRA hands-on with bitsandbytes

After this lesson you can load a base model in 4-bit with bitsandbytes, attach a LoRA adapter, and fine-tune a model several times larger than the one Lesson 2.5 used, in the same VRAM, with a small modification to the SFTTrainer pipeline.

Level: intermediate · Read time: ~10 min · Prerequisites: SFT with TRL's SFTTrainer

Lesson 1.14 introduced QLoRA conceptually: train a LoRA adapter on top of a base model that's been quantized to 4 bits, cutting the base's memory ~4× while preserving the adapter's training dynamics. This lesson is the code. We'll step from SmolLM2-135M to a model in the 1–1.5B range and fit it in the VRAM the small model was using.

The setup: bitsandbytes

QLoRA needs one extra library beyond Lesson 2.10's stack: bitsandbytes, which provides the 4-bit quantization kernels Transformers loads through BitsAndBytesConfig.

pip install bitsandbytes

Configure 4-bit loading

Three knobs matter: the quantization scheme (NF4, 4-bit NormalFloat, the QLoRA default), double quantization (quantize the quantization constants too, saving a bit more), and the compute dtype (the dtype the 4-bit weights are dequantized to for each matrix multiply, and therefore the dtype of the activations that come out; use bf16 on modern GPUs).

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct"      # ~1.5B params, Apache-2 license

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",                # the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,    # 4-bit weights are dequantized to bf16 for matmuls
    bnb_4bit_use_double_quant=True,           # quantize the quant constants
)

tok = AutoTokenizer.from_pretrained(MODEL_ID)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb,
    device_map="auto",                        # let accelerate place the layers
)
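To put a number on the double-quantization knob, here is a back-of-the-envelope sketch of how much storage the quantization constants add per parameter. The block sizes (64 weights per constant, 256 constants per second-level block) and the fp32 / 8-bit constant widths are assumptions based on the scheme the QLoRA authors describe; the function name is made up for illustration, and the exact layout inside bitsandbytes may differ.

```python
def constant_overhead_bits(block_size=64, const_bits=32,
                           double_quant=False,
                           dq_const_bits=8, dq_block_size=256):
    """Bits of quantization-constant storage per model parameter (assumed scheme)."""
    if not double_quant:
        # One const_bits-wide scale per block of `block_size` weights.
        return const_bits / block_size
    # First-level scales are themselves stored in dq_const_bits, plus one
    # const_bits-wide second-level scale per `dq_block_size` first-level scales.
    return dq_const_bits / block_size + const_bits / (block_size * dq_block_size)


plain = constant_overhead_bits()
double = constant_overhead_bits(double_quant=True)

n_params = 1.5e9  # rough size of the base model used in this lesson
saved_mb = (plain - double) * n_params / 8 / 1e6

print(f"overhead without double quant: {plain:.3f} bits/param")
print(f"overhead with double quant:    {double:.3f} bits/param")
print(f"saved on a 1.5B model:         ~{saved_mb:.0f} MB")
```

Under these assumptions the saving is tens of MB on a 1.5B model: real, but small next to the 4-bit weights themselves, which is why the lesson describes it as "a bit more".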

Prepare the model for adapter training

One extra call: prepare_model_for_kbit_training from peft. It freezes the base weights, upcasts the layers left in half precision (such as the normalization layers) to fp32 for numerical stability, and enables gradient checkpointing (Lesson 1.15) together with the input-gradient hook that checkpointing needs when the base is frozen. Skipping it is a common cause of QLoRA runs that fail to train.

from peft import LoraConfig, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)

# Wider target_modules — common QLoRA recipe touches all attention + MLP projections
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
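How much does the wider target_modules list actually add? A rough count, as a sketch: each adapted linear layer with shape (d_in, d_out) gets two low-rank matrices totalling r × (d_in + d_out) parameters. The layer dimensions below are assumed values for Qwen2.5-1.5B (check model.config before relying on them), and lora_param_count is a hypothetical helper, not a peft function.

```python
def lora_param_count(shapes, r):
    """Trainable parameters LoRA adds for a dict of {name: (d_in, d_out)} layers."""
    # Each adapted Linear(d_in, d_out) gets A (r x d_in) and B (d_out x r).
    return sum(r * (d_in + d_out) for d_in, d_out in shapes.values())


# Assumed Qwen2.5-1.5B dimensions (not taken from the lesson).
HIDDEN = 1536         # hidden_size
INTERMEDIATE = 8960   # MLP intermediate_size
KV_DIM = 256          # num_key_value_heads * head_dim (grouped-query attention)
N_LAYERS = 28

per_layer_shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

per_layer = lora_param_count(per_layer_shapes, r=16)
total = per_layer * N_LAYERS
print(f"{per_layer:,} per layer, {total:,} total trainable LoRA parameters")
```

With these assumed shapes the three MLP projections account for most of the adapter, so dropping them from target_modules is the quickest way to shrink it if memory is tight.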

Train it — same SFTTrainer as Lesson 2.10

From here on it's the SFT pipeline you know. The only thing the trainer sees that's different is that the base is 4-bit and the adapter sits on top.

from trl import SFTConfig, SFTTrainer
# `ds` from Lesson 2.10's to_chat helper

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="qlora-out",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,        # effective batch 16
        learning_rate=2e-4,
        num_train_epochs=3,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        logging_steps=5,
        save_strategy="epoch",
        max_seq_length=512,                   # named max_length in newer TRL releases
        completion_only_loss=True,
        report_to=[],
    ),
    train_dataset=ds,
    processing_class=tok,
    peft_config=lora,
)
trainer.train()
trainer.save_model("qlora-out/adapter")
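The batch, epoch, and warmup settings above interact, so it can help to work out the schedule before launching. The sketch below is plain arithmetic, not a TRL call: the helper name and the 1,000-example dataset size are made up, and the exact rounding the trainer applies to a partial final accumulation step can vary between library versions.

```python
import math


def schedule(n_examples, per_device_batch=4, grad_accum=4,
             epochs=3, warmup_ratio=0.03, n_devices=1):
    """Rough optimizer-step plan for the SFTConfig values used above."""
    effective_batch = per_device_batch * grad_accum * n_devices
    # Assumption: a partial final batch still counts as one update.
    updates_per_epoch = math.ceil(n_examples / effective_batch)
    total_updates = updates_per_epoch * epochs
    warmup_updates = math.ceil(total_updates * warmup_ratio)
    return effective_batch, total_updates, warmup_updates


effective_batch, total_updates, warmup_updates = schedule(1000)
print(f"effective batch {effective_batch}, "
      f"{total_updates} optimizer steps, {warmup_updates} warmup steps")
```

On a small dataset the warmup is only a handful of steps, so with logging_steps=5 the first logged loss or two may still fall inside the warmup.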

Why this fits — the memory math made concrete

From Lesson 1.15: the dominant memory cost during training is gradients + optimizer states on the trainable parameters, plus the model weights themselves. QLoRA changes both halves:

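The 1.5B base in 4-bit takes about 1.5e9 × 0.5 bytes ≈ 0.75 GB, plus a little extra for the quantization constants and for layers bitsandbytes leaves unquantized (such as the embeddings). Compare to bf16 (3 GB) or fp32 (6 GB).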
The base is frozen — no gradient memory, no optimizer state for it.
The LoRA adapter trains; for the wider target_modules above at r=16 on Qwen2.5-1.5B, that's on the order of 18 million parameters, and adapter weights plus gradients plus AdamW state come to a few hundred MB.
Activations at sequence length 512, batch 4, with gradient checkpointing on: a few GB. (Training doesn't use a KV cache; that is an inference-time cost.)

Total: comfortably under 12 GB. A model that would have needed ~24 GB to LoRA-fine-tune at full precision fits in the budget of a consumer GPU.
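The budget above can be reproduced with a few lines of arithmetic. This is a sketch with stated assumptions: the adapter term depends heavily on rank, target modules, and dtype, and the trainable-parameter count used here is an assumed figure for r=16 on all seven projections of a 1.5B model. It also assumes fp32 adapter weights and gradients plus two fp32 AdamW moments per parameter; halve the weight and gradient terms if yours are kept in bf16. Activations are left out because they depend on the model and on checkpointing.

```python
GB = 1e9


def weights_gb(n_params, bits_per_param):
    """Memory for the base weights alone at a given precision."""
    return n_params * bits_per_param / 8 / GB


def adapter_training_gb(n_trainable, weight_bytes=4, grad_bytes=4, adam_bytes=8):
    """Adapter weights + gradients + AdamW moments (assumed byte widths)."""
    return n_trainable * (weight_bytes + grad_bytes + adam_bytes) / GB


base_params = 1.5e9
for label, bits in [("4-bit", 4), ("bf16", 16), ("fp32", 32)]:
    print(f"base in {label}: {weights_gb(base_params, bits):.2f} GB")

n_trainable = 18_464_768  # assumed adapter size, not measured
adapter_gb = adapter_training_gb(n_trainable)
print(f"adapter + grads + AdamW: {adapter_gb:.2f} GB")
```

Either way, the adapter's training state is small next to the base weights and activations, which is the point of freezing the base.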

Honest beat — 4-bit isn't free

QLoRA usually loses ~1–2 points on whatever metric you're tracking compared to LoRA on a full-precision base of the same size, and sometimes more on the hardest examples. Re-evaluate after a QLoRA run; don't assume the metric you measured on the bf16 base will hold. The right framing: QLoRA lets you fine-tune a larger model in the same VRAM. The trade isn't "free 4×" but "lower-precision 1.5B vs full-precision 350M, whichever wins on your eval."

Inference notes

For inference you have two choices, mirroring Lesson 2.8:

Load 4-bit at inference too: keep the quantization config, load the base in 4-bit, and attach the adapter with PeftModel.from_pretrained. Same memory budget as training.
Merge and dequantize: for production serving, reload the base in bf16 (or full precision) and merge the adapter into that, because merging into a 4-bit base isn't supported cleanly. You lose the memory win. The honest options for serving are to keep base + adapter separate, or to merge in bf16 and then convert the merged model to GGUF/AWQ for deployment (Track 4.6).
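Both inference paths can be sketched in a few lines. The two function names are hypothetical, and the adapter directory is assumed to be the one written by trainer.save_model above; the library calls are the standard Transformers and peft loaders, imported inside the functions so that defining them needs neither a GPU nor a model download.

```python
def load_4bit_with_adapter(base_id, adapter_dir):
    """Option 1: keep the base in 4-bit and attach the trained adapter."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    base = AutoModelForCausalLM.from_pretrained(
        base_id, quantization_config=bnb, device_map="auto"
    )
    return PeftModel.from_pretrained(base, adapter_dir)


def merge_into_bf16(base_id, adapter_dir, out_dir):
    """Option 2: reload the base unquantized in bf16, merge, and save."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir)  # plain checkpoint, no peft needed to load it
    return merged


# Example call (downloads the base model, so it is left commented out):
# model = load_4bit_with_adapter("Qwen/Qwen2.5-1.5B-Instruct", "qlora-out/adapter")
```

Whichever path you pick, rerun your eval on the model you will actually serve: the adapter was trained against the 4-bit base, so the merged bf16 model is close to, but not identical to, what training saw.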

Key idea

QLoRA is SFTTrainer with two changes: load the base via BitsAndBytesConfig (NF4 + double quant + bf16 compute), and call prepare_model_for_kbit_training before attaching LoRA. That's it. The win is fitting a model 4–10× larger in the same VRAM. The cost is a small quality drop that you have to re-measure on your own eval.

You can now train at scales the small model can't reach. Next: graduate Lesson 2.7's hand-rolled accuracy into real, per-class metrics with sklearn and HF evaluate.

Key terms

QLoRA: A 4-bit quantized frozen base with a LoRA adapter trained on top; cuts the base weights' memory ~4× vs bf16 LoRA on the same model.
bitsandbytes: Library providing 4-bit/8-bit quantization kernels HF Transformers loads via BitsAndBytesConfig.
BitsAndBytesConfig: HF config object that tells from_pretrained to load weights quantized.
NF4 (4-bit Normal Float): The QLoRA default quantization format; tuned for the weight distribution of trained Transformers.
Double quantization: Quantizing the quantization constants themselves to save a bit more memory.
prepare_model_for_kbit_training: peft helper that enables gradient checkpointing and stable-dtype casts before LoRA is attached to a quantized base.
