QLoRA hands-on with bitsandbytes
After this lesson you can load a base model in 4-bit with bitsandbytes, attach a LoRA adapter, and fine-tune a model several times larger than the one Lesson 2.5 used, in the same VRAM, using a small modification to the SFTTrainer pipeline.
Lesson 1.14 introduced QLoRA conceptually: train a LoRA adapter on top of a base model that's been quantized to 4 bits, cutting the base's memory ~4× while preserving the adapter's training dynamics. This lesson is the code. We'll step from SmolLM2-135M to a model in the 1–1.5B range and fit it in the VRAM the small model was using.
The setup: bitsandbytes
QLoRA needs one extra library beyond Lesson 2.10's stack: bitsandbytes, which provides the 4-bit quantization kernels Transformers loads through BitsAndBytesConfig.
pip install bitsandbytes
Configure 4-bit loading
Three knobs matter: the quantization scheme (NF4, 4-bit NormalFloat, the QLoRA default), double quantization (quantize the quantization constants too, saving a bit more), and the compute dtype (the dtype the 4-bit weights are dequantized to for each matrix multiply, and therefore the dtype of the activations and gradients flowing through them; bf16 on modern GPUs).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
MODEL_ID = "Qwen/Qwen2.5-1.5B-Instruct" # ~1.5B params, Apache-2 license
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # the QLoRA default
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    bnb_4bit_use_double_quant=True,         # quantize the quant constants
)
tok = AutoTokenizer.from_pretrained(MODEL_ID)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # fall back to EOS when no pad token is defined
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=bnb,
    device_map="auto",  # let accelerate place the layers
)
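To see what the double-quantization knob buys, the per-parameter storage cost can be worked out by hand. The sketch below assumes the block sizes reported for QLoRA (one fp32 scaling constant per 64 weights; with double quantization, an 8-bit constant per 64 weights plus one fp32 constant per 256 of those). The function name and its defaults are illustrative, not part of bitsandbytes, and the library's actual defaults may differ by version.

```python
def bits_per_param(double_quant: bool, block_size: int = 64, dq_block_size: int = 256) -> float:
    """Approximate storage cost of one 4-bit weight, including its share of the constants."""
    weight_bits = 4.0
    if not double_quant:
        # One fp32 scaling constant is shared by `block_size` weights.
        return weight_bits + 32 / block_size
    # An 8-bit constant per block, plus one fp32 constant per `dq_block_size` blocks.
    return weight_bits + 8 / block_size + 32 / (block_size * dq_block_size)


plain = bits_per_param(double_quant=False)
double = bits_per_param(double_quant=True)
print(f"without double quant: {plain:.3f} bits/param")
print(f"with double quant:    {double:.3f} bits/param")
# Upper bound: assumes every one of the 1.5B weights is quantized.
print(f"saved on 1.5B weights: {(plain - double) * 1.5e9 / 8 / 1e6:.0f} MB")
```

Under these assumptions the saving is a bit over a third of a bit per parameter, which is small per weight but adds up to tens of MB on a 1.5B model.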
Prepare the model for adapter training
One required call: prepare_model_for_kbit_training from peft. It freezes the base weights, upcasts the remaining non-quantized half-precision parameters (such as the LayerNorm and embedding layers) to fp32 for numerical stability, and enables gradient checkpointing (Lesson 1.15) together with input gradients so the checkpointed layers can backpropagate into the adapter. Forgetting this is the most common QLoRA-doesn't-train bug.
from peft import LoraConfig, prepare_model_for_kbit_training
model = prepare_model_for_kbit_training(model)
# Wider target_modules: the common QLoRA recipe touches all attention + MLP projections
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
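How many parameters does this adapter add? Each adapted linear layer of shape (d_in, d_out) gets two low-rank matrices, for r × (d_in + d_out) extra weights. The sketch below counts them for the seven target modules. The layer shapes and layer count are assumptions about the 1.5B model's configuration (hidden size 1536, key/value width 256, MLP width 8960, 28 layers); check them against model.config before relying on the number. The helper name is hypothetical.

```python
# ASSUMED (d_in, d_out) shapes per decoder layer; verify against model.config.
ASSUMED_SHAPES = {
    "q_proj": (1536, 1536),
    "k_proj": (1536, 256),
    "v_proj": (1536, 256),
    "o_proj": (1536, 1536),
    "gate_proj": (1536, 8960),
    "up_proj": (1536, 8960),
    "down_proj": (8960, 1536),
}
ASSUMED_NUM_LAYERS = 28


def lora_param_count(shapes: dict, r: int, num_layers: int) -> int:
    """LoRA adds A (r x d_in) and B (d_out x r) to each targeted linear layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


n_adapter = lora_param_count(ASSUMED_SHAPES, r=16, num_layers=ASSUMED_NUM_LAYERS)
print(f"adapter parameters at r=16: {n_adapter:,}")
```

Once the adapter is attached, the PEFT-wrapped model's print_trainable_parameters() method reports the real count, which is the number to trust over this estimate.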
Train it — same SFTTrainer as Lesson 2.10
From here on it's the SFT pipeline you know. The only thing the trainer sees that's different is that the base is 4-bit and the adapter sits on top.
from trl import SFTConfig, SFTTrainer
# `ds` from Lesson 2.10's to_chat helper
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="qlora-out",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch 16
        learning_rate=2e-4,
        num_train_epochs=3,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        logging_steps=5,
        save_strategy="epoch",
        max_length=512,  # called max_seq_length in older TRL releases
        completion_only_loss=True,
        report_to=[],
    ),
    train_dataset=ds,
    processing_class=tok,
    peft_config=lora,
)
trainer.train()
trainer.save_model("qlora-out/adapter")
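Since a run where nothing is trainable is the failure mode to watch for, a quick check on trainer.model before calling trainer.train() can save a wasted run. The helper below is a minimal sketch with hypothetical names; it only relies on the model exposing named_parameters(), with each parameter having dtype, requires_grad, and numel(). Packed 4-bit weights may report as an integer dtype with a reduced element count, so treat the fraction as approximate.

```python
from collections import Counter


def summarize_parameters(model) -> Counter:
    """Count parameter elements grouped by (dtype, trainable)."""
    counts = Counter()
    for _name, param in model.named_parameters():
        counts[(str(param.dtype), bool(param.requires_grad))] += param.numel()
    return counts


def trainable_fraction(model) -> float:
    """Share of parameter elements that will receive gradients."""
    counts = summarize_parameters(model)
    total = sum(counts.values())
    trainable = sum(n for (_dtype, requires_grad), n in counts.items() if requires_grad)
    return trainable / total if total else 0.0
```

A call such as trainable_fraction(trainer.model) should return a small but non-zero number; exactly zero suggests the adapter was never attached or its weights were frozen.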
Why this fits — the memory math made concrete
From Lesson 1.15: the dominant memory cost during training is gradients + optimizer states on the trainable parameters, plus the model weights themselves. QLoRA changes both halves:
- The 1.5B base in 4-bit takes 1.5e9 × 0.5 bytes ≈ 0.75 GB. Compare to bf16 (3 GB) or fp32 (6 GB).
- The base is frozen: no gradient memory, no optimizer state for it.
- The LoRA adapter trains; for the wider target_modules above on a 1.5B model, that's on the order of ten to twenty million parameters at r=16, so adapter weights, gradients, and AdamW state together come to a few hundred MB at most.
- Activations at sequence length 512, batch 4: a few GB. (Training keeps no KV cache; that cost only appears at generation time.)
Total: comfortably under 12 GB. A model that would have needed ~24 GB to LoRA-fine-tune at full precision fits in the budget of a consumer GPU.
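The arithmetic above can be reproduced in a few lines. This is a back-of-the-envelope sketch with hypothetical helper names: it uses decimal gigabytes, ignores quantization constants and any layers left unquantized, and assumes the adapter keeps fp32 weights, fp32 gradients, and two fp32 AdamW moments (16 bytes per trainable parameter). Activations are not modeled.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the model weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


def adapter_training_mb(n_adapter_params: int, bytes_per_param: int = 16) -> float:
    """Adapter weights + gradients + AdamW moments, in decimal MB.

    ASSUMPTION: 4 (fp32 weight) + 4 (fp32 grad) + 8 (two fp32 moments) bytes.
    """
    return n_adapter_params * bytes_per_param / 1e6


BASE_PARAMS = 1.5e9
for label, bits in [("4-bit", 4), ("bf16", 16), ("fp32", 32)]:
    print(f"base in {label}: {weight_gb(BASE_PARAMS, bits):.2f} GB")

# Adapter state scales linearly with the trainable parameter count.
for n_adapter in (5_000_000, 20_000_000):
    print(f"{n_adapter:,} adapter params: {adapter_training_mb(n_adapter):.0f} MB")
```

Either way the adapter's training state stays far below the base weights and the activations, which is why the frozen 4-bit base dominates the savings.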
Honest beat — 4-bit isn't free
QLoRA usually loses ~1–2 points on whatever metric you're tracking compared to LoRA on a full-precision base of the same size, and sometimes more on the hardest examples. Re-evaluate after a QLoRA run; don't assume the metric you measured on the bf16 base will hold. The right framing: QLoRA lets you fine-tune a larger model in the same VRAM, so the trade isn't "free 4×" but "lower-precision 1.5B vs. full-precision 350M, whichever wins on your eval."
Inference notes
For inference you have two choices, mirroring Lesson 2.8:
- Load 4-bit at inference too: keep the quantization config, load the base in 4-bit, and attach the adapter with PeftModel.from_pretrained. The base weights take the same memory as they did in training.
- Merge and dequantize: for production serving, merge the adapter into a bf16 (or full-precision) base; you lose the memory win, but adapter merging into a quantized base isn't supported cleanly. The honest path for serving is to keep base + adapter separate or to convert the merged model to GGUF/AWQ for deployment (Track 4.6).
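The two options could look roughly like the sketch below. The function names are hypothetical, and the imports sit inside the functions so the file can be loaded on a machine without the GPU stack. It assumes the adapter directory saved during training and a recent transformers/peft install; argument names such as torch_dtype may differ across versions.

```python
def load_4bit_with_adapter(model_id: str, adapter_dir: str):
    """Option 1: keep the base in 4-bit and attach the trained adapter on top."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
    )
    base = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_dir)
    return model.eval()  # inference mode: disables dropout


def merge_into_bf16(model_id: str, adapter_dir: str, out_dir: str):
    """Option 2: load an unquantized bf16 base, fold the adapter in, and save."""
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
    merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()
    merged.save_pretrained(out_dir)  # a plain Transformers checkpoint, no PEFT needed
    return merged
```

With the names used in this lesson, the first option would be called as load_4bit_with_adapter("Qwen/Qwen2.5-1.5B-Instruct", "qlora-out/adapter"). Note that the merged model was trained against 4-bit weights but is served on bf16 ones, so it deserves its own evaluation pass.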
Key idea
QLoRA is SFTTrainer with two changes: load the base via BitsAndBytesConfig (NF4 + double quant + bf16 compute), and call prepare_model_for_kbit_training before attaching LoRA. That's it. The win is fitting a model 4–10× larger in the same VRAM. The cost is a small re-evaluable quality drop.
You can now train at scales the small model can't reach. Next: graduate Lesson 2.7's hand-rolled accuracy into real, per-class metrics with sklearn and HF evaluate.
Key terms
- QLoRA
- A 4-bit quantized frozen base with a LoRA adapter trained on top; cuts base-weight memory ~4× vs. LoRA on a bf16 copy of the same model.
- bitsandbytes
- Library providing 4-bit/8-bit quantization kernels HF Transformers loads via BitsAndBytesConfig.
- BitsAndBytesConfig
- HF config object that tells from_pretrained to load weights quantized.
- NF4 (4-bit Normal Float)
- The QLoRA default quantization format; tuned for the weight distribution of trained Transformers.
- Double quantization
- Quantizing the quantization constants themselves to save a bit more memory.
- prepare_model_for_kbit_training
- peft helper that enables gradient checkpointing and stable-dtype casts before LoRA is attached to a quantized base.