Track 4 · Advanced · Lesson 6

Quantization & compression: Q4_K_M, AWQ, GPTQ, GGUF, ONNX

After this lesson you can explain what quantization trades away, distinguish the common methods (Q4_K_M, AWQ, GPTQ) and formats (GGUF, ONNX, safetensors), and pick a quantization target for a deployment constraint.

Level: advanced · Read time: ~11 min · Prerequisites: Preference tuning: DPO and ORPO

A model's weights are numbers. Storing each in 16 bits (bf16) is the training default; storing them in 4 bits makes the weights roughly 4× smaller (slightly less in practice, because the quantization scales are stored alongside them) and often faster to run, at a small quality cost. That's quantization, and it's how a model that needed a server ends up running on a laptop.
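The size claim is plain arithmetic, and a short sketch makes it concrete. The 7B parameter count, the group size of 32, and the 16-bit scale are illustrative assumptions, not figures from this lesson or from any specific method.

```python
def weights_size_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def effective_bits(bits, group_size, scale_bits=16):
    """Bits per weight once one scale per group is stored with the weights."""
    return bits + scale_bits / group_size


n_params = 7_000_000_000  # hypothetical 7B-parameter model

bf16_gb = weights_size_gb(n_params, 16)                         # 14.0
ideal_4bit_gb = weights_size_gb(n_params, 4)                    # 3.5
grouped_gb = weights_size_gb(n_params, effective_bits(4, 32))   # 3.9375

print(f"bf16: {bf16_gb:.2f} GB")
print(f"4-bit, no overhead: {ideal_4bit_gb:.2f} GB")
print(f"4-bit + one 16-bit scale per 32 weights: {grouped_gb:.2f} GB")
```

Under these assumptions the per-group scales push the effective cost to 4.5 bits per weight, so the real shrink is closer to 3.6× than a clean 4×.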

Post-training quantization

You already met QLoRA (Track 1) — training on a quantized base. Here we quantize after training: take the finished model and convert its weights to a lower-bit format. No retraining, just a conversion. The art is doing it without wrecking quality — naive rounding loses too much, so the methods below are smarter about which values get how many bits.
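To see why naive rounding loses too much, here is a minimal sketch comparing one scale for a whole tensor against one scale per small group of weights. It uses plain round-to-nearest and is not an implementation of Q4_K_M, AWQ, or GPTQ. The synthetic weights, the single outlier, and the group size of 32 are all assumptions chosen for illustration.

```python
import random


def quantize_absmax(values, bits=4):
    """Symmetric round-to-nearest with a single scale; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    return [round(v / scale) * scale for v in values]


def quantize_grouped(values, bits=4, group_size=32):
    """Same rounding, but each group of weights gets its own scale."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(quantize_absmax(values[start:start + group_size], bits))
    return out


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
weights[5] = 1.0  # a single outlier stretches the tensor-wide scale

naive_err = mse(weights, quantize_absmax(weights))
grouped_err = mse(weights, quantize_grouped(weights))
print(f"one scale for everything: {naive_err:.2e}")
print(f"one scale per 32 weights: {grouped_err:.2e}")
```

With a single scale, the outlier forces a step size so coarse that most small weights round to zero. Per-group scales confine that damage to the one group containing the outlier, which is the simplest version of being smarter about which values get how much precision.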

The methods

The formats (the container, not the method)

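Don't confuse the quantization method with the file format: the method (Q4_K_M, AWQ, GPTQ) is how the weights are reduced to fewer bits, and the format (GGUF, ONNX, safetensors) is the container the artifact ships in.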

On the platform

This is Track 3's Export stage (10), with quantization as a first-class choice. The export emits the artifact plus a smoke-check trace, and the variant is tracked so the Compression page knows what exists:

export → format: gguf,  quant: Q4_K_M        # CPU / edge
       → format: safetensors                  # full precision, GPU
       → format: onnx                          # cross-runtime
# tracked variants: Q4_K_M | AWQ | GPTQ   (the Compression page lists them)
# RunEvent: export (info) — or export_quantization_failed on the named failure

Always re-evaluate after quantizing

Quantization changes the model's numbers, so it can change its outputs. Re-run your gold set on the quantized artifact — a 4-bit model that's 0.5% worse may be a great trade, but you only know if you measure. The export's smoke check is a floor, not a substitute for your eval.
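A re-evaluation can be as simple as scoring both artifacts on the same gold set and checking the drop against a budget. The sketch below assumes exact-match scoring and takes the two models as plain callables. The names `predict_full` and `predict_quant` and the `max_drop` threshold are hypothetical and not part of any platform API.

```python
def accuracy(predict, gold_set):
    """Fraction of gold examples where the prediction matches exactly."""
    correct = sum(1 for prompt, expected in gold_set if predict(prompt) == expected)
    return correct / len(gold_set)


def quantization_regression(predict_full, predict_quant, gold_set, max_drop=0.01):
    """Compare the full-precision and quantized models on the same gold set."""
    baseline = accuracy(predict_full, gold_set)
    quantized = accuracy(predict_quant, gold_set)
    drop = baseline - quantized
    return {
        "baseline": baseline,
        "quantized": quantized,
        "drop": drop,
        "ok": drop <= max_drop,  # True when the quality loss fits the budget
    }


# Stand-in models for illustration; real ones would call each artifact.
gold = [("2+2", "4"), ("capital of France", "Paris"), ("3*3", "9"), ("1+1", "2")]
answers = dict(gold)


def predict_full(prompt):
    return answers[prompt]


def predict_quant(prompt):
    return "8" if prompt == "3*3" else answers[prompt]


report = quantization_regression(predict_full, predict_quant, gold)
print(report)
```

Exact match is only one possible metric. The point is that both artifacts face the same examples and the acceptable drop is decided before looking at the numbers.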

Picking a target
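As a rough sketch, the pairings this lesson already states (GGUF with Q4_K_M for CPU and edge, safetensors for full-precision GPU, ONNX for cross-runtime use, AWQ and GPTQ as GPU-oriented methods) can be written as a lookup. The target keys and the `pick_export` helper are hypothetical names for illustration, not platform identifiers.

```python
# Pairings taken from the export example and key terms in this lesson.
# The dictionary keys are hypothetical labels, not platform identifiers.
EXPORT_BY_TARGET = {
    "cpu_edge": {"format": "gguf", "quant": "Q4_K_M"},
    "gpu_full_precision": {"format": "safetensors", "quant": None},
    "cross_runtime": {"format": "onnx", "quant": None},
}

# Methods the lesson describes as GPU-oriented.
GPU_QUANT_METHODS = ("AWQ", "GPTQ")


def pick_export(target):
    """Return the format/quant pairing for a deployment target label."""
    try:
        return EXPORT_BY_TARGET[target]
    except KeyError:
        known = ", ".join(sorted(EXPORT_BY_TARGET))
        raise ValueError(f"unknown target {target!r}; expected one of: {known}")


print(pick_export("cpu_edge"))
```

Whatever the lookup returns is only a starting point: the chosen variant still has to pass the gold-set re-evaluation before it ships.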

Key idea

Quantization shrinks a trained model by storing weights in fewer bits. Method (Q4_K_M / AWQ / GPTQ) ≠ format (GGUF / ONNX / safetensors). Pick by deployment target, and always re-evaluate — the size win is only real if quality holds.

Key terms

quantization
Storing model weights in fewer bits (e.g. 4) to shrink the model and speed up inference.
post-training quantization
Quantizing a finished model by conversion, with no retraining.
Q4_K_M
A 4-bit mixed-precision k-quant (llama.cpp/GGUF); the common CPU/edge sweet spot.
AWQ
Activation-aware Weight Quantization; protects activation-critical weights, GPU-oriented.
GPTQ
One-shot, layer-wise quantization minimizing introduced error; GPU-oriented.
GGUF / ONNX
Container formats — GGUF for llama.cpp/edge, ONNX for cross-runtime portability.
Compression page
The BrewSLM surface tracking which quantized variants of a model exist.
