What does quantization trade?

A small quality cost for much smaller size and often faster inference

What must you do after quantizing?

Re-run your gold set — quantization can change outputs

Track 4 · Advanced · Lesson 6

Quantization & compression: Q4_K_M, AWQ, GPTQ, GGUF, ONNX

After this lesson you can explain what quantization trades away, distinguish the common methods (Q4_K_M, AWQ, GPTQ) and formats (GGUF, ONNX, safetensors), and pick a quantization target for a deployment constraint.

Level: advanced · Read time: ~11 min · Prerequisites: Preference tuning: DPO and ORPO

A model's weights are numbers. Storing each in 16 bits (bf16) is the training default; storing them in 4 bits makes the model ~4× smaller and often faster, for a small quality cost. That's quantization, and it's how a model that needed a server ends up running on a laptop.
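The size claim is plain arithmetic, and a short sketch makes it concrete. The 7B parameter count is an assumed example, not a figure from this lesson, and the calculation counts weights only (real files also carry scales and metadata, so practical ratios land a little under 4×).

```python
def weight_bytes(n_params: int, bits_per_weight: float) -> float:
    """Bytes needed to store n_params weights at the given bit width."""
    return n_params * bits_per_weight / 8


N_PARAMS = 7_000_000_000  # assumed example: a 7B-parameter model

bf16_gb = weight_bytes(N_PARAMS, 16) / 1e9  # 16 bits per weight
int4_gb = weight_bytes(N_PARAMS, 4) / 1e9   # 4 bits per weight
ratio = bf16_gb / int4_gb

print(f"bf16: {bf16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB, ratio: {ratio:.1f}x")
# prints "bf16: 14.0 GB, 4-bit: 3.5 GB, ratio: 4.0x"
```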

Post-training quantization

You already met QLoRA (Track 1) — training on a quantized base. Here we quantize after training: take the finished model and convert its weights to a lower-bit format. No retraining, just a conversion. The art is doing it without wrecking quality — naive rounding loses too much, so the methods below are smarter about which values get how many bits.
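To see why naive rounding loses too much, here is a toy sketch in plain Python. It is not Q4_K_M, AWQ, or GPTQ; it only contrasts round-to-nearest with one scale for the whole tensor against the same rounding with one scale per block of 32 weights. The weight distribution, the outlier, and the block size are all assumptions chosen for illustration.

```python
import random


def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest with ONE scale shared by all values."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = max(abs(v) for v in values) / qmax or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def blockwise(values, block=32, bits=4):
    """Same rounding, but each block of `block` values gets its own scale."""
    out = []
    for i in range(0, len(values), block):
        out.extend(quantize_dequantize(values[i:i + block], bits))
    return out


def mse(a, b):
    """Mean squared error between two equal-length lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(1024)]  # small, typical weights
weights[5] = 1.0                                        # one large outlier

err_global = mse(weights, quantize_dequantize(weights))
err_block = mse(weights, blockwise(weights))
print(f"one global scale: {err_global:.2e}   per-block scales: {err_block:.2e}")
```

With a single scale, the outlier stretches the grid so far that most small weights round to zero. Per-block scales confine that damage to the one block containing the outlier.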

The methods

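Q4_K_M — a 4-bit "k-quant" from the llama.cpp family. The K marks the k-quant scheme, which quantizes weights in small blocks that each carry their own scale; M is the medium variant, which keeps a few sensitive tensors at higher precision. The default sweet spot for CPU/edge: roughly 3 to 4× smaller than bf16, for a small quality loss.
AWQ (Activation-aware Weight Quantization) — uses activation statistics from a small calibration set to find the weight channels that matter most, then rescales them so they lose less precision at 4 bits; strong for GPU inference.
GPTQ — a one-shot, layer-by-layer quantization that uses calibration data to adjust the remaining weights as it rounds, minimizing the output error each layer introduces; also GPU-oriented.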
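Block-wise methods store a little metadata next to the 4-bit values, so a "4-bit" format costs slightly more than 4 bits per weight. The sketch below works that out for one layout. The numbers are an assumption based on how llama.cpp's k-quant notes describe Q4_K (super-blocks of 256 weights in 8 blocks of 32, a 6-bit scale and 6-bit minimum per block, two 16-bit values per super-block); check the llama.cpp source before relying on them.

```python
def bits_per_weight(n_weights, weight_bits, n_blocks,
                    per_block_meta_bits, per_superblock_meta_bits):
    """Effective bits per weight once block metadata is included."""
    total_bits = (n_weights * weight_bits
                  + n_blocks * per_block_meta_bits
                  + per_superblock_meta_bits)
    return total_bits / n_weights


# Assumed Q4_K layout (see lead-in): 256 weights, 8 blocks,
# 6-bit scale + 6-bit min per block, two fp16 values per super-block.
q4_k_bpw = bits_per_weight(256, 4, 8, 6 + 6, 2 * 16)
shrink_vs_bf16 = 16 / q4_k_bpw

print(f"{q4_k_bpw} bits per weight, about {shrink_vs_bf16:.2f}x smaller than bf16")
# prints "4.5 bits per weight, about 3.56x smaller than bf16"
```

A mixed variant that stores some tensors at higher precision would land a little above this figure, which is why real shrink factors are usually quoted as approximate.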

The formats (the container, not the method)

Don't confuse the quantization method with the file format:

GGUF — the llama.cpp format; the home of Q4_K_M and friends; great for CPU/edge.
safetensors / HF — the tensor format of the Transformers/vLLM stack on GPU; usually full- or half-precision weights, and also the usual container for AWQ and GPTQ checkpoints.
ONNX — a cross-runtime graph format for portable inference across engines and hardware.
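Because the format is only the container, the leading bytes of a file can often identify it regardless of the quantization inside. A minimal sketch, assuming the published layouts: GGUF files begin with the ASCII magic `GGUF`, and safetensors files begin with an 8-byte little-endian header length followed by a JSON header. `sniff_format` is a hypothetical helper, not part of any library, and ONNX (a protobuf file with no magic number) simply falls through to "unknown".

```python
import json
import struct


def sniff_format(data: bytes) -> str:
    """Guess a model file's container from its leading bytes (hypothetical helper)."""
    if data[:4] == b"GGUF":
        return "gguf"
    if len(data) >= 8:
        # safetensors: u64 little-endian header size, then that many bytes of JSON
        (header_len,) = struct.unpack("<Q", data[:8])
        header = data[8:8 + header_len]
        if len(header) == header_len and header[:1] == b"{":
            try:
                json.loads(header)
                return "safetensors"
            except ValueError:  # covers invalid JSON and invalid UTF-8
                pass
    return "unknown"


# Tiny synthetic inputs, not real model files.
st_header = json.dumps({"__metadata__": {"format": "pt"}}).encode()
fake_safetensors = struct.pack("<Q", len(st_header)) + st_header
fake_gguf = b"GGUF" + b"\x03\x00\x00\x00"

print(sniff_format(fake_gguf), sniff_format(fake_safetensors))
```

The function needs the whole safetensors header in `data`, so in practice you would read the first 8 bytes, then read exactly that many more.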

On the platform

This is Track 3's Export stage (10), with quantization as a first-class choice. The export emits the artifact plus a smoke-check trace, and the variant is tracked so the Compression page knows what exists:

export → format: gguf,  quant: Q4_K_M        # CPU / edge
       → format: safetensors                  # unquantized (bf16), GPU
       → format: onnx                          # cross-runtime
# tracked variants: Q4_K_M | AWQ | GPTQ   (the Compression page lists them)
# RunEvent: export (info) — or export_quantization_failed on the named failure

Always re-evaluate after quantizing

Quantization changes the model's numbers, so it can change its outputs. Re-run your gold set on the quantized artifact — a 4-bit model that's 0.5% worse may be a great trade, but you only know if you measure. The export's smoke check is a floor, not a substitute for your eval.
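One way to turn "measure" into a gate is to score both artifacts on the same gold set and compare. This is a sketch under assumptions, not the platform's evaluator: the exact-match metric, the `max_drop` threshold, and both function names are hypothetical, and the toy "models" are dictionaries standing in for real inference calls.

```python
def accuracy(predict, gold_set):
    """Fraction of gold examples where predict(prompt) equals the expected answer."""
    hits = sum(1 for prompt, expected in gold_set if predict(prompt) == expected)
    return hits / len(gold_set)


def quantization_verdict(baseline_predict, quantized_predict, gold_set, max_drop=0.01):
    """Compare both models on the same gold set; accept if the drop is small enough."""
    base = accuracy(baseline_predict, gold_set)
    quant = accuracy(quantized_predict, gold_set)
    drop = base - quant
    return {"baseline": base, "quantized": quant, "drop": drop, "accept": drop <= max_drop}


# Toy stand-ins for real models: lookup tables keyed by prompt.
gold = [("2+2", "4"), ("capital of France", "Paris"),
        ("3*3", "9"), ("opposite of hot", "cold")]
baseline_answers = dict(gold)
quantized_answers = {**baseline_answers, "3*3": "6"}  # one answer changed by quantization

report = quantization_verdict(baseline_answers.get, quantized_answers.get, gold)
print(report)
```

Here the quantized stand-in gets one of four answers wrong, so the drop is 0.25 and the verdict is to reject. With a real gold set, pick the metric and threshold your use case actually needs.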

Picking a target

Laptop / phone / CPU → GGUF + Q4_K_M.
GPU server, throughput-first → AWQ or GPTQ in the vLLM stack.
Portability across runtimes → ONNX.
Quality-critical, room to spare → keep safetensors at bf16.
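The list above is effectively a lookup table, which can be sketched as one. `ExportTarget`, `pick_target`, and the constraint keys are hypothetical names invented for this example, not a platform API. Pairing AWQ/GPTQ with the safetensors container is also an assumption, and an empty `quant_options` means the lesson names no quantization method for that row.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExportTarget:
    format: str                 # container format
    quant_options: tuple = ()   # acceptable methods; empty = none named


# Hypothetical constraint keys mirroring the four rows of the list.
TARGETS = {
    "cpu_edge": ExportTarget("gguf", ("Q4_K_M",)),
    "gpu_throughput": ExportTarget("safetensors", ("AWQ", "GPTQ")),  # container assumed
    "portable": ExportTarget("onnx"),
    "quality_critical": ExportTarget("safetensors"),                 # keep bf16
}


def pick_target(constraint: str) -> ExportTarget:
    """Return the export target for a deployment constraint, or raise on unknown keys."""
    try:
        return TARGETS[constraint]
    except KeyError:
        known = ", ".join(sorted(TARGETS))
        raise ValueError(f"unknown constraint {constraint!r}; expected one of: {known}") from None


print(pick_target("cpu_edge"))
```

Whichever row you land on, the choice is only provisional until the quantized artifact has been re-evaluated.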

Key idea

Quantization shrinks a trained model by storing weights in fewer bits. Method (Q4_K_M / AWQ / GPTQ) ≠ format (GGUF / ONNX / safetensors). Pick by deployment target, and always re-evaluate — the size win is only real if quality holds.

Key terms

quantization: Storing model weights in fewer bits (e.g. 4) to shrink the model and often speed up inference.
post-training quantization: Quantizing a finished model by conversion, with no retraining.
Q4_K_M: A 4-bit mixed-precision k-quant (llama.cpp/GGUF); the common CPU/edge sweet spot.
AWQ: Activation-aware Weight Quantization; protects activation-critical weights, GPU-oriented.
GPTQ: One-shot, layer-wise quantization minimizing introduced error; GPU-oriented.
GGUF / ONNX: Container formats — GGUF for llama.cpp/edge, ONNX for cross-runtime portability.
Compression page: The BrewSLM surface tracking which quantized variants of a model exist.
