Quantization & compression: Q4_K_M, AWQ, GPTQ, GGUF, ONNX
After this lesson you can explain what quantization trades away, distinguish the common methods (Q4_K_M, AWQ, GPTQ) and formats (GGUF, ONNX, safetensors), and pick a quantization target for a deployment constraint.
A model's weights are numbers. Storing each in 16 bits (bf16) is the training default; storing them in 4 bits makes the model ~4× smaller and often faster, for a small quality cost. That's quantization, and it's how a model that needed a server ends up running on a laptop.
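The ~4× figure follows from bits-per-weight arithmetic. A minimal sketch, assuming a hypothetical 7-billion-parameter model and counting only the weights themselves (real quantized files also store scales and metadata, so they come out somewhat larger than the raw 4-bit figure):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n_params = 7e9  # hypothetical 7B-parameter model, for illustration only
bf16_gb = model_size_gb(n_params, 16)
int4_gb = model_size_gb(n_params, 4)

print(f"bf16: {bf16_gb:.1f} GB, 4-bit: {int4_gb:.1f} GB, ratio: {bf16_gb / int4_gb:.0f}x")
```

At these numbers the bf16 weights need about 14 GB and the 4-bit weights about 3.5 GB, which is the difference between needing a server GPU and fitting in laptop memory.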
Post-training quantization
You already met QLoRA (Track 1) — training on a quantized base. Here we quantize after training: take the finished model and convert its weights to a lower-bit format. No retraining, just a conversion. The art is doing it without wrecking quality — naive rounding loses too much, so the methods below are smarter about which values get how many bits.
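To see why naive rounding loses too much, here is a simplified sketch in plain Python. It is not the actual Q4_K_M, AWQ, or GPTQ algorithm; it only compares one shared scale for the whole tensor against one scale per small block. The block size of 32, the weight distribution, and the single outlier are assumptions chosen for illustration.

```python
import random

def quantize_dequantize(values, bits=4):
    """Symmetric round-to-nearest with one shared scale; returns reconstructed values."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    scale = (max(abs(v) for v in values) / qmax) or 1.0
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def blockwise(values, block=32, bits=4):
    """Quantize each block of `block` values with its own scale."""
    out = []
    for i in range(0, len(values), block):
        out.extend(quantize_dequantize(values[i:i + block], bits))
    return out

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]
weights[5] = 1.0  # a single outlier stretches the shared scale

naive_err = mse(weights, quantize_dequantize(weights))
block_err = mse(weights, blockwise(weights))
print(f"per-tensor MSE: {naive_err:.2e}, per-block MSE: {block_err:.2e}")
```

With one shared scale, the outlier forces a step size so coarse that almost every small weight rounds to zero. Per-block scales confine that damage to the block containing the outlier, which is the general intuition behind spending precision where it matters.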
The methods
- Q4_K_M — a 4-bit "k-quant" from the llama.cpp family. The K marks the k-quant scheme, which stores weights in small blocks that each carry their own scale; M is the medium-size variant, which mixes precision within the model (more bits for the sensitive layers). The default sweet spot for CPU/edge: ~4× smaller, small quality loss.
- AWQ (Activation-aware Weight Quantization) — protects the weights that matter most to activations; strong for GPU inference.
- GPTQ — a one-shot, layer-by-layer quantization that uses a small calibration set to minimize the output error each layer's rounding introduces; also GPU-oriented.
The formats (the container, not the method)
Don't confuse the quantization method with the file format:
- GGUF — the llama.cpp format; the home of Q4_K_M and friends; great for CPU/edge.
- safetensors / HF — full- or half-precision weights for the Transformers/vLLM stack on GPU.
- ONNX — a cross-runtime graph format for portable inference across engines and hardware.
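Because the format is only the container, it can often be recognized from a file's first bytes without knowing anything about the quantization inside. A minimal sketch, assuming the publicly documented layouts: GGUF files start with the ASCII magic `GGUF`, and safetensors files start with an 8-byte little-endian header length followed by a JSON header. The function name `sniff_container` is hypothetical.

```python
import struct

def sniff_container(head: bytes) -> str:
    """Guess the container format from the first bytes of a model file."""
    if head[:4] == b"GGUF":
        return "gguf"
    if len(head) >= 9:
        (header_len,) = struct.unpack("<Q", head[:8])  # u64 length of the JSON header
        if header_len > 0 and head[8:9] == b"{":
            return "safetensors"
    # ONNX is a protobuf with no fixed magic bytes; fall back to the file extension.
    return "unknown"

print(sniff_container(b"GGUF\x03\x00\x00\x00"))
print(sniff_container(struct.pack("<Q", 2) + b"{}"))
```

Neither check says which method produced the weights; that information lives in the file's metadata, which is the method-versus-format distinction in practice.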
On the platform
This is Track 3's Export stage (10), with quantization as a first-class choice. The export emits the artifact plus a smoke-check trace, and the variant is tracked so the Compression page knows what exists:
export → format: gguf, quant: Q4_K_M   # CPU / edge
       → format: safetensors           # full precision, GPU
       → format: onnx                  # cross-runtime
# tracked variants: Q4_K_M | AWQ | GPTQ (the Compression page lists them)
# RunEvent: export (info) — or export_quantization_failed on the named failure
Always re-evaluate after quantizing
Quantization changes the model's numbers, so it can change its outputs. Re-run your gold set on the quantized artifact — a 4-bit model that's 0.5% worse may be a great trade, but you only know if you measure. The export's smoke check is a floor, not a substitute for your eval.
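A re-evaluation gate can be as small as the sketch below. It assumes an exact-match metric and treats each model as a plain callable from prompt to answer; the function names, the toy gold set, and the 0.5% tolerance are illustrative stand-ins, not the platform's eval API.

```python
from typing import Callable, Sequence, Tuple

GoldSet = Sequence[Tuple[str, str]]  # (prompt, expected answer) pairs

def accuracy(predict: Callable[[str], str], gold: GoldSet) -> float:
    """Exact-match accuracy of a model callable on the gold set."""
    hits = sum(1 for prompt, expected in gold if predict(prompt).strip() == expected)
    return hits / len(gold)

def quantization_holds(baseline, quantized, gold: GoldSet, max_drop: float = 0.005) -> bool:
    """True if the quantized model loses at most max_drop accuracy vs. the baseline."""
    return accuracy(baseline, gold) - accuracy(quantized, gold) <= max_drop

# Toy stand-ins for real inference calls (hypothetical).
gold = [("2+2", "4"), ("3*3", "9"), ("10-7", "3"), ("8/2", "4")]
answers = dict(gold)
baseline = lambda prompt: answers[prompt]
quantized = lambda prompt: "5" if prompt == "2+2" else answers[prompt]

print(accuracy(baseline, gold), accuracy(quantized, gold))
print(quantization_holds(baseline, quantized, gold))
```

In this toy run the quantized stand-in misses one of four answers, so the gate rejects it. On a real gold set, pick the metric and tolerance that match what the deployment can afford to lose.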
Picking a target
- Laptop / phone / CPU → GGUF + Q4_K_M.
- GPU server, throughput-first → AWQ or GPTQ in the vLLM stack.
- Portability across runtimes → ONNX.
- Quality-critical, room to spare → keep safetensors at bf16.
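The decision list above can be written down as a small lookup table. This is a sketch only: the constraint keys and the `ExportTarget` type are hypothetical names, and pairing AWQ/GPTQ with a safetensors container is an assumption (the lesson names the vLLM stack, not a file format).

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class ExportTarget:
    format: str                   # container format
    quant: Tuple[str, ...] = ()   # candidate quantization methods; empty = unquantized

TARGETS = {
    "cpu_edge": ExportTarget("gguf", ("Q4_K_M",)),
    # Assumption: AWQ/GPTQ checkpoints shipped as safetensors for the vLLM stack.
    "gpu_throughput": ExportTarget("safetensors", ("AWQ", "GPTQ")),
    "portable": ExportTarget("onnx"),
    "quality_critical": ExportTarget("safetensors"),  # keep bf16 weights
}

def pick_target(constraint: str) -> ExportTarget:
    """Map a deployment constraint to an export target, failing loudly on unknown keys."""
    try:
        return TARGETS[constraint]
    except KeyError:
        raise ValueError(f"unknown constraint: {constraint!r}") from None

print(pick_target("cpu_edge"))
```

Keeping format and method as separate fields mirrors the lesson's distinction: the same container can hold quantized or unquantized weights.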
Key idea
Quantization shrinks a trained model by storing weights in fewer bits. Method (Q4_K_M / AWQ / GPTQ) ≠ format (GGUF / ONNX / safetensors). Pick by deployment target, and always re-evaluate — the size win is only real if quality holds.
Key terms
- quantization: Storing model weights in fewer bits (e.g. 4) to shrink the model and speed up inference.
- post-training quantization: Quantizing a finished model by conversion, with no retraining.
- Q4_K_M: A 4-bit mixed-precision k-quant (llama.cpp/GGUF); the common CPU/edge sweet spot.
- AWQ: Activation-aware Weight Quantization; protects activation-critical weights, GPU-oriented.
- GPTQ: One-shot, layer-wise quantization minimizing introduced error; GPU-oriented.
- GGUF / ONNX: Container formats. GGUF serves llama.cpp/edge; ONNX serves cross-runtime portability.
- Compression page: The BrewSLM surface tracking which quantized variants of a model exist.