Serving and inference optimization
After this lesson you can explain what an inference server does, why the KV cache and continuous batching matter, the latency-vs-throughput trade-off, and how serving choices connect to the quantization you learned in 4.6.
You exported and deployed a model in Track 3. This lesson is what's happening inside that endpoint — and the knobs that decide whether it costs you a little or a lot to run.
What an inference server does
A naive loop calls model.generate() for one request at a time — fine for a demo, terrible for production. An inference server like vLLM sits in front of the model and turns it into an efficient, concurrent service: it batches requests, manages memory, and streams tokens back. BrewSLM's deploy targets a vLLM endpoint (among others).
The KV cache: don't recompute the past
Generation is autoregressive (Track 0): each new token attends to all previous tokens. Recomputing the attention keys and values for the whole prefix at every step would repeat work that grows quadratically with sequence length. The KV cache stores those keys and values so each step only computes them for the new token, which is the single biggest reason generation is tractable. Its cost is memory: the cache grows linearly with both sequence length and batch size, and managing it well (vLLM's paged attention) is a large part of what a serving engine does.
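To make the two sides of that trade concrete, here is a minimal arithmetic sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a hypothetical example chosen for round numbers, not a specific model, and the function names are illustrative.

```python
def kv_projections(num_tokens, use_cache):
    """Count key/value computations needed to generate num_tokens tokens."""
    # Without a cache, step t recomputes keys/values for all t tokens so far.
    # With a cache, each step computes them only for the newest token.
    return num_tokens if use_cache else sum(range(1, num_tokens + 1))


def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Estimate cache size: one key and one value vector per token, layer, and KV head."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = keys + values
    return per_token * seq_len * batch_size


# Compute saved: quadratic work without the cache, linear work with it.
print(kv_projections(1000, use_cache=False), "vs", kv_projections(1000, use_cache=True))

# Memory paid: hypothetical model shape, assumed for illustration only.
one_request = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=1)
batch_of_16 = kv_cache_bytes(32, 8, 128, seq_len=4096, batch_size=16)
print(f"{one_request / 2**20:.0f} MiB for one 4096-token request")
print(f"{batch_of_16 / 2**30:.0f} GiB for 16 such requests")
```

Under these assumptions a single long request already holds hundreds of MiB, which is why the cache, not the weights, often limits how many requests fit at once.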
Continuous batching: keep the GPU full
Requests arrive at different times and finish at different lengths. Static batching waits for a whole batch, then runs it to completion — leaving the GPU idle as fast requests wait for slow ones. Continuous batching swaps finished sequences out and new ones in every step, keeping the GPU saturated. This is the main reason a good server delivers many times the throughput of a loop.
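The difference can be sketched with a toy scheduler simulation. It is a simplification, not how vLLM is implemented: it assumes all requests are already queued, every decode step takes the same time, and each active request emits one token per step. The function names and request lengths are made up for illustration.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Static batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), slots):
        steps += max(lengths[i:i + slots])
    return steps


def continuous_batching_steps(lengths, slots):
    """Continuous batching: free slots are refilled from the queue at every step."""
    queue = deque(lengths)
    active = []  # tokens still to generate for each running request
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the step runs.
        while queue and len(active) < slots:
            active.append(queue.popleft())
        steps += 1
        # Each active request emits one token; finished requests leave the batch.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical output lengths (in tokens): mostly short, with two long requests.
lengths = [2, 100, 3, 4, 5, 100, 2, 3]
print("static:    ", static_batching_steps(lengths, slots=2), "steps")
print("continuous:", continuous_batching_steps(lengths, slots=2), "steps")
```

With two slots, the static schedule spends most of its steps with one slot idle behind a long request, while the continuous schedule streams the short requests through the free slot.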
Latency vs throughput: the core trade
- Latency: time to the first (or last) token for one request. What an interactive user feels.
- Throughput: total tokens/second across all requests. What sets your cost per million tokens.
They pull against each other: bigger batches raise throughput (cheaper) but can raise per-request latency. Tune to the use case — a chat UI optimizes latency; a nightly batch job optimizes throughput.
# same model, two serving goals
chat UI (interactive) -> small batches, low latency (fast first token)
batch job (offline) -> large batches, high throughput (cheap per token)
# quantized weights (AWQ / GPTQ) use less memory
# -> a bigger KV cache + larger batches fit -> more throughput
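The latency-vs-throughput trade can be illustrated with a toy cost model. The constants below (a 20 ms fixed cost per decode step plus 0.5 ms per sequence in the batch) are invented for illustration and are not measurements of any real server; only the shape of the curve matters.

```python
def decode_step_ms(batch_size, fixed_ms=20.0, per_seq_ms=0.5):
    """Toy model (assumed): a fixed cost per step plus a small cost per sequence."""
    return fixed_ms + per_seq_ms * batch_size


def tokens_per_second(batch_size):
    """Each step emits one token for every sequence in the batch."""
    return batch_size * 1000.0 / decode_step_ms(batch_size)


for batch_size in (1, 8, 64):
    per_token_latency = decode_step_ms(batch_size)  # what one user waits per token
    throughput = tokens_per_second(batch_size)      # what the whole server delivers
    print(f"batch={batch_size:>2}  latency/token={per_token_latency:.1f} ms  "
          f"throughput={throughput:.0f} tok/s")
```

In this model, growing the batch makes every individual request slower per token while the server as a whole produces far more tokens per second, which is the trade a chat UI and a batch job resolve in opposite directions.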
From 4.6
Quantized serving compounds these wins: a 4-bit model (AWQ/GPTQ on GPU) uses less memory, which means a bigger KV cache and larger batches fit — so quantization buys throughput, not just disk size. Serving and compression are one decision.
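A rough back-of-the-envelope sketch of that headroom follows. Every number is an assumption for illustration: a 24 GiB GPU, an 8-billion-parameter model, and 128 KiB of KV cache per token. It also ignores activations, runtime overhead, and the scale and zero-point metadata that real 4-bit formats carry, so treat the result as an upper bound.

```python
def weight_bytes(num_params, bits_per_weight):
    """Memory taken by the weights alone at a given precision."""
    return num_params * bits_per_weight // 8


def kv_tokens_that_fit(gpu_bytes, num_params, bits_per_weight, kv_bytes_per_token):
    """Tokens of KV cache that fit in whatever memory the weights leave free."""
    free = gpu_bytes - weight_bytes(num_params, bits_per_weight)
    return max(free, 0) // kv_bytes_per_token


GPU_BYTES = 24 * 2**30        # hypothetical 24 GiB GPU
NUM_PARAMS = 8_000_000_000    # hypothetical 8B-parameter model
KV_PER_TOKEN = 128 * 2**10    # assumed 128 KiB of KV cache per token

fp16_tokens = kv_tokens_that_fit(GPU_BYTES, NUM_PARAMS, 16, KV_PER_TOKEN)
int4_tokens = kv_tokens_that_fit(GPU_BYTES, NUM_PARAMS, 4, KV_PER_TOKEN)

print(f"16-bit weights leave room for ~{fp16_tokens:,} cached tokens")
print(f" 4-bit weights leave room for ~{int4_tokens:,} cached tokens")
print(f"headroom ratio: {int4_tokens / fp16_tokens:.1f}x")
```

Under these assumptions the 4-bit model leaves room for more than twice as many cached tokens, and that extra room is what larger batches or longer contexts are built from.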
Key idea
A serving engine like vLLM turns a model into a cheap, concurrent service via the KV cache (don't recompute the prefix) and continuous batching (keep the GPU full). Tune the latency-vs-throughput trade to the use case, and remember quantization buys serving headroom, not just disk.
Key terms
- inference server: A service (e.g. vLLM) that batches, schedules, and streams generation efficiently.
- vLLM: A high-throughput inference engine using paged attention for KV-cache management.
- KV cache: Stored attention keys/values for prior tokens so each step only computes the new token's.
- continuous batching: Swapping finished requests out and new ones in each step to keep the GPU saturated.
- latency vs throughput: Per-request speed vs total tokens/second; bigger batches favor throughput over latency.
- quantized serving: Serving a low-bit model so more KV cache / larger batches fit, raising throughput.