LLMs vs SLMs: scale, cost, latency
After this lesson you can reason about the trade-off between large and small models in concrete terms — memory, latency, cost — and explain when a fine-tuned small model is the right choice over a frontier LLM.
"Large" and "small" language models run on the exact same machinery you've now seen — tokens, embeddings, Transformer blocks, next-token prediction. The difference is one number: the parameter count. That single number cascades into memory, speed, cost, and capability, and understanding the cascade tells you which model to reach for.
The one axis: parameters
Model size is measured in parameters: a small model like SmolLM2-135M has about 135 million; mid-size open models have 1–8 billion; frontier models have tens to hundreds of billions. More parameters means more capacity to absorb patterns during pretraining, which is why the biggest models display the broadest knowledge and strongest general reasoning.
But in a standard dense model, every parameter has to be stored in memory and used in the arithmetic for every token. So scale is not free.
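To put a number on "multiplied on every token", here is a minimal sketch using a common rule of thumb: a forward pass through a dense Transformer costs roughly 2 floating-point operations per parameter per token (one multiply, one add). The function name and the 70B comparison model are illustrative choices, and the estimate ignores the attention cost that grows with context length.

```python
# Rule of thumb (an approximation, not an exact count): a dense model
# spends about 2 FLOPs per parameter for each token it processes.
FLOPS_PER_PARAM_PER_TOKEN = 2


def forward_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass FLOPs needed to process one token."""
    return FLOPS_PER_PARAM_PER_TOKEN * n_params


small = forward_flops_per_token(135e6)  # a 135M-parameter model
large = forward_flops_per_token(70e9)   # a 70B-parameter model

print(f"135M model: {small:.2e} FLOPs per token")
print(f"70B model:  {large:.2e} FLOPs per token")
print(f"The larger model does about {large / small:.0f}x more arithmetic per token")
```

The ratio between the two models is simply the ratio of their parameter counts, which is why parameter count is a reasonable first proxy for compute.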
What scale costs: memory, latency, money
- Memory (VRAM). Parameters live in GPU memory. A rough rule: bytes ≈ parameters × bytes-per-parameter. At 2 bytes each (bf16), a 135M model needs ~0.27 GB just for weights; a 70B model needs ~140 GB, more than a typical single GPU holds, so the weights are usually split across several GPUs. Training needs several times more (for gradients and optimizer state), which is why fine-tuning a giant model is a data-center activity and fine-tuning an SLM fits on one consumer GPU.
- Latency. Generating each token runs the whole network. More parameters → more arithmetic per token → slower responses. Small models reply faster, which matters for interactive and high-volume use.
- Cost. Whether you pay an API per token or rent a GPU per hour, cost tracks compute, which tracks size. A task served billions of times is dramatically cheaper on a small model.
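The memory rule above is easy to turn into a quick estimator. The sketch below reproduces the 0.27 GB and 140 GB figures, then adds a rough training estimate. The 16 bytes per parameter used for training is an assumption (a common accounting for mixed-precision training with the Adam optimizer: bf16 weights and gradients plus fp32 master weights and two optimizer moments); it excludes activations and does not apply to parameter-efficient methods. The function names are illustrative.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1}

# Assumption: mixed-precision Adam training keeps roughly
# 2 (weights) + 2 (gradients) + 12 (fp32 master copy and two moments) bytes.
TRAINING_BYTES_PER_PARAM = 16


def weight_memory_gb(n_params: float, dtype: str = "bf16") -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def training_memory_gb(n_params: float) -> float:
    """Rough full fine-tuning footprint in GB, excluding activations."""
    return n_params * TRAINING_BYTES_PER_PARAM / 1e9


for name, n_params in [("135M", 135e6), ("70B", 70e9)]:
    print(
        f"{name}: weights ~{weight_memory_gb(n_params):.2f} GB, "
        f"full fine-tuning ~{training_memory_gb(n_params):.2f} GB"
    )
```

Under these assumptions the small model trains in a couple of gigabytes, while the 70B model needs on the order of a terabyte before activations are counted.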
What small models give up — and what they don't
Out of the box, a small model knows less and reasons less broadly. Ask a 135M base model open-ended trivia and it will disappoint. That's the honest downside.
But most production work is not open-ended. It's one narrow, repeated task: classify this ticket, extract these fields, answer questions about these docs, rewrite in this format. For a narrow task, broad knowledge is mostly wasted — and this is the crux:
The thesis of this course
On a single, well-defined task, a fine-tuned small model can match — sometimes beat — a model hundreds of times larger, at a fraction of the cost and latency, running on hardware you control. Fine-tuning concentrates a small model's limited capacity onto your task.
When to pick which
- Pick a large general model when you need broad, open-ended capability, your volume is low, and you don't want to train anything: a do-everything assistant, exploratory work, or the first prototype.
- Pick a fine-tuned small model when you have one repeatable task at meaningful volume, care about cost/latency/privacy, or need to run locally/offline. This is BrewSLM's target: take a small base, fine-tune it on your data, and ship something cheap and fast that's good enough — and then prove it's good enough (Track 4's SLM-vs-frontier report does exactly this).
A useful habit
Frame the decision as quality-per-dollar (or per-millisecond) for your task, not quality in the abstract. A model that's 95% as good at 2% of the cost is usually the better engineering choice — and you can only know those numbers by evaluating, which is why evaluation gets its own lessons.
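One way to make this habit concrete is to compute quality per dollar for each candidate and pick the best one that clears a minimum quality bar. This is a sketch only: the `Candidate` class, the `pick` helper, and every number below are hypothetical, chosen so the small model is 95% as good at 2% of the cost. Real values have to come from your own evaluation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    name: str
    quality: float               # task score from your own evaluation, 0 to 1
    cost_per_1k_requests: float  # dollars

    @property
    def quality_per_dollar(self) -> float:
        return self.quality / self.cost_per_1k_requests


def pick(candidates: list[Candidate], min_quality: float) -> Optional[Candidate]:
    """Return the best quality-per-dollar candidate that is good enough."""
    eligible = [c for c in candidates if c.quality >= min_quality]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.quality_per_dollar)


# Hypothetical numbers for illustration only.
frontier = Candidate("frontier LLM", quality=0.90, cost_per_1k_requests=10.00)
small = Candidate("fine-tuned SLM", quality=0.855, cost_per_1k_requests=0.20)

choice = pick([frontier, small], min_quality=0.85)
print(f"Chosen: {choice.name}")
print(f"Quality-per-dollar advantage: "
      f"{small.quality_per_dollar / frontier.quality_per_dollar:.1f}x")
```

The quality bar matters: if the task required a score of 0.88, the small model would be ruled out and the same helper would return the frontier model, however cheap the alternative.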
End of the foundations
You now have the whole mental model: what a model is, how it learns, how text becomes math, how the Transformer and next-token prediction produce language, the four levers for steering a model, and why small models are worth fine-tuning. The final Foundations lesson zooms all the way out to the shape of a real SLM project — the loop you'll execute for the rest of the course.
Key terms
- Parameter count. The number of weights in a model; the primary axis of "size."
- Small language model (SLM). A model with relatively few parameters (millions to a few billion).
- VRAM. GPU memory; weights need ≈ params × bytes-per-param, training needs several times more.
- Latency. Time to produce output; rises with model size since each token runs the whole network.
- Cost per token. The price of generating output, which tracks compute and therefore size.
- Narrow vs general task. A single repeatable job vs open-ended capability; narrow tasks favor fine-tuned SLMs.