Track 0 · Foundations · Lesson 8

LLMs vs SLMs: scale, cost, latency

After this lesson you can reason about the trade-off between large and small models in concrete terms — memory, latency, cost — and explain when a fine-tuned small model is the right choice over a frontier LLM.

Level: beginner · Read time: ~9 min · Prerequisites: The four levers

"Large" and "small" language models run on the exact same machinery you've now seen — tokens, embeddings, Transformer blocks, next-token prediction. The difference is one number: the parameter count. That single number cascades into memory, speed, cost, and capability, and understanding the cascade tells you which model to reach for.

The one axis: parameters

Model size is measured in parameters: a small model like SmolLM2-135M has ~135 million; mid-size open models have 1–8 billion; frontier models have tens to hundreds of billions. More parameters mean more capacity to absorb patterns during pretraining, which is why the biggest models display the broadest knowledge and strongest general reasoning.

But every parameter has to be stored in memory, and in a standard dense model every parameter takes part in the computation for every token. So scale is not free.
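To put a rough number on "multiplied on every token", here is a minimal sketch using a common rule of thumb that is not from this lesson: a dense model's forward pass costs about 2 floating-point operations per parameter per token (one multiply and one add). The function name is hypothetical, and the estimate ignores attention over long contexts, batching, and hardware effects.

```python
def forward_flops_per_token(num_params: float) -> float:
    """Rough forward-pass compute per generated token for a dense model.

    Assumption (rule of thumb, not from the lesson): each parameter
    contributes about one multiply and one add per token, so ~2 FLOPs.
    """
    return 2.0 * num_params


slm_flops = forward_flops_per_token(135e6)  # ~135M-parameter model
llm_flops = forward_flops_per_token(70e9)   # ~70B-parameter model

print(f"SLM: ~{slm_flops / 1e9:.2f} GFLOPs per token")
print(f"LLM: ~{llm_flops / 1e9:.2f} GFLOPs per token")
print(f"Compute ratio: ~{llm_flops / slm_flops:.0f}x")
```

Under this assumption the compute gap per token is the same as the parameter gap, which is why latency and cost tend to track model size.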

What scale costs: memory, latency, money

metric           | SLM (≈135M)    | LLM (≈70B)
weights @ bf16   | ~0.27 GB       | ~140 GB
runs on          | laptop / 1 GPU | multi-GPU server
relative latency | low            | high
broad knowledge  | narrow         | broad
Orders of magnitude, not exact figures. Memory uses the bytes ≈ params × bytes-per-param rule of thumb (bf16 = 2 bytes).
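The memory rule of thumb can be checked with a few lines of Python. This is a minimal sketch: the function and dictionary names are hypothetical, it counts weights only (no activations, KV cache, or training state), and it uses decimal gigabytes (1 GB = 10^9 bytes).

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str = "bf16") -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in [("SLM (~135M)", 135e6), ("LLM (~70B)", 70e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.2f} GB at bf16")
```

Running it gives roughly 0.27 GB and 140 GB, the figures in the table. Passing a different dtype shows how precision moves the number: the same 70B model would need about 280 GB at fp32.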

What small models give up — and what they don't

Out of the box, a small model knows less and reasons less broadly. Ask a 135M base model open-ended trivia and it will disappoint. That's the honest downside.

But most production work is not open-ended. It's one narrow, repeated task: classify this ticket, extract these fields, answer questions about these docs, rewrite in this format. For a narrow task, broad knowledge is mostly wasted — and this is the crux:

The thesis of this course

On a single, well-defined task, a fine-tuned small model can match — sometimes beat — a model hundreds of times larger, at a fraction of the cost and latency, running on hardware you control. Fine-tuning concentrates a small model's limited capacity onto your task.

When to pick which

A useful habit

Frame the decision as quality-per-dollar (or per-millisecond) for your task, not quality in the abstract. A model that's 95% as good at 2% of the cost is usually the better engineering choice — and you can only know those numbers by evaluating, which is why evaluation gets its own lessons.
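The habit above can be sketched as a small comparison. Every number here is made up for illustration (a small model at 95% of the large model's quality and 2% of its cost, mirroring the example in the text); the class and field names are hypothetical, and real values have to come from your own evaluation and billing data.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    quality: float               # task score in [0, 1] from your own eval set
    cost_per_1k_requests: float  # dollars; hypothetical figures below


def quality_per_dollar(c: Candidate) -> float:
    """Task quality bought per dollar spent on 1,000 requests."""
    return c.quality / c.cost_per_1k_requests


# Illustrative values only: 95% of the quality at 2% of the cost.
frontier = Candidate("frontier LLM", quality=0.90, cost_per_1k_requests=10.00)
small = Candidate("fine-tuned SLM", quality=0.855, cost_per_1k_requests=0.20)

for c in sorted([frontier, small], key=quality_per_dollar, reverse=True):
    print(f"{c.name}: {quality_per_dollar(c):.3f} quality per dollar")

advantage = quality_per_dollar(small) / quality_per_dollar(frontier)
print(f"SLM advantage: ~{advantage:.1f}x")
```

The same structure works for quality-per-millisecond by swapping the cost field for measured latency. A ratio like this only helps if the small model also clears your minimum quality bar for the task.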

End of the foundations

You now have the whole mental model: what a model is, how it learns, how text becomes math, how the Transformer and next-token prediction produce language, the four levers for steering a model, and why small models are worth fine-tuning. The final Foundations lesson zooms all the way out to the shape of a real SLM project — the loop you'll execute for the rest of the course.

Key terms

Parameter count
The number of weights in a model; the primary axis of "size."
Small language model (SLM)
A model with relatively few parameters (millions to a few billion).
VRAM
GPU memory; weights need ≈ params × bytes-per-param bytes, and training needs several times more (for gradients and optimizer state).
Latency
Time to produce output; rises with model size since each token runs the whole network.
Cost per token
The price of generating output, which tracks compute and therefore size.
Narrow vs general task
A single repeatable job vs open-ended capability; narrow tasks favor fine-tuned SLMs.
