What's the default approach to model size when picking a base?

Default small; step up only when the small model demonstrably can't meet the task's quality bar.

Why is the tokenizer family a hard constraint for distillation?

Offline distillation requires teacher and student to share a tokenizer so the captured top-k token ids mean the same tokens for both models.

Track 0 · Foundations · Lesson 11

Picking a base model

After this lesson you can name the dimensions a base-model choice trades against — license, tokenizer, instruct quality, context length, size, community — and pick a sensible starting point for a project (or explain why this Academy defaults to SmolLM2-135M-Instruct).

Level: beginner · Read time: ~9 min · Prerequisites: Base vs instruct models

There is no universal "best" SLM. There are constraints, and the smallest model that meets your constraints is the right one to start with. The previous lesson sorted out which flavour of a family to use (almost always: instruct). This lesson is the checklist for picking which family in the first place.

The dimensions that matter

A base-model choice is a trade across six dimensions. Walk down the list, write the constraints your project actually has, and the field of candidates narrows fast:

License — can you legally ship what you train from it?
Tokenizer family — a contract with your dataset and a hard constraint for distillation.
Instruct quality — how aligned is the starting point already?
Context length — how long is your input plus output?
Size & latency budget — what fits, what's fast enough, what costs what?
Community — does the ecosystem support what you want to do next?
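As a rough illustration of how the hard constraints in this checklist narrow the field, the sketch below filters a list of candidates and sorts the survivors smallest first. The Candidate fields, the shortlist function, and every model name and number are hypothetical placeholders rather than real model specifications. Instruct quality and community are left out on purpose, because they need a human look rather than a threshold.

```python
from dataclasses import dataclass

# Hypothetical candidate records; names and numbers are placeholders,
# not the specifications of any real model.
@dataclass(frozen=True)
class Candidate:
    name: str
    license: str
    tokenizer_family: str
    context_tokens: int
    params_millions: int

def shortlist(candidates, allowed_licenses, min_context,
              max_params_millions, tokenizer_family=None):
    """Return candidates that satisfy every hard constraint, smallest first."""
    kept = [
        c for c in candidates
        if c.license in allowed_licenses
        and c.context_tokens >= min_context
        and c.params_millions <= max_params_millions
        and (tokenizer_family is None or c.tokenizer_family == tokenizer_family)
    ]
    return sorted(kept, key=lambda c: c.params_millions)

candidates = [
    Candidate("family-a-large", "apache-2.0", "family-a", 8192, 1700),
    Candidate("family-a-small", "apache-2.0", "family-a", 2048, 135),
    Candidate("family-b-mid", "research-only", "family-b", 32768, 1000),
]

picked = shortlist(
    candidates,
    allowed_licenses={"apache-2.0", "mit"},
    min_context=2048,
    max_params_millions=2000,
)
print([c.name for c in picked])  # ['family-a-small', 'family-a-large']
```

The research-only candidate drops out on license alone, before size or context are even considered, which is why the license check comes first.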

License: can you ship this commercially?

Check the license before training, not at deploy time. Permissive licenses (Apache 2.0, MIT) let you fine-tune and ship a commercial product without much friction. Others (research-only, non-commercial, or carrying "acceptable use" addenda) restrict what you can do with derivatives. A fine-tune doesn't reset the license; the parent model's terms travel with it.

For this Academy we use a permissively licensed family (Apache 2.0) so every reader can take what they learn into a product without a legal review.

Tokenizer: a contract with your data

The tokenizer is the rule that turns text into the model's vocabulary (Lesson 0.4). A few practical reasons it matters when you pick:

Vocab efficiency on your data. Different tokenizers split your domain differently. Code-heavy data tokenizes more efficiently under a tokenizer trained on code; multilingual data needs a multilingual tokenizer to avoid blowing up sequence lengths.
Special-token expectations. Instruct models bake their chat template into specific special tokens — they aren't interchangeable across families.
A hard constraint for distillation. Offline knowledge distillation (Track 4, Lesson 4.2) captures the teacher's top-k token ids. If teacher and student don't share a tokenizer, those ids mean different tokens to each model and the soft targets are gibberish. Pick teacher and student from the same family if distillation is on your roadmap.
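The distillation constraint can be checked mechanically before any training starts. The sketch below compares two token-to-id mappings and reports where they disagree. The toy vocabularies and both function names are made up for illustration. With Hugging Face tokenizers, the get_vocab() method returns a mapping of this shape, though special tokens and chat templates would still need a separate check.

```python
def tokenizers_compatible(teacher_vocab, student_vocab):
    """True when every token string maps to the same id in both vocabularies."""
    return teacher_vocab == student_vocab

def mismatched_tokens(teacher_vocab, student_vocab, limit=5):
    """List up to `limit` (token, teacher_id, student_id) triples that disagree."""
    diffs = []
    for token in sorted(set(teacher_vocab) | set(student_vocab)):
        t_id = teacher_vocab.get(token)
        s_id = student_vocab.get(token)
        if t_id != s_id:
            diffs.append((token, t_id, s_id))
            if len(diffs) >= limit:
                break
    return diffs

# Toy vocabularies (hypothetical). In the second student, ids 2 and 3 are swapped.
teacher = {"<pad>": 0, "the": 1, "cat": 2, "dog": 3}
student_same_family = dict(teacher)
student_other_family = {"<pad>": 0, "the": 1, "dog": 2, "cat": 3}

print(tokenizers_compatible(teacher, student_same_family))   # True
print(tokenizers_compatible(teacher, student_other_family))  # False
print(mismatched_tokens(teacher, student_other_family))
# [('cat', 2, 3), ('dog', 3, 2)]
```

In the mismatched case, a teacher top-k entry for id 2 means "cat" to the teacher but "dog" to the student, which is exactly how soft targets turn into noise.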

Instruct quality: how aligned is the starting point?

Not all instruct models are equally well-tuned. Some refuse confidently; some are chatty; some hold formats well; some drift. The cheapest way to find out is to load two or three candidate instruct models, hand them a few of your real prompts, and look at the raw outputs. Baseline quality is your starting line. A fine-tune lifts it, but starting closer to the answer means cheaper, faster lifts.
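One way to run that comparison is a small side-by-side harness. The sketch below assumes each candidate is wrapped in a plain function that takes a prompt and returns text. The two generators here are stand-in lambdas with canned replies so the example runs without downloading any weights. The candidate names and the prompt are hypothetical.

```python
from typing import Callable, Dict, List

def side_by_side(generators: Dict[str, Callable[[str], str]],
                 prompts: List[str]) -> List[dict]:
    """Run every prompt through every candidate and collect the raw outputs."""
    rows = []
    for prompt in prompts:
        row = {"prompt": prompt}
        for name, generate in generators.items():
            row[name] = generate(prompt)
        rows.append(row)
    return rows

# Stand-in generators with canned replies (assumption: in real use, each
# callable would wrap one candidate model's generation call).
generators = {
    "candidate-a": lambda prompt: "LABEL: positive",
    "candidate-b": lambda prompt: "Sure! Happy to help. The sentiment is positive.",
}
prompts = ["Classify the sentiment: 'Great battery life.'"]

rows = side_by_side(generators, prompts)
for row in rows:
    for key, value in row.items():
        print(f"{key:12} | {value}")
```

Reading the raw strings next to each other makes format drift and chattiness visible before any metric is computed.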

Context length: how long is your input?

Modern SLMs ship with context windows from 2k up to 128k tokens. Match this to your task's actual input length distribution, not the worst case from a paper. Long-context models cost more compute per token and more memory per request; picking 32k when 2k would have done is real overhead at scale.
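Matching the window to the real distribution can be as simple as taking a high percentile of measured lengths and picking the smallest window that covers it. The sketch below assumes you have already counted input plus output tokens per example. The length list, the window sizes, and the 99th-percentile cutoff are illustrative choices, not recommendations.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def smallest_sufficient_window(total_lengths, windows, pct=99):
    """Smallest window covering the pct-th percentile length, or None."""
    need = percentile(total_lengths, pct)
    for window in sorted(windows):
        if window >= need:
            return window
    return None

# Hypothetical per-example token counts (input plus expected output).
lengths = [310, 420, 515, 610, 730, 880, 940, 1200, 1500, 1900]
windows = [2048, 8192, 32768, 131072]

print(percentile(lengths, 99))                       # 1900
print(smallest_sufficient_window(lengths, windows))  # 2048
```

Examples beyond the chosen percentile still need a plan, such as truncation or chunking, but they no longer dictate the window for every request.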

Size & latency budget: small first, step up with reason

Default to the smallest model that plausibly works for the task. Common starting points: 135M, 360M, 1B, 1.5B, 3B. Step up to 7B only when the smaller model demonstrably can't clear your gold-set bar after honest data work — not because bigger sounds safer.

The cost is not just GPU memory at training (Track 1, Lesson 1.15). A 7B model has roughly 7× the parameters of a 1B model and about 50× those of a 135M model, and weight memory and per-token inference compute grow roughly in proportion, on every request, forever. If a 1B model gets you to your quality gate, ship the 1B.
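A back-of-the-envelope calculation makes the size gap concrete. The sketch below assumes 16-bit weights (2 bytes per parameter) and uses the common rule of thumb of about 2 FLOPs per parameter per generated token. It ignores the KV cache, activations, and batching effects, so treat the numbers as rough lower bounds rather than measurements.

```python
def weight_memory_gb(params, bytes_per_param=2):
    """Approximate memory for the weights alone, in GB (assumes 16-bit weights)."""
    return params * bytes_per_param / 1e9

def forward_flops_per_token(params):
    """Rule-of-thumb forward-pass cost: about 2 FLOPs per parameter per token."""
    return 2 * params

sizes = {"135M": 135e6, "1B": 1e9, "7B": 7e9}
baseline = sizes["135M"]

for name, params in sizes.items():
    print(
        f"{name:>4}: weights ~{weight_memory_gb(params):.2f} GB, "
        f"~{forward_flops_per_token(params):.1e} FLOPs/token, "
        f"{params / baseline:.1f}x the 135M model"
    )
```

Under these assumptions the weights alone come to about 0.27 GB at 135M, 2 GB at 1B, and 14 GB at 7B.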

"Small first" can fail — be honest about it

If your task genuinely requires broad world knowledge or multi-step reasoning, no amount of LoRA on a 135M model will get you there. That's the lesson of LLMs vs SLMs made concrete: small models give some things up. If your eval makes that clear, stepping up to a larger model — or switching to RAG, or distilling from a bigger teacher — is the right answer, not more epochs.

Community: who else is fine-tuning on this?

A model with an active fine-tuning community is a model you can ride: more recipes, more reported quirks, more shared evals, more tools that work on day one. Hugging Face download counts, the number of public LoRA adapters in the family, the activity on the model card's issues — all rough but useful signals. Picking an obscure model is fine if you have a strong reason; otherwise the safe pick is one the ecosystem already knows.

Why this Academy uses SmolLM2-135M-Instruct

Concretely, the choice for the whole course is a trade matching the constraints above:

License: Apache 2.0 — you can ship a commercial fine-tune.
Tokenizer: HF-native, shared across the SmolLM2 family — leaves the door open for distillation across sizes.
Instruct quality: usable baseline alignment; well-behaved chat template.
Context length: 2k — sufficient for classification, extraction, short Q&A; the Academy's running tasks fit.
Size: 135M parameters — fits comfortably on a single consumer GPU (and even runs slowly on CPU), so every reader can follow along without a server.
Community: HF-native, broad tooling support; vLLM, llama.cpp / GGUF, peft / TRL all work day one.

Key idea

There is no universal best: pick the smallest model that meets your license, tokenizer, context, and quality constraints. The instinct to "just use the biggest" is exactly backwards: bigger means more memory, more latency, more cost on every request, forever. Step up only when the small model demonstrably can't.

That ends the Foundations track. You now know what a model is, how it learns, how text becomes tokens, how attention and the Transformer work, how a language model predicts the next token, the four levers you can pull, the trade between large and small, what an instruct model is, and how to pick a base. From here we go deep on supervised fine-tuning itself — Track 1.

Key terms

Model license: The terms attached to a base model that travel with every derivative — including your fine-tune. Check before training, not at deploy.
Tokenizer family: A shared vocabulary and special-token set across a model family; required for cross-model distillation and shared chat templates.
Instruct quality: How well an instruct model already follows instructions and respects formats before any fine-tuning; the starting line a fine-tune lifts from.
Context length: The maximum tokens the model can take in plus produce; pick to match your real input/output distribution, not the worst case.
Small first: Default to the smallest model that plausibly works; step up only when the eval makes the case.
Community support: Active fine-tunes, shared evals, tools that work day one; the practical reason to favour well-known families when other constraints tie.
