Public benchmarks & lm-evaluation-harness
After this lesson you can use public benchmarks (MMLU, HellaSwag, ARC, GSM8K) as honest smoke checks against a checkpoint rather than as the truth, run them with EleutherAI's lm-evaluation-harness, read few-shot vs zero-shot numbers correctly, and recognise the contamination caveats that quietly inflate headline scores. You will also stop treating "MMLU 65%" as a project goal when your gold-set accuracy is the metric that actually pays the bills.
Every SLM model card carries a constellation of short benchmark names (MMLU, HellaSwag, ARC, GSM8K), each followed by a percentage. They're convenient. They let you compare against the base, against the previous fine-tune, and against the wider model landscape with one number per axis. They are also widely misread: as the goal of fine-tuning rather than a smoke check, as a measure of your task quality rather than a sanity bar, and (worst) as ground truth in cases where the benchmark has leaked into pretraining. This lesson covers what each benchmark actually measures, how to run them, and how to read the numbers without being misled.
What the standard benchmarks measure
- MMLU (Massive Multitask Language Understanding) — multiple-choice questions across 57 subjects from elementary to professional. The closest thing to a general "what does this model know?" score. Usually reported 5-shot.
- HellaSwag — pick the most plausible continuation of a short scenario from four options. Tests commonsense / world-model. Usually 10-shot; zero-shot scores look very different.
- ARC (Easy / Challenge) — grade-school science MCQs. ARC-Challenge filters for questions retrieval-based methods couldn't solve, making it harder.
- GSM8K — grade-school math word problems requiring multi-step arithmetic reasoning. Of the benchmarks here, it is the best indicator of whether a chain-of-thought-style instruct model is doing real work.
- TruthfulQA — questions designed to elicit common misconceptions; scores how often the model resists the wrong-but-popular answer.
- WinoGrande — Winograd-style pronoun resolution; lightweight commonsense reasoning.
None of these is your task. None of them measures the thing your fine-tune is supposed to do. They measure the substrate the fine-tune sits on top of.
Few-shot vs zero-shot
Almost every reported benchmark number has a shot count attached: MMLU 5-shot, HellaSwag 10-shot, GSM8K 8-shot CoT. The shot count is the number of labelled in-context examples prepended to the prompt before the question. The difference between zero-shot and 5-shot can be 20+ percentage points; reporting one and calling it the other is the most common honest mistake in benchmark tables.
Read the convention with the score:
- Zero-shot tells you what the model knows without help. The honest measure of a base model's knowledge.
- Few-shot tells you how well the model picks up a task from examples. The honest measure of in-context learning. Higher numbers, but they're measuring a different thing.
- An instruct model usually wins on zero-shot vs its base; a base model often wins on few-shot vs an instruct, because instruct training rewards templated single-shot completions over learning from in-context examples.
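To make the shot count concrete, here is a minimal sketch of how a k-shot prompt could be assembled. The `build_prompt` helper and the "Question: / Answer:" template are illustrative assumptions, not the template lm-evaluation-harness uses for any particular task.

```python
from typing import List, Tuple


def build_prompt(shots: List[Tuple[str, str]], question: str, k: int) -> str:
    """Prepend the first k labelled (question, answer) pairs to the test question.

    Hypothetical helper for illustration; real benchmark templates differ per task.
    """
    if k < 0 or k > len(shots):
        raise ValueError(f"k must be between 0 and {len(shots)}, got {k}")
    # Each labelled example shows the model both the question and its answer.
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in shots[:k]]
    # The test question comes last, with the answer left blank for the model.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Made-up examples, only to show the difference in prompt shape.
shots = [("What is 2 + 2?", "4"), ("What is 3 + 5?", "8")]
zero_shot = build_prompt(shots, "What is 7 + 6?", k=0)
two_shot = build_prompt(shots, "What is 7 + 6?", k=2)
print(two_shot)
```

The model under test is identical in both cases; only the prompt changes, which is why a score means little without its shot count.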
lm-evaluation-harness, the standard tool
EleutherAI's lm-evaluation-harness is the de-facto runner for these benchmarks. It supports hundreds of tasks under a unified evaluator, lets you point at a local Hugging Face checkpoint or a hosted endpoint, and reports per-task accuracy plus standard error. Many of the benchmark numbers on public model cards were produced with it.
```shell
# install
pip install lm-eval

# evaluate a local checkpoint on the standard smoke-check set
# note: --num_fewshot applies the same shot count to every task listed;
# run a task separately if you need its own published convention (e.g. HellaSwag 10-shot)
lm_eval \
  --model hf \
  --model_args pretrained=./runs/sft-v3,dtype=bfloat16 \
  --tasks mmlu,hellaswag,arc_easy,arc_challenge,gsm8k \
  --num_fewshot 5 \
  --batch_size 16 \
  --output_path eval/sft-v3.json

# compare against the base: same flags, different checkpoint
lm_eval \
  --model hf \
  --model_args pretrained=HuggingFaceTB/SmolLM2-135M,dtype=bfloat16 \
  --tasks mmlu,hellaswag,arc_easy,arc_challenge,gsm8k \
  --num_fewshot 5 \
  --batch_size 16 \
  --output_path eval/base.json
```
The output JSON has per-task accuracy and a standard error you should always report — a 0.4-point delta with a 0.6-point standard error is noise, not progress.
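For intuition about where a standard error of that size comes from, the sketch below uses the textbook binomial approximation, SE = sqrt(p(1 - p) / n). This is an assumption made for illustration; the harness computes its own estimate, which may differ slightly. The task names and question counts are made up.

```python
import math


def accuracy_standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial approximation to the standard error of an accuracy score.

    accuracy is a fraction in [0, 1]; n_questions is the number of scored items.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if n_questions <= 0:
        raise ValueError("n_questions must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


# Hypothetical tasks: the same accuracy is far less certain on a small test set.
for name, n in [("hypothetical_small_task", 500), ("hypothetical_large_task", 10_000)]:
    se = accuracy_standard_error(0.65, n)
    print(f"{name}: n={n}, SE = {100 * se:.2f} points")
```

Because the SE shrinks with the square root of the question count, a fixed-size delta can be meaningful on a large benchmark and noise on a small one.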
How to read the numbers
The four useful comparisons:
- Base vs SFT, same shot count — does SFT preserve substrate knowledge? A modest drop is normal (the fine-tune narrowed the model toward your task). A large drop on MMLU / HellaSwag is catastrophic forgetting; revisit Lesson 1.22's mitigations.
- Your fine-tune vs same-size peer models — places you in the landscape. Useful as a "do I have a competitive substrate?" check; not useful as a fine-tune goal.
- Across versions of your own fine-tune — for tracking regressions. If MMLU drops by 3 points between two runs, something changed in your data or recipe that you want to understand.
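- Any delta vs the lm-eval-harness standard error — a difference smaller than the SE is statistical noise, not a result. When you compare two runs, each score carries its own SE, so the uncertainty on the difference is larger than either SE alone.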
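The comparisons above can be sketched as a small check that labels each per-task delta as noise, gain, or regression. The `classify_delta` helper, the threshold of two combined standard errors, and all the scores are assumptions for illustration. Treating the two runs as independent is also a simplification, since both are scored on the same questions.

```python
import math
from typing import Dict, Tuple

Score = Tuple[float, float]  # (accuracy, standard error), both as fractions


def classify_delta(base: Score, tuned: Score, z_threshold: float = 2.0) -> str:
    """Label the change from base to tuned as 'noise', 'gain', or 'regression'."""
    delta = tuned[0] - base[0]
    # Standard error of a difference between two independent estimates.
    combined_se = math.sqrt(base[1] ** 2 + tuned[1] ** 2)
    if combined_se == 0.0:
        if delta == 0.0:
            return "noise"
        return "gain" if delta > 0 else "regression"
    z = delta / combined_se
    if abs(z) < z_threshold:
        return "noise"
    return "gain" if z > 0 else "regression"


# Made-up scores for two hypothetical tasks.
base_scores: Dict[str, Score] = {"task_a": (0.300, 0.006), "task_b": (0.420, 0.005)}
tuned_scores: Dict[str, Score] = {"task_a": (0.304, 0.006), "task_b": (0.380, 0.005)}

for task in base_scores:
    print(task, classify_delta(base_scores[task], tuned_scores[task]))
```

A stricter or looser threshold is a judgment call; the point is that the decision is made against the uncertainty, not against the raw delta.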
Why your gold set still matters more
Three reasons the public benchmarks are insufficient on their own:
- They measure substrate, not your task. A 5-point MMLU bump tells you nothing about whether your sentiment classifier is more accurate or your JSON extractor parses more reliably. Your task is your task.
- They are multiple-choice or short-answer. Real outputs are free-form. MMLU accuracy correlates loosely with free-form helpfulness; LLM-as-a-judge (Lesson 16) and your gold set close the gap.
- They're contaminated. Which brings us to —
Contamination caveats
Public benchmarks are widely scraped, widely indexed, and routinely end up in pretraining corpora. The fingerprint of a contaminated benchmark: the model gets the right answer on the canonical question but fails on lightly paraphrased versions of the same question. Several known contamination patterns:
- MMLU has substantial overlap with publicly indexed test-bank questions; many model families have shown the canonical-vs-paraphrased gap.
- HellaSwag contexts and ARC questions have appeared verbatim in pretraining web scrapes.
- GSM8K problems (roughly 8,500 in the original set) have shown up in code repositories, blog posts, and instruction-tuning datasets. Re-rendered variants such as GSM-Hard and GSM-Symbolic help test whether a model learned the structure or memorised the corpus.
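The canonical-vs-paraphrased fingerprint can be measured directly once you have per-question correctness on both versions. The sketch below assumes you already scored each canonical question and its paraphrase; `contamination_gap` and the sample results are hypothetical.

```python
from typing import Dict, Sequence


def contamination_gap(
    canonical_correct: Sequence[bool], paraphrased_correct: Sequence[bool]
) -> Dict[str, float]:
    """Compare accuracy on canonical questions with accuracy on their paraphrases.

    Index i in both sequences must refer to the same underlying question.
    """
    if len(canonical_correct) != len(paraphrased_correct):
        raise ValueError("both sequences must cover the same questions")
    if not canonical_correct:
        raise ValueError("need at least one question")
    n = len(canonical_correct)
    canonical_acc = sum(canonical_correct) / n
    paraphrased_acc = sum(paraphrased_correct) / n
    # Questions solved only in their canonical wording are the suspicious ones.
    only_canonical = sum(
        1 for c, p in zip(canonical_correct, paraphrased_correct) if c and not p
    )
    return {
        "canonical_acc": canonical_acc,
        "paraphrased_acc": paraphrased_acc,
        "gap": canonical_acc - paraphrased_acc,
        "only_canonical_fraction": only_canonical / n,
    }


# Made-up results for five question pairs.
report = contamination_gap(
    [True, True, True, False, True],
    [True, False, False, False, True],
)
print(report)
```

A small gap is expected from paraphrase noise alone, so treat the gap as a signal to investigate rather than proof of memorisation.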
Mitigations when you actually need a reliable number:
- Run a paraphrased / re-rendered variant of the benchmark alongside the canonical one. A large canonical-paraphrased gap is the contamination tell.
- Read the model card's contamination section if there is one. Some recent SLM reports (for example Phi, SmolLM) describe n-gram overlap or decontamination checks.
- Treat public benchmarks as relative, not absolute. Your fine-tune's delta vs your own base is harder to game than the absolute score.
- Keep a small, private, never-published task gold set. That number is the one that matters.
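If you have access to a sample of training text, a rough n-gram overlap check is one way to look for verbatim leakage. This is a minimal sketch under simple assumptions: whitespace tokenisation, lowercasing, and an 8-word window chosen arbitrarily. It is not the procedure any specific model card uses, and the example strings are invented.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokenised)."""
    tokens = text.lower().split()
    return {tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in corpus_text."""
    item_grams = word_ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    corpus_grams = word_ngrams(corpus_text, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Invented benchmark item and two invented corpus snippets.
item = "a train leaves the station at noon travelling at sixty miles per hour"
verbatim = "blog post : " + item + " . answer below"
paraphrase = "at midday a locomotive departs and moves at 60 mph"

print(ngram_overlap(item, verbatim))
print(ngram_overlap(item, paraphrase))
```

Exact-match n-grams only catch verbatim copies; a paraphrased leak scores zero here, which is why this check complements the canonical-vs-paraphrased comparison rather than replacing it.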
Honest beat — "MMLU 65%" is not a project goal
It's a smoke check. The reason your stakeholder cares about the SLM is the task: whether the JSON parses, whether the extraction lands, whether the support reply was correct. Public-benchmark scores belong in the appendix of the model card, not in the project's headline KPI. Build a gold set that measures the task; quote it; then report the substrate numbers as context.
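As a sketch of what a task-level gold-set metric can look like, the example below scores a hypothetical JSON-extraction task on two numbers: how often the output parses, and how often it matches the expected object exactly. The `score_gold_set` helper, the field names, and the sample outputs are all assumptions for illustration.

```python
import json
from typing import Any, Dict, List


def score_gold_set(outputs: List[str], expected: List[Dict[str, Any]]) -> Dict[str, float]:
    """Score raw model outputs against gold JSON objects.

    parse_rate: fraction of outputs that are valid JSON.
    exact_match: fraction whose parsed value equals the gold object.
    """
    if len(outputs) != len(expected):
        raise ValueError("outputs and expected must have the same length")
    if not expected:
        raise ValueError("gold set is empty")
    parsed_ok = 0
    exact = 0
    for raw, gold in zip(outputs, expected):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output counts against both metrics
        parsed_ok += 1
        if parsed == gold:
            exact += 1
    n = len(expected)
    return {"parse_rate": parsed_ok / n, "exact_match": exact / n}


# Made-up outputs: one correct, one truncated, one with a wrong field value.
outputs = [
    '{"name": "Ada", "age": 36}',
    '{"name": "Bob", "age": 41',
    '{"name": "Cy", "age": 29}',
]
expected = [
    {"name": "Ada", "age": 36},
    {"name": "Bob", "age": 41},
    {"name": "Cy", "age": 30},
]
scores = score_gold_set(outputs, expected)
print(scores)
```

Numbers like these are the ones to put in the headline; the public-benchmark scores then sit beside them as context.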
Key idea
Public benchmarks are useful for placing your model in the landscape, for catching catastrophic regressions during fine-tuning, and for a shared vocabulary with the wider community. They are not useful as the truth, the project goal, or a substitute for a task-specific gold set. Run them with lm-evaluation-harness, state the shot count next to every score, report the standard error, watch for the canonical-vs-paraphrased contamination signature, and remember that your private gold set measures the thing the user actually pays for.
The next lesson covers the bookkeeping that makes all this comparable across runs three weeks from now: experiment tracking with MLflow and W&B.
Key terms
- lm-evaluation-harness: EleutherAI's open-source benchmark runner, the standard tool used to produce the MMLU / HellaSwag / ARC / GSM8K numbers in model cards.
- MMLU: Massive Multitask Language Understanding, ~14k multiple-choice questions across 57 subjects. Usually reported 5-shot. The standard substrate-knowledge benchmark.
- HellaSwag: Commonsense scenario-continuation MCQs; usually 10-shot. Tests world-model plausibility.
- ARC: AI2 Reasoning Challenge, grade-school science MCQs in Easy / Challenge splits.
- GSM8K: Grade-school math word problems requiring multi-step arithmetic; the standard chain-of-thought benchmark.
- Few-shot vs zero-shot: Few-shot prepends N labelled examples to the prompt; zero-shot prepends none. Different shot counts produce very different scores; the shot count must be reported with the number.
- Standard error (lm-eval): The lm-evaluation-harness reports SE alongside accuracy. A delta smaller than the SE is noise, not a result.
- Benchmark contamination: The benchmark's questions or answers appeared in the model's pretraining corpus, so the model may have memorised rather than reasoned. The tell is a large gap between canonical and paraphrased versions of the same question.
- Substrate vs task: Public benchmarks measure the substrate of general capability the fine-tune sits on; your gold set measures the task itself. Both matter, but the gold set is the number that decides whether the project succeeded.