Track 2 · Hands-on · Lesson 17

Public benchmarks & lm-evaluation-harness

After this lesson you can use public benchmarks (MMLU, HellaSwag, ARC, GSM8K) as honest smoke checks against a checkpoint rather than as the truth, run them with EleutherAI's lm-evaluation-harness, read few-shot vs zero-shot numbers correctly, recognise the contamination caveats that quietly inflate headline scores, and stop treating "MMLU 65%" as a project goal when your gold-set accuracy is the metric that actually pays the bills.

Level: intermediate · Read time: ~11 min · Prerequisites: LLM-as-a-judge

Every SLM model card carries a constellation of short benchmark names (MMLU, HellaSwag, ARC, GSM8K), each followed by a percentage. They're convenient. They let you compare against the base, against the previous fine-tune, and against the wider model landscape with one number per axis. They are also widely misread: as the goal of fine-tuning rather than a smoke check, as a measure of your task quality rather than a sanity bar, and (worst) as ground truth in cases where the benchmark has leaked into pretraining. This lesson covers what each benchmark actually measures, how to run them, and how to read the numbers without being misled.

What the standard benchmarks measure

None of these is your task. None of them measures the thing your fine-tune is supposed to do. They measure the substrate the fine-tune sits on top of.

Few-shot vs zero-shot

Almost every reported benchmark number has a shot count attached: MMLU 5-shot, HellaSwag 10-shot, GSM8K 8-shot CoT. The shot count is the number of labelled in-context examples prepended to the prompt before the question. The difference between zero-shot and 5-shot can be 20+ percentage points; reporting one and calling it the other is the most common honest mistake in benchmark tables.
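To make the shot count concrete, here is a minimal Python sketch of how a k-shot prompt could be assembled. The `build_prompt` helper, the "Question:/Answer:" template, and the example pairs are all hypothetical; lm-evaluation-harness uses its own per-task templates, so treat this as an illustration of the idea rather than what the harness does internally.

```python
from typing import Sequence, Tuple


def build_prompt(examples: Sequence[Tuple[str, str]], question: str, k: int) -> str:
    """Prepend the first k labelled (question, answer) pairs to the test question."""
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and len(examples)")
    shots = [f"Question: {q}\nAnswer: {a}" for q, a in examples[:k]]
    # The test question is left unanswered; the model completes after "Answer:".
    return "\n\n".join(shots + [f"Question: {question}\nAnswer:"])


# Hypothetical labelled examples, for illustration only.
examples = [("2 + 2 = ?", "4"), ("Capital of France?", "Paris")]

zero_shot = build_prompt(examples, "3 + 5 = ?", k=0)  # no in-context examples
two_shot = build_prompt(examples, "3 + 5 = ?", k=2)   # two labelled examples first
```

The same model sees a very different input in the two cases, which is why a score is only meaningful alongside its shot count.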

Read the convention with the score:

lm-evaluation-harness, the standard tool

EleutherAI's lm-evaluation-harness is the de-facto runner for these benchmarks. It supports hundreds of tasks under a unified evaluator, lets you point at a local Hugging Face checkpoint or a hosted endpoint, and reports per-task accuracy with a standard error. Many public model cards were scored with it.

# install
pip install lm-eval

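# evaluate a local checkpoint on the standard smoke-check set
# (--num_fewshot applies to every listed task, so this is a uniform 5-shot run,
#  not each benchmark's usual reporting convention)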
lm_eval \
  --model hf \
  --model_args pretrained=./runs/sft-v3,dtype=bfloat16 \
  --tasks mmlu,hellaswag,arc_easy,arc_challenge,gsm8k \
  --num_fewshot 5 \
  --batch_size 16 \
  --output_path eval/sft-v3.json

# compare against the base — same flags, different checkpoint
lm_eval \
  --model hf \
  --model_args pretrained=HuggingFaceTB/SmolLM2-135M,dtype=bfloat16 \
  --tasks mmlu,hellaswag,arc_easy,arc_challenge,gsm8k \
  --num_fewshot 5 \
  --batch_size 16 \
  --output_path eval/base.json

The output JSON has per-task accuracy and a standard error you should always report — a 0.4-point delta with a 0.6-point standard error is noise, not progress.
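A small Python sketch for pulling accuracy and standard error out of a results file might look like the following. It assumes the layout used by recent 0.4.x releases, where metrics sit under `results` with keys such as `acc,none` and `acc_stderr,none`; older versions and other metrics (GSM8K reports an exact-match metric, for example) use different keys, and recent versions may write a timestamped file under the output path. The `summarize` and `load_results` helpers and the sample numbers are made up for illustration.

```python
import json
from pathlib import Path


def load_results(path: str) -> dict:
    """Read a results JSON file from disk (the path is whatever your run produced)."""
    return json.loads(Path(path).read_text())


def summarize(results: dict) -> dict:
    """Return {task: (accuracy, standard_error)} for tasks that report accuracy.

    Assumes metric keys "acc,none" and "acc_stderr,none"; tasks without
    both keys are skipped rather than guessed at.
    """
    summary = {}
    for task, metrics in results.get("results", {}).items():
        acc = metrics.get("acc,none")
        se = metrics.get("acc_stderr,none")
        if acc is not None and se is not None:
            summary[task] = (acc, se)
    return summary


# Made-up numbers in the assumed layout, for illustration only.
sample = {
    "results": {
        "hellaswag": {"acc,none": 0.412, "acc_stderr,none": 0.0049},
        "gsm8k": {"exact_match,strict-match": 0.02},  # no "acc" key, so skipped
    }
}

for task, (acc, se) in summarize(sample).items():
    # Report the score and its standard error together, in percentage points.
    print(f"{task}: {100 * acc:.1f} ± {100 * se:.1f}")
```

Printing the two numbers side by side makes it harder to quote a score without its uncertainty.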

How to read the numbers

The four useful comparisons:

  1. Base vs SFT, same shot count — does SFT preserve substrate knowledge? A modest drop is normal (the fine-tune narrowed the model toward your task). A large drop on MMLU / HellaSwag is catastrophic forgetting; revisit Lesson 1.22's mitigations.
  2. Your fine-tune vs same-size peer models — places you in the landscape. Useful as a "do I have a competitive substrate?" check; not useful as a fine-tune goal.
  3. Across versions of your own fine-tune — for tracking regressions. If MMLU drops by 3 points between two runs, something changed in your data or recipe that you want to understand.
  4. Against the standard error that lm-eval reports, not just the point estimate: a difference smaller than the SE is statistical noise, not a result.
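The noise check in the last comparison can be sketched in a few lines of Python. This version combines the two standard errors as if the runs were independent and flags a change only when the delta exceeds roughly two combined standard errors; the `compare` helper, the threshold of 2.0, and the scores are assumptions for illustration. Because both checkpoints are scored on the same questions, a paired analysis on per-item results would be sharper, so read this as a conservative first pass.

```python
import math


def compare(acc_a: float, se_a: float, acc_b: float, se_b: float,
            z_threshold: float = 2.0):
    """Return (delta, se_of_delta, verdict) for two scores in percentage points."""
    delta = acc_b - acc_a
    # Standard error of a difference between two independent estimates.
    se_delta = math.sqrt(se_a ** 2 + se_b ** 2)
    if se_delta <= 0:
        raise ValueError("standard errors must be positive")
    verdict = "likely real" if abs(delta) / se_delta >= z_threshold else "within noise"
    return delta, se_delta, verdict


# Hypothetical scores: a 0.4-point gain with 0.6-point standard errors.
small = compare(41.2, 0.6, 41.6, 0.6)

# Hypothetical scores: a 3-point MMLU drop with 0.4-point standard errors.
large = compare(45.0, 0.4, 42.0, 0.4)

for name, (delta, se_delta, verdict) in {"small": small, "large": large}.items():
    print(f"{name}: delta={delta:+.1f}, SE of delta={se_delta:.2f}, {verdict}")
```

The two-standard-error bar is stricter than "smaller than the SE is noise"; either way, the delta is never read without its uncertainty.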

Why your gold set still matters more

Three reasons the public benchmarks are insufficient on their own:

Contamination caveats

Public benchmarks are widely scraped, widely indexed, and routinely end up in pretraining corpora. The fingerprint of a contaminated benchmark: the model gets the right answer on the canonical question but fails on lightly paraphrased versions of the same question. Several known contamination patterns:

Mitigations when you actually need a reliable number:
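One check that follows directly from the fingerprint described above is to score the same items twice, once in their canonical wording and once lightly paraphrased, and look at the gap. The Python sketch below assumes you already have the model's predicted answer letters for both versions; the `paraphrase_gap` helper and the data are hypothetical, and what counts as a suspicious gap is a judgment call that depends on how many items you have.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions that match the gold answers exactly."""
    if len(preds) != len(golds) or not golds:
        raise ValueError("preds and golds must be non-empty and the same length")
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def paraphrase_gap(canonical_preds: Sequence[str],
                   paraphrased_preds: Sequence[str],
                   golds: Sequence[str]) -> float:
    """Accuracy on canonical wording minus accuracy on paraphrases of the same items."""
    return accuracy(canonical_preds, golds) - accuracy(paraphrased_preds, golds)


# Hypothetical predictions for four multiple-choice items.
golds = ["B", "A", "D", "C"]
canonical = ["B", "A", "D", "C"]    # 4 of 4 correct on the published wording
paraphrased = ["B", "C", "A", "C"]  # 2 of 4 correct once the wording changes

gap = paraphrase_gap(canonical, paraphrased, golds)
print(f"canonical-vs-paraphrased gap: {100 * gap:.0f} points")
```

A gap near zero is what you would expect from a model that handles the content itself; a large positive gap is the memorisation signature.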

Honest beat — "MMLU 65%" is not a project goal

It's a smoke check. The reason your stakeholder cares about the SLM is the task: whether the JSON parses, whether the extraction lands, whether the support reply was correct. Public-benchmark scores belong in the appendix of the model card, not in the project's headline KPI. Build a gold set that measures the task; quote it; then report the substrate numbers as context.
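As a contrast with the public benchmarks, a task-level gold-set check can be very small. The Python sketch below scores raw model outputs for a JSON-extraction task on two things named above: whether the JSON parses and whether the extraction matches. The `score_gold` helper, the `intent` field, and the sample outputs are hypothetical; a real gold set would use your own schema and a matching rule suited to your task.

```python
import json
from typing import Sequence, Tuple


def score_gold(outputs: Sequence[str], gold: Sequence[dict]) -> Tuple[float, float]:
    """Return (parse_rate, exact_match_rate) for raw outputs against gold dicts."""
    if len(outputs) != len(gold) or not gold:
        raise ValueError("outputs and gold must be non-empty and the same length")
    parsed_ok = 0
    matched = 0
    for raw, expected in zip(outputs, gold):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output counts against both rates
        parsed_ok += 1
        if parsed == expected:
            matched += 1
    n = len(gold)
    return parsed_ok / n, matched / n


# Hypothetical model outputs and gold labels, for illustration only.
outputs = ['{"intent": "refund"}', '{"intent": "cancel"', '{"intent": "billing"}']
gold = [{"intent": "refund"}, {"intent": "cancel"}, {"intent": "refund"}]

parse_rate, match_rate = score_gold(outputs, gold)
print(f"parse rate: {parse_rate:.0%}, exact match: {match_rate:.0%}")
```

Numbers like these, measured on examples drawn from your own task, are the ones to put in the headline.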

Key idea

Public benchmarks are useful for placing your model in the landscape, for catching catastrophic regressions during fine-tuning, and for a shared vocabulary with the wider community. They are not useful as the truth, the project goal, or a substitute for a task-specific gold set. Run them with lm-evaluation-harness, publish the shot count with every score, report the standard error, watch for the canonical-vs-paraphrased contamination signature, and remember that your private gold set is the one that the user pays for.

The next lesson covers the bookkeeping that makes all this comparable across runs three weeks from now: experiment tracking with MLflow and W&B.

Key terms

lm-evaluation-harness
EleutherAI's open-source benchmark runner — the standard tool used to produce the MMLU / HellaSwag / ARC / GSM8K numbers in model cards.
MMLU
Massive Multitask Language Understanding: ~14k multiple-choice questions across 57 subjects. Usually reported 5-shot. The standard substrate-knowledge benchmark.
HellaSwag
Commonsense scenario-continuation MCQs; usually 10-shot. Tests world-model plausibility.
ARC
AI2 Reasoning Challenge: grade-school science MCQs in Easy / Challenge splits.
GSM8K
Grade-school math word problems requiring multi-step arithmetic; the standard chain-of-thought benchmark.
Few-shot vs zero-shot
Few-shot prepends N labelled examples to the prompt; zero-shot prepends none. Different shot counts produce very different scores; the shot count must be reported with the number.
Standard error (lm-eval)
The lm-evaluation-harness reports SE alongside accuracy. A delta smaller than the SE is noise, not a result.
Benchmark contamination
The benchmark's questions or answers appeared in the model's pretraining corpus; the model memorised rather than reasoned. The tell is a large gap between canonical and paraphrased versions of the same question.
Substrate vs task
Public benchmarks measure the substrate of general capability the fine-tune sits on; your gold set measures the task itself. Both matter, but the gold set is the one to optimise and report first.
