Track 2 · Hands-on · Lesson 12

Real metrics with sklearn & HF evaluate

After this lesson you can replace Lesson 2.7's hand-rolled accuracy with per-class precision, recall, and F1 using sklearn's classification_report, read a confusion matrix, distinguish macro vs micro F1 on imbalanced data, and use HF evaluate for the shared standards everyone reports.

Level: intermediate · Read time: ~9 min · Prerequisites: Evaluate by hand: run the gold set, compute the metric

Lesson 2.7 hand-rolled accuracy: count the correct predictions, divide by the total, print one number. That was useful for teaching what an eval is, but in any real project you'll be asked harder questions: "is it confusing positives for neutrals?", "how well does it do on the minority class?", "what does your F1 actually measure?" Answering those takes per-class metrics. We have two clean tools: sklearn for the standard classification metrics, and HF evaluate for the shared-standard interface used in papers and leaderboards.

Why accuracy alone misleads

Suppose your gold set is 90% positive and 10% negative, a skew that is common in real-world data. A model that always predicts "positive" achieves 90% accuracy without learning anything. The hand-rolled metric from Lesson 2.7 would call that run a strong success. Per-class recall would immediately show the failure: recall on the negative class is 0, because not one negative example was recovered. The first lesson of real evaluation is: report per-class.
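To see the gap in numbers, here is a minimal sketch of that always-"positive" model. The 90/10 label lists are made up to match the scenario above; they are not the lesson's gold set.

```python
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical gold set: 90 positive and 10 negative examples.
y_true = ["positive"] * 90 + ["negative"] * 10
# A "model" that ignores its input and always answers "positive".
y_pred = ["positive"] * 100

accuracy = accuracy_score(y_true, y_pred)
# average=None returns one recall per class, in the order given by labels.
neg_recall, pos_recall = recall_score(
    y_true, y_pred, labels=["negative", "positive"], average=None
)

print(f"accuracy       : {accuracy:.2f}")    # 0.90
print(f"negative recall: {neg_recall:.2f}")  # 0.00
print(f"positive recall: {pos_recall:.2f}")  # 1.00
```

The single accuracy number hides exactly the class you would most want to know about.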

sklearn's classification_report — the workhorse

One call gives you per-class precision, recall, F1, support, plus macro and weighted averages. It's the right default for any classification eval.

from sklearn.metrics import classification_report

# From Lesson 2.7's predict()
y_true = [g["label"]                       for g in gold]
y_pred = [predict(tuned, g["prompt"])      for g in gold]

print(classification_report(y_true, y_pred, digits=3))
#               precision    recall  f1-score   support
#     negative      0.857     0.857     0.857        14
#      neutral      0.500     0.333     0.400         6
#     positive      0.944     0.971     0.958        70
#
#     accuracy                          0.911        90
#    macro avg      0.767     0.721     0.738        90
# weighted avg      0.901     0.911     0.905        90

This single report contains the whole story. Accuracy 91% looks fine until you see the neutral class is at 40% F1 — that's where the model is weak, and that's where your next data round should focus.

The confusion matrix — where the misses go

Failure analysis (Lesson 2.7) gets concrete with a confusion matrix: rows are the true labels, columns are the predicted labels, and the diagonal holds the correct predictions. Off-diagonal cells tell you which classes the model mistakes for which.

from sklearn.metrics import confusion_matrix
import pandas as pd

labels = ["negative", "neutral", "positive"]
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))
#           negative  neutral  positive
# negative        12        1         1
# neutral          1        2         3
# positive         1        1        68

Read the row for neutral: of 6 true neutrals, 3 were predicted positive. That's a directional mistake — the model is collapsing the middle class toward positive. The fix isn't more epochs; it's harder neutral examples in the next training round (Lesson 1.7's hard-negative / ambiguous-case section, made specific).
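Raw counts are hard to compare across classes with very different support. One option, sketched below, is to ask confusion_matrix for row-normalized values (normalize="true") and pull out the most common wrong answer for each class. The labels are rebuilt from the counts in the matrix above only so the snippet runs on its own; with real data you would pass your own y_true and y_pred. The top_confusion name is illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

labels = ["negative", "neutral", "positive"]
# counts[i][j] = examples with true label i predicted as label j
# (assumption: copied from the matrix above so the snippet is standalone)
counts = [[12, 1, 1], [1, 2, 3], [1, 1, 68]]
y_true, y_pred = [], []
for true_label, row in zip(labels, counts):
    for pred_label, n in zip(labels, row):
        y_true += [true_label] * n
        y_pred += [pred_label] * n

# normalize="true" divides each row by its total, so each row sums to 1
# and cm[i, j] is the share of true class i that was predicted as class j.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")

# Zero the diagonal, then take the largest remaining cell in each row:
# that is the most frequent wrong prediction for that true class.
off_diag = cm.copy()
np.fill_diagonal(off_diag, 0.0)
top_confusion = {}
for i, true_label in enumerate(labels):
    j = int(off_diag[i].argmax())
    top_confusion[true_label] = (labels[j], float(off_diag[i, j]))
    print(f"{true_label:>8} -> {labels[j]:<8} {off_diag[i, j]:.0%}")
# The neutral row reports: neutral -> positive 50%
```

Normalizing by row makes the neutral problem stand out: half of all true neutrals land in the positive column, while the other two classes lose only a few percent each.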

Macro vs micro F1: pick the one that doesn't lie

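There are several ways to average per-class F1 into one number, and the choice matters on imbalanced data:

Macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally and a failing minority class pulls the number down.
Weighted F1 is the mean of the per-class F1 scores weighted by support, so the majority class dominates.
Micro F1 pools true positives, false positives, and false negatives across all classes before computing F1; for single-label classification it equals accuracy.

For the report above: macro-F1 = 0.738, weighted F1 = 0.905, and micro-F1 = accuracy = 0.911. The weighted and micro numbers tell the "model is great" story; macro tells the "neutral class needs work" story. All of them are true; pick the one that matches the decision you're making.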
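As a quick check, the averages can be reproduced with sklearn's f1_score. This sketch rebuilds y_true and y_pred from the confusion-matrix counts shown earlier (an assumption made so the snippet is self-contained) and also prints micro-F1 next to accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score

labels = ["negative", "neutral", "positive"]
# counts[i][j] = examples with true label i predicted as label j
# (assumption: taken from the lesson's confusion matrix)
counts = [[12, 1, 1], [1, 2, 3], [1, 1, 68]]
y_true, y_pred = [], []
for true_label, row in zip(labels, counts):
    for pred_label, n in zip(labels, row):
        y_true += [true_label] * n
        y_pred += [pred_label] * n

macro = f1_score(y_true, y_pred, average="macro")        # plain mean of per-class F1
weighted = f1_score(y_true, y_pred, average="weighted")  # mean weighted by support
micro = f1_score(y_true, y_pred, average="micro")        # pooled TP/FP/FN
accuracy = accuracy_score(y_true, y_pred)

print(f"macro F1   : {macro:.3f}")     # 0.738
print(f"weighted F1: {weighted:.3f}")  # 0.905
print(f"micro F1   : {micro:.3f}")     # 0.911
print(f"accuracy   : {accuracy:.3f}")  # 0.911
```

Only the macro average moves noticeably when the 6-example neutral class fails, which is why it is the more revealing single number here.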

HF evaluate — shared standards across the ecosystem

Hugging Face's evaluate library exposes the same metrics through one interface, with the canonical implementations used in papers and leaderboards. Worth using when you want your numbers to be directly comparable to published ones.

import evaluate

# label name → index so the model's text outputs match what evaluate expects
labels = ["negative", "neutral", "positive"]
to_idx = {l: i for i, l in enumerate(labels)}
y_true_idx = [to_idx[y] for y in y_true]
y_pred_idx = [to_idx.get(y, -1) for y in y_pred]   # -1 for off-vocab outputs

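f1 = evaluate.load("f1")
# Average over the known classes only, so an off-vocab -1 still counts as a miss
# but does not add a phantom zero-F1 class to the macro average.
known = list(range(len(labels)))
print("macro f1   :", f1.compute(predictions=y_pred_idx, references=y_true_idx, labels=known, average="macro")["f1"])
print("weighted f1:", f1.compute(predictions=y_pred_idx, references=y_true_idx, labels=known, average="weighted")["f1"])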

acc = evaluate.load("accuracy")
print("accuracy   :", acc.compute(predictions=y_pred_idx, references=y_true_idx)["accuracy"])

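evaluate also covers non-classification tasks: rouge, bleu, squad, wer, and seqeval (for sequence labeling such as NER) all load through the same API, though some of them need an extra package installed (for example rouge_score, jiwer, or seqeval). For pure-classification work, sklearn is fine and arguably nicer to read.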

Honest beat — report the numbers that don't flatter you

For the eval above, "91% accuracy" is technically true and sounds great. "Neutral F1 = 0.40" is also true and is what an engineer who has to use this model needs to know. Report both — and lead with the per-class breakdown when classes are imbalanced. The single-number temptation is real; resist it. The same loop that makes the metric reflect reality is the loop that makes the next training round work.

Acting on what classification_report tells you

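The report is the input to the next data iteration. A class with low F1 (neutral at 0.40 here) marks where the next round of data should focus, and its row in the confusion matrix shows which confusion the new examples need to break (here, neutral predicted as positive).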
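One way to turn the report into a to-do list is to read it as a dict and flag every class under an F1 floor. This is a sketch under assumptions: the 0.80 floor and the needs_data name are illustrative choices, not part of the lesson's code, and the labels are rebuilt from the confusion-matrix counts so the snippet runs standalone.

```python
from sklearn.metrics import classification_report

labels = ["negative", "neutral", "positive"]
# counts[i][j] = examples with true label i predicted as label j
# (assumption: taken from the lesson's confusion matrix)
counts = [[12, 1, 1], [1, 2, 3], [1, 1, 68]]
y_true, y_pred = [], []
for true_label, row in zip(labels, counts):
    for pred_label, n in zip(labels, row):
        y_true += [true_label] * n
        y_pred += [pred_label] * n

# output_dict=True returns nested dicts instead of a formatted string.
report = classification_report(y_true, y_pred, output_dict=True)

F1_FLOOR = 0.80  # hypothetical threshold; tune it to your project
needs_data = [
    (name, report[name]["f1-score"], int(report[name]["support"]))
    for name in labels
    if report[name]["f1-score"] < F1_FLOOR
]
needs_data.sort(key=lambda item: item[1])  # weakest class first

for name, f1, support in needs_data:
    print(f"{name:<9} f1={f1:.3f}  support={support}")
# With these counts only neutral is flagged (F1 0.400, support 6).
```

Printing support next to F1 matters: a class flagged on 6 examples is a signal to collect more gold data as well as more training data.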

Key idea

One classification_report + one confusion_matrix tells you almost everything you need. Per-class precision / recall / F1, the macro-vs-weighted average, where the misses go. On imbalanced data, lead with macro-F1 and the per-class table; on balanced data, accuracy and weighted-F1 are fine. The metric you report shapes the data iteration that follows — choose so it points at the gap, not away from it.

Real metrics close one half of the eval gap. The other half — when your model's outputs aren't a label but a JSON object — is the next lesson: structured-output evaluation with Pydantic.

Key terms

classification_report
sklearn function returning per-class precision/recall/F1/support plus macro/weighted averages — the workhorse classification eval.
Confusion matrix
True-vs-predicted matrix; off-diagonal cells show which classes are confused for which, and in which direction.
Macro F1
Mean of per-class F1s with equal weight; penalises minority-class failures. The honest default on imbalanced data.
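Weighted F1
Mean of per-class F1s weighted by class support; dominated by the majority class. Useful when eval frequencies match production.
Micro F1
F1 computed from true positives, false positives, and false negatives pooled across all classes; equals accuracy for single-label classification.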
HF evaluate
Hugging Face library exposing standard metrics (F1, accuracy, ROUGE, BLEU, WER, SQuAD) through one interface; aligns with the shared standards used in papers.
Support
The number of true examples of a given class in the eval set; small support → noisier per-class metrics.
