What decides whether a model is promotable?

The gates declared in the eval pack

Where does the shape of the metrics come from?

The handler's score() for that task shape

What is a failure cluster?

A group of eval misses sharing a pattern

A cluster's remediation is usually…

Augment from the cluster, review, re-prepare, re-train

Track 3 · With BrewSLM · Lesson 8

Evaluate: eval packs, gates, failure clusters & remediation

After this lesson you can read an eval pack's gates as the quality bar from Track 1, interpret task-aware metrics, and use failure clusters and remediation plans to decide what data to fix — the scaled-up version of your by-hand failure analysis.

Level: intermediate · Read time: ~11 min · Prerequisites: Train: jobs, the bell, and the delta-from-baseline curve

In lesson 2.7 you ran a gold set by hand, computed accuracy, compared to the base, and read the misses to decide what data to fix. Stage 09 is that whole loop as a platform surface — with the quality gate from Track 1 made enforceable.

The eval pack and its gates

Evaluate takes the trained model, the held-out eval set (the manifest's eval.jsonl), and an eval pack. The eval pack declares the metrics to compute and the gates that decide promotability — the pass thresholds a model must clear to be shippable. This is precisely the Track 1 quality gate, now a declared object the platform enforces rather than a number you remember to check.

eval_pack:
  metrics: [accuracy, macro_f1]
  gates:
    - {metric: accuracy, min: 0.90}     # must clear to promote
    - {metric: macro_f1, min: 0.85}
  compare_to: baseline                  # report lift over the base model
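To make the gate logic concrete, here is a minimal Python sketch of what "clearing the gates" could mean for the pack above. It is not BrewSLM's implementation: the names Gate, check_gates, is_promotable and lift are hypothetical, and the metric values are made-up numbers for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    """One promotion gate: a metric name and the minimum it must reach (hypothetical type)."""
    metric: str
    minimum: float


def check_gates(metrics: dict, gates: list) -> dict:
    """Return pass/fail per gate. A metric that was never computed counts as a failure."""
    return {g.metric: metrics.get(g.metric, float("-inf")) >= g.minimum for g in gates}


def is_promotable(metrics: dict, gates: list) -> bool:
    """A model is promotable only if every declared gate passes."""
    return all(check_gates(metrics, gates).values())


def lift(metrics: dict, baseline: dict) -> dict:
    """Difference between the tuned model and the base model, per shared metric."""
    return {name: round(value - baseline[name], 4)
            for name, value in metrics.items() if name in baseline}


# Gates mirror the eval pack above; the scores are invented for the example.
GATES = [Gate("accuracy", 0.90), Gate("macro_f1", 0.85)]
tuned = {"accuracy": 0.93, "macro_f1": 0.82}
base = {"accuracy": 0.81, "macro_f1": 0.74}

print(check_gates(tuned, GATES))    # {'accuracy': True, 'macro_f1': False}
print(is_promotable(tuned, GATES))  # False
print(lift(tuned, base))            # {'accuracy': 0.12, 'macro_f1': 0.08}
```

Note that the example model beats the baseline on both metrics and still is not promotable: lift is reported, but only the gates decide.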

Task-aware metrics

You computed accuracy because yours was a classification task. BrewSLM computes whatever the handler's score() defines for the task shape — so the metric fits the task automatically:

classification   →  per-class precision / recall / F1, accuracy
extraction       →  span-set / field-match scores
RAG              →  faithfulness
alignment        →  preference margin
seq2seq          →  BLEU / ROUGE
transcription    →  WER
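As an illustration of the classification row, the sketch below shows what a score() for that task shape might compute. The signature and the returned dictionary layout are assumptions made for this example, not the real handler interface; only the metrics themselves (per-class precision / recall / F1, accuracy, macro F1) come from the lesson.

```python
from collections import Counter


def score(gold: list, predicted: list) -> dict:
    """Classification metrics from parallel lists of gold and predicted labels.

    Hypothetical signature: the real handler's score() may take different inputs.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[g] += 1  # gold class g was missed

    per_class = {}
    for label in sorted(set(gold) | set(predicted)):
        predicted_count = tp[label] + fp[label]
        gold_count = tp[label] + fn[label]
        precision = tp[label] / predicted_count if predicted_count else 0.0
        recall = tp[label] / gold_count if gold_count else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_class[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(tp.values()) / len(gold) if gold else 0.0
    # Macro F1 is the unweighted mean of per-class F1, so rare classes count equally.
    macro_f1 = (sum(c["f1"] for c in per_class.values()) / len(per_class)
                if per_class else 0.0)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "per_class": per_class}


result = score(
    gold=["pos", "pos", "neg", "neg"],
    predicted=["pos", "neg", "neg", "neg"],
)
print(result["accuracy"])  # 0.75
print(result["per_class"]["pos"])
```

A handler for another task shape would return different keys (for example a word error rate for transcription), which is why the gates in an eval pack name their metrics explicitly.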

The result is a pass rate plus task-aware metrics, emitted as an eval RunEvent parented to the experiment (exp-<id>), with named failures like eval_dataset_missing or eval_judge_unavailable.

Failure clusters — your misses, grouped

Reading individual wrong predictions (what you did by hand) doesn't scale to thousands of eval rows. BrewSLM groups the failures into failure clusters: buckets of misses that share a pattern — a confused class pair, a particular input phrasing, an edge case the data underrepresents. Instead of skimming a list, you see "37 failures: negative reviews containing the word 'but'."
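The lesson does not say how BrewSLM forms its clusters, so the sketch below uses the simplest pattern mentioned above, the confused class pair, to show the idea of turning a flat list of misses into ranked buckets. The row fields (text, gold, predicted) and the function name cluster_misses are assumptions for illustration.

```python
from collections import defaultdict


def cluster_misses(rows: list) -> list:
    """Group wrong predictions by (gold, predicted) pair, largest bucket first.

    Hypothetical helper: a real clustering step could also group by input
    phrasing or other features, which this sketch does not attempt.
    """
    clusters = defaultdict(list)
    for row in rows:
        if row["gold"] != row["predicted"]:
            clusters[(row["gold"], row["predicted"])].append(row)
    return sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)


# Invented eval rows for the example.
eval_rows = [
    {"text": "Good food but slow service", "gold": "neg", "predicted": "pos"},
    {"text": "Nice staff but never again", "gold": "neg", "predicted": "pos"},
    {"text": "Loved it", "gold": "pos", "predicted": "pos"},
    {"text": "It was fine, I guess", "gold": "pos", "predicted": "neg"},
]

for (gold, predicted), misses in cluster_misses(eval_rows):
    print(f"{len(misses)} failures: gold={gold}, predicted={predicted}")
    for miss in misses:
        print("   ", miss["text"])
```

Sorting by bucket size puts the pattern that costs the most eval rows at the top, which is usually the first place to look for missing training data.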

From Track 2

Your failure-analysis loop ("which misses tell me what data to add?") is exactly this — clusters just do the grouping so the signal is obvious. Each cluster is a concrete instruction for the next data iteration.

Remediation — close the loop

Each cluster comes with a remediation plan: the recommended fix, often "augment from this cluster". That means generating or importing more rows like the ones failing, routing them through the review queue (lesson 3.3), re-preparing, and re-training. This is the Coach-Mode augment_from_cluster action, and it's the data-centric iteration of Track 1 turned into a button.

If the model clears the gates, it's promotable. If it doesn't, and especially if more data won't fix it, the platform's next move is to ask whether fine-tuning is even the right tool, which is the next lesson.
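The decision described above can be sketched as a small function. Everything here is an assumption made for illustration: the function name next_step, the returned labels other than augment_from_cluster, and in particular the rule for "more data won't fix it", which this sketch approximates as "the previous augmentation round barely moved the metric".

```python
from typing import Optional


def next_step(gates_passed: bool,
              cluster_sizes: list,
              last_round_gain: Optional[float] = None,
              min_gain: float = 0.01) -> str:
    """Pick the next move after an eval run (hypothetical decision rule).

    gates_passed:    whether every gate in the eval pack was cleared
    cluster_sizes:   number of misses in each failure cluster
    last_round_gain: metric improvement from the previous augment-and-retrain
                     round, or None if this is the first round
    min_gain:        assumed cutoff below which more data is judged not to help
    """
    if gates_passed:
        return "promote"
    if not cluster_sizes:
        # No shared pattern among the misses, so there is nothing to augment from.
        return "reconsider_fine_tuning"
    if last_round_gain is not None and last_round_gain < min_gain:
        # The last data iteration did not help enough; question the approach.
        return "reconsider_fine_tuning"
    return "augment_from_cluster"


print(next_step(True, []))                             # promote
print(next_step(False, [37, 12]))                      # augment_from_cluster
print(next_step(False, [37, 12], last_round_gain=0.0)) # reconsider_fine_tuning
```

The cutoff is a judgment call rather than a platform constant; the point is only that augmentation is the default while it keeps paying off.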

Key idea

An eval pack turns the Track 1 quality gate into an enforceable promotability check; task-aware metrics fit the task automatically; failure clusters + remediation turn your by-hand 'read the misses' loop into grouped, actionable next steps.

Key terms

eval pack: The declared set of metrics and promotion gates evaluated against the held-out eval set.
promotability gate: A pass threshold in the eval pack a model must clear to be shippable (the Track 1 quality gate).
task-aware metrics: Metrics defined by the handler's score() to fit the task shape (F1, span-set, faithfulness, WER, …).
failure cluster: A bucket of eval misses sharing a pattern, surfaced instead of a flat list.
remediation plan: The recommended fix for a cluster — often augment_from_cluster to add similar rows and re-train.

Check yourself
