Eval pack & failure cluster reference
After this reference you can read an eval pack's gate list as a promotability decision, identify which gates are required vs informational, and recognise a failure-cluster row by its (project_id, stage, reason_code, signature) key — then map the cluster to a remediation via augment_from_cluster.
This last reference catalogues the two surfaces the evaluation lifecycle exposes: the eval pack (gates that decide whether a trained model is promotable) and the failure cluster (the grouping of similar errors that drives the data-iteration loop). Lesson 3.8's narrative explained the role; this lesson documents the schema.
Eval pack — the envelope
An eval pack is identified by id (e.g. evalpack.general.default, evalpack.domain-profile) and conforms to the Evaluation Contract v2 (slm.evaluation-pack/v2). The pack carries per-task specs and a backward-compatible top-level gate list.
{
  "id": "evalpack.general.default",
  "contract_version": "slm.evaluation-pack/v2",
  "description": "...",
  "gates": [ /* default task profile's gates */ ],
  "task_specs": [ /* per-task-profile specs (preferred) */ ],
  "default_task_profile": "instruction_sft"
}
The legacy top-level gates field still works for single-task projects; modern packs put their gates inside task_specs so different task profiles in one project can each carry their own promotion criteria.
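As a minimal sketch of how a reader of the pack might pick the gate list for one task profile: the helper name resolve_gates and the lookup order (matching task spec first, legacy top-level gates as fallback) are assumptions for illustration, not the platform's documented behaviour.

```python
from typing import Any, Optional


def resolve_gates(pack: dict[str, Any], task_profile: Optional[str] = None) -> list[dict[str, Any]]:
    """Return the gate list that applies to one task profile (hypothetical helper)."""
    # Use the pack's default profile when the caller does not name one.
    profile = task_profile or pack.get("default_task_profile")
    for spec in pack.get("task_specs", []):
        if spec.get("task_profile") == profile:
            return list(spec.get("gates", []))
    # Assumed fallback: no matching task spec, so use the legacy top-level list.
    return list(pack.get("gates", []))
```

Calling resolve_gates(pack) on a single-task pack that only fills the top-level gates field returns that list unchanged, which is one plausible reading of "still works for single-task projects".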
The gate schema (Evaluation Contract v2)
Every gate is a small dict. Four fields are required; three are optional.
{
  "gate_id": "min_macro_f1",   // stable id; defaults to f"min_{metric_id}"
  "metric_id": "macro_f1",     // the metric the gate checks
  "operator": "gte",           // "gte" (>=) or "lte" (<=); default "gte"
  "threshold": 0.90,           // the pass value
  "required": true,            // required gates block promotion
  "source": "blueprint",       // optional: where the gate came from
  "weight": 1.0                // optional: relative importance
}
- operator is normalised: anything outside {gte, lte} falls back to gte. There is no "eq", no "in", no string match; gates are scalar comparisons.
- required: true means the gate blocks promotion. A required: false gate is reported but doesn't block, which is useful for "we'd like this to hit 0.85 but don't fail the deploy."
- source is informational: it records where the gate came from (e.g. blueprint, recipe, user_override) so the audit can trace why a deploy was blocked.
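The normalisation rules can be sketched roughly as follows. The function name normalise_gate is hypothetical, and treating metric_id, threshold and required as must-be-present keys is an assumption; only the gate_id default and the operator fallback come from the schema described here.

```python
from typing import Any

VALID_OPERATORS = ("gte", "lte")


def normalise_gate(raw: dict[str, Any]) -> dict[str, Any]:
    """Apply the documented defaults to one raw gate dict (illustrative only)."""
    metric_id = raw["metric_id"]
    operator = raw.get("operator", "gte")
    if operator not in VALID_OPERATORS:
        # Anything outside {gte, lte} falls back to gte.
        operator = "gte"
    gate = {
        "gate_id": raw.get("gate_id") or f"min_{metric_id}",
        "metric_id": metric_id,
        "operator": operator,
        "threshold": float(raw["threshold"]),
        "required": bool(raw["required"]),  # assumption: no default is applied
    }
    # Optional fields are copied through only when present.
    for key in ("source", "weight"):
        if key in raw:
            gate[key] = raw[key]
    return gate
```

A gate written with an unsupported operator such as "eq" therefore comes out as a gte comparison rather than raising, which is worth knowing when reading an audit trail.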
Per-task specs (the canonical v2 form)
{
  "task_specs": [
    {
      "task_profile": "instruction_sft",
      "display_name": "Instruction SFT",
      "description": "...",
      "required_metric_ids": ["macro_f1", "exact_match"],
      "metric_schema": { "macro_f1": { /* aliases, range, ... */ } },
      "gates": [
        { "gate_id": "min_macro_f1", "metric_id": "macro_f1", "operator": "gte", "threshold": 0.90, "required": true },
        { "gate_id": "min_exact_match", "metric_id": "exact_match", "operator": "gte", "threshold": 0.85, "required": true }
      ]
    }
  ]
}
A pack with multiple task profiles supports projects whose blueprint says "we're training a classifier and a span extractor"; each profile gets its own gates and metric schema. required_metric_ids is auto-extended to include every metric referenced by a required gate, so a typo in a gate's metric name surfaces as a missing-metric error at apply time rather than as a silently un-evaluated gate.
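The auto-extension can be pictured with a small sketch. Both helper names are hypothetical, and the missing-metric check is a simplified stand-in for whatever validation the platform runs at apply time.

```python
from typing import Any


def extend_required_metric_ids(spec: dict[str, Any]) -> list[str]:
    """Add every metric referenced by a required gate to required_metric_ids."""
    required = list(spec.get("required_metric_ids", []))
    for gate in spec.get("gates", []):
        if gate.get("required") and gate["metric_id"] not in required:
            required.append(gate["metric_id"])
    return required


def missing_metrics(required_ids: list[str], metrics: dict[str, float]) -> list[str]:
    """List the required metric ids that the metrics dict does not provide."""
    return [metric_id for metric_id in required_ids if metric_id not in metrics]
```

With a gate that misspells its metric as macro_f, the extended list gains macro_f, and missing_metrics reports it because no handler emits a metric under that name.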
How gates decide promotability
At Evaluate stage time (Track 3, Lesson 3.8) the trainer's score() produces a metrics dict — actual values for the metrics the task spec declared. The decision is then mechanical:
- For every gate in the task spec, look up metrics[gate.metric_id].
- If missing, the gate is unmet.
- If present, check the operator against the threshold.
- If any required gate is unmet, the run is not promotable: the deploy stage refuses with deployment_promote_blocked (Lesson 3.12's taxonomy).
- Non-required gates are reported in the eval result but never block.
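The decision steps above can be sketched in a few lines. The function names and the shape of the returned dict are assumptions made for illustration; only the rules themselves (a missing metric is unmet, gte/lte comparison, required gates block) come from the text.

```python
from typing import Any


def gate_is_met(gate: dict[str, Any], metrics: dict[str, float]) -> bool:
    """A gate is met when its metric exists and satisfies the comparison."""
    value = metrics.get(gate["metric_id"])
    if value is None:
        return False  # a missing metric means the gate is unmet
    if gate.get("operator") == "lte":
        return value <= gate["threshold"]
    return value >= gate["threshold"]  # gte is the default operator


def decide_promotion(gates: list[dict[str, Any]], metrics: dict[str, float]) -> dict[str, Any]:
    """Evaluate every gate; only unmet required gates block promotion."""
    results = [
        {
            "gate_id": gate.get("gate_id", f"min_{gate['metric_id']}"),
            "required": bool(gate.get("required")),
            "met": gate_is_met(gate, metrics),
        }
        for gate in gates
    ]
    promotable = all(r["met"] for r in results if r["required"])
    return {"promotable": promotable, "gates": results}
```

Every gate stays in the returned list whether or not it is required, so an unmet informational gate remains visible next to the promotable flag.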
Honest reporting: required vs informational
The pack's report distinguishes PASS / FAIL / BELOW_THRESHOLD on every gate, regardless of required. A model that clears its required gates but fails an informational one still ships — and the eval result row carries the failing gate explicitly. This is the "show the gate that failed" rule the Academy keeps coming back to: don't hide informational failures inside the headline "promotable: true."
The metric_schema field
Each task spec carries a metric_schema dict mapping metric_id to a small descriptor. The descriptor's role is to make metrics self-describing in the audit:
{
  "macro_f1": {
    "aliases": ["macro_f1", "macroF1", "f1_macro"],  // accepted variants in handler output
    "display_name": "Macro F1",
    "range": [0.0, 1.0],
    "higher_is": "better"
  }
}
If a handler emits the metric under one of the listed aliases, the eval pack normalises it to the canonical metric_id before checking gates. That's how a third-party score function whose code says "f1_macro" still passes a gate written against "macro_f1".
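A minimal sketch of that alias normalisation, assuming a hypothetical helper named canonicalise_metrics and assuming that metric names not listed in any schema entry pass through unchanged:

```python
from typing import Any


def canonicalise_metrics(raw_metrics: dict[str, float],
                         metric_schema: dict[str, dict[str, Any]]) -> dict[str, float]:
    """Rename handler-emitted metric names to their canonical metric_id."""
    alias_to_id: dict[str, str] = {}
    for metric_id, descriptor in metric_schema.items():
        alias_to_id[metric_id] = metric_id
        for alias in descriptor.get("aliases", []):
            alias_to_id[alias] = metric_id
    # Assumption: names without a schema entry are kept as emitted.
    return {alias_to_id.get(name, name): value for name, value in raw_metrics.items()}
```

A handler that reports f1_macro would then be checked against a gate whose metric_id is macro_f1, because the lookup happens on the renamed key.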
FailureCluster — the row schema
Nightly clustering reads error / critical RunEvents (Lesson 3.12) and groups them. The result is a failure_clusters table with one row per distinct failure shape per project. The unique key is (project_id, stage, reason_code, signature):
id                   int               primary key
project_id           int               FK projects.id
stage                str(32)           from KNOWN_STAGES
reason_code          str(128)          from the taxonomy
signature            str(64)           hash of the canonical failure shape
failure_count        int               running count of events in this cluster
first_seen_at        datetime          when this signature first appeared
last_seen_at         datetime          updated on every new occurrence
exemplar_event_ids   json (list[int])  capped list of representative RunEvent ids
exemplar_summaries   json (list[str])  first-line summaries of those events
exemplar_run_ids     json (list[str])  the run_ids that emitted those exemplars
created_at           datetime
last_computed_at     datetime          refresh marker for clustering runs
- signature is a hash of the canonical shape of the failure; for OOM events, things like model id + batch size + sequence length. Same model + same batch + same seq → same signature → same cluster.
- failure_count grows over time; the cluster is the unit of "this keeps happening", not "this happened once."
- Exemplar lists are capped (small, not unbounded), so the cluster carries enough rows to investigate without each cluster ballooning.
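One way to picture the signature and the capped exemplars is the sketch below. The hash choice (SHA-256 over sorted-key JSON, which happens to yield 64 hex characters), the cap of 5, the in-memory dict standing in for the table, and both function names are assumptions; the source only says that the signature is a hash of the canonical failure shape and that exemplar lists are capped.

```python
import hashlib
import json
from typing import Any

EXEMPLAR_CAP = 5  # hypothetical cap; the real limit is not documented here


def failure_signature(shape: dict[str, Any]) -> str:
    """Hash a failure shape so that key order does not change the result."""
    canonical = json.dumps(shape, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_failure(clusters: dict[tuple, dict[str, Any]], project_id: int, stage: str,
                   reason_code: str, shape: dict[str, Any], event_id: int,
                   summary: str) -> tuple:
    """Count one event against its cluster, keeping only a capped exemplar list."""
    key = (project_id, stage, reason_code, failure_signature(shape))
    row = clusters.setdefault(
        key, {"failure_count": 0, "exemplar_event_ids": [], "exemplar_summaries": []}
    )
    row["failure_count"] += 1
    if len(row["exemplar_event_ids"]) < EXEMPLAR_CAP:
        row["exemplar_event_ids"].append(event_id)
        row["exemplar_summaries"].append(summary)
    return key
```

Repeated events with the same shape land on one key and only raise failure_count, while changing any field of the shape (a different batch size, say) produces a new signature and therefore a new cluster.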
Remediation: the augment_from_cluster path
The remediation contract is: a cluster of eval failures becomes a generation source. Coach Mode's augment_from_cluster action (Lesson 3.12) takes the cluster id and a suggested top-up and routes the user to a generation flow seeded by the cluster's exemplars. The flow:
- User clicks an augment_from_cluster button on a cluster card.
- The frontend calls the synthetic-data playbook with cluster_id in params.
- The playbook reads the cluster's exemplar_summaries + the underlying RunEvent payloads to paraphrase / generalise / negate the failure shape.
- The generated rows land in the synthetic dataset with review_status="pending" (Lesson 3.3).
- The user reviews; approved rows roll into the next training round.
That whole loop is the Track-3 version of "read the failures, add more data, retrain" — gated and audited rather than ad-hoc.
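The review gate at the end of that loop can be illustrated with a toy sketch. Everything here is hypothetical: the function names, the row fields other than review_status, and the number of candidates per exemplar are placeholders, and the actual text generation done by the playbook is left out.

```python
from typing import Any


def rows_from_cluster(cluster_id: int, exemplar_summaries: list[str],
                      candidates_per_exemplar: int = 2) -> list[dict[str, Any]]:
    """Create placeholder synthetic rows seeded by a cluster's exemplars."""
    rows = []
    for summary in exemplar_summaries:
        for variant in range(candidates_per_exemplar):
            rows.append({
                "source_cluster_id": cluster_id,   # hypothetical field
                "seed_summary": summary,           # hypothetical field
                "variant": variant,                # hypothetical field
                "review_status": "pending",        # generated rows start as pending
            })
    return rows


def approved_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Only rows a reviewer has approved are eligible for the next training round."""
    return [row for row in rows if row["review_status"] == "approved"]
```

Until a reviewer changes review_status, approved_rows returns nothing, which is the point of the gate: generated data does not reach training on its own.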
Key idea
Two halves of one decision loop. The eval pack declares the promotion contract — every gate is a scalar comparison with required deciding whether it blocks. The failure cluster closes the loop — distinct failure shapes are surfaced and routed through augment_from_cluster back into the data layer. Both are schemas you can grep for: gate dicts in the pack, (project_id, stage, reason_code, signature)-keyed rows in failure_clusters. That's the v2 contract.
That completes the With BrewSLM track's reference layer. Combined with the narrative lessons (3.1–3.10), you can drive the platform end-to-end and inspect any surface it presents.
Key terms
- Evaluation Contract v2: slm.evaluation-pack/v2, the eval-pack schema with per-task specs (each carrying its own gates and metric schema).
- Gate: A scalar comparison: {gate_id, metric_id, operator: gte|lte, threshold, required, source?, weight?}.
- Required gate: A gate with required: true; an unmet required gate blocks promotion (emits deployment_promote_blocked).
- Task spec: Per-task-profile section of an eval pack carrying required_metric_ids, metric_schema, and the gate list.
- FailureCluster: Row in failure_clusters, uniquely keyed on (project_id, stage, reason_code, signature); carries counts, timestamps, and capped exemplar lists.
- augment_from_cluster: Coach Mode action that opens a failure cluster as a generation source for the synthetic-data playbook.