Track 3 · With BrewSLM · Lesson 13

Eval pack & failure cluster reference

After this reference you can read an eval pack's gate list as a promotability decision, identify which gates are required vs informational, and recognise a failure-cluster row by its (project_id, stage, reason_code, signature) key — then map the cluster to a remediation via augment_from_cluster.

Level: intermediate · Read time: ~11 min · Type: Reference

This last reference catalogues the two surfaces the evaluation lifecycle exposes: the eval pack (gates that decide whether a trained model is promotable) and the failure cluster (the grouping of similar errors that drives the data-iteration loop). Lesson 3.8's narrative explained the role; this lesson documents the schema.

Eval pack — the envelope

An eval pack is identified by id (e.g. evalpack.general.default, evalpack.domain-profile) and conforms to the Evaluation Contract v2 (slm.evaluation-pack/v2). The pack carries per-task specs and a backward-compatible top-level gate list.

{
  "id":                  "evalpack.general.default",
  "contract_version":    "slm.evaluation-pack/v2",
  "description":         "...",
  "gates":               [ /* default task profile's gates */ ],
  "task_specs":          [ /* per-task-profile specs (preferred) */ ],
  "default_task_profile": "instruction_sft"
}

The legacy top-level gates field still works for single-task projects; modern packs put their gates inside task_specs so different task profiles in one project can each carry their own promotion criteria.
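A minimal sketch of how a reader of the pack might pick the gate list for one task profile. The helper name resolve_gates is hypothetical, and the precedence shown here (a matching task spec wins, the top-level gates list is the fallback) is an assumption drawn from the "preferred" and "legacy" wording above, not a documented rule.

```python
from __future__ import annotations

from typing import Any


def resolve_gates(pack: dict[str, Any], task_profile: str | None = None) -> list[dict[str, Any]]:
    """Return the gates for a task profile, falling back to the top-level list."""
    profile = task_profile or pack.get("default_task_profile")
    for spec in pack.get("task_specs", []):
        if spec.get("task_profile") == profile:
            return list(spec.get("gates", []))
    # No matching task spec: use the legacy top-level gate list.
    return list(pack.get("gates", []))


legacy_pack = {
    "id": "evalpack.general.default",
    "contract_version": "slm.evaluation-pack/v2",
    "gates": [{"metric_id": "macro_f1", "threshold": 0.90, "required": True}],
    "task_specs": [],
    "default_task_profile": "instruction_sft",
}
modern_pack = {
    **legacy_pack,
    "task_specs": [
        {
            "task_profile": "instruction_sft",
            "gates": [{"metric_id": "exact_match", "threshold": 0.85, "required": True}],
        }
    ],
}

print(resolve_gates(legacy_pack)[0]["metric_id"])
print(resolve_gates(modern_pack)[0]["metric_id"])
```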

The gate schema (Evaluation Contract v2)

Every gate is a small dict of up to seven fields. source and weight are optional, and gate_id and operator fall back to the defaults noted in the comments when omitted.

{
  "gate_id":   "min_macro_f1",        // stable id; defaults to f"min_{metric_id}"
  "metric_id": "macro_f1",            // the metric the gate checks
  "operator":  "gte",                 // "gte" (>=) or "lte" (<=); default "gte"
  "threshold": 0.90,                  // the pass value
  "required":  true,                  // required gates block promotion
  "source":    "blueprint",           // optional — where the gate came from
  "weight":    1.0                    // optional — relative importance
}
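The defaults in the comments above can be applied with a small normaliser. This is a sketch, not the platform's loader: normalise_gate is a hypothetical name, and treating metric_id, threshold, and required as mandatory keys is an assumption.

```python
from __future__ import annotations

from typing import Any

VALID_OPERATORS = {"gte", "lte"}


def normalise_gate(raw: dict[str, Any]) -> dict[str, Any]:
    """Fill in the documented defaults and reject unknown operators."""
    metric_id = raw["metric_id"]  # assumed mandatory: KeyError if absent
    operator = raw.get("operator", "gte")  # documented default
    if operator not in VALID_OPERATORS:
        raise ValueError(f"unsupported operator: {operator!r}")
    gate = {
        "gate_id": raw.get("gate_id", f"min_{metric_id}"),  # documented default
        "metric_id": metric_id,
        "operator": operator,
        "threshold": float(raw["threshold"]),
        "required": bool(raw["required"]),
    }
    # Copy the optional fields only when the author supplied them.
    for key in ("source", "weight"):
        if key in raw:
            gate[key] = raw[key]
    return gate


gate = normalise_gate({"metric_id": "macro_f1", "threshold": 0.90, "required": True})
print(gate["gate_id"], gate["operator"])
```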

Per-task specs (the canonical v2 form)

{
  "task_specs": [
    {
      "task_profile":         "instruction_sft",
      "display_name":         "Instruction SFT",
      "description":          "...",
      "required_metric_ids":  ["macro_f1", "exact_match"],
      "metric_schema":        { "macro_f1": { /* aliases, range, ... */ } },
      "gates": [
        { "gate_id": "min_macro_f1",    "metric_id": "macro_f1",    "operator": "gte", "threshold": 0.90, "required": true },
        { "gate_id": "min_exact_match", "metric_id": "exact_match", "operator": "gte", "threshold": 0.85, "required": true }
      ]
    }
  ]
}

A pack with multiple task profiles supports projects whose blueprint says "we're training a classifier and a span extractor"; each profile gets its own gates and metric schema. required_metric_ids is auto-extended to include every metric referenced by a required gate, so a typo in a gate's metric name surfaces as a missing-metric error at apply time instead of leaving the gate silently unevaluated.
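The auto-extension can be pictured as follows. Both helper names are hypothetical, and the misspelled metric exact_mtach is deliberate: it shows how a typo in a required gate ends up in the required list and is then reported as missing.

```python
from __future__ import annotations

from typing import Any


def extend_required_metric_ids(spec: dict[str, Any]) -> list[str]:
    """Add every metric referenced by a required gate to required_metric_ids."""
    required = list(spec.get("required_metric_ids", []))
    for gate in spec.get("gates", []):
        if gate.get("required") and gate["metric_id"] not in required:
            required.append(gate["metric_id"])
    return required


def missing_metrics(required_ids: list[str], metrics: dict[str, float]) -> list[str]:
    """Return the required metric ids that the metrics dict does not contain."""
    return [metric_id for metric_id in required_ids if metric_id not in metrics]


spec = {
    "task_profile": "instruction_sft",
    "required_metric_ids": ["macro_f1"],
    "gates": [
        {"metric_id": "macro_f1", "threshold": 0.90, "required": True},
        {"metric_id": "exact_mtach", "threshold": 0.85, "required": True},  # typo on purpose
    ],
}
required_ids = extend_required_metric_ids(spec)
missing = missing_metrics(required_ids, {"macro_f1": 0.93, "exact_match": 0.88})
print(required_ids, missing)
```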

How gates decide promotability

When the Evaluate stage runs (Track 3, Lesson 3.8), the trainer's score() produces a metrics dict holding the actual values for the metrics the task spec declared. The decision is then mechanical:

  1. For every gate in the task spec, look up metrics[gate.metric_id].
  2. If missing — gate is unmet.
  3. If present — check the operator against the threshold.
  4. If any required gate is unmet, the run is not promotable — the deploy stage refuses with deployment_promote_blocked (Lesson 3.12's taxonomy).
  5. Non-required gates are reported in the eval result but never block.
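The five steps above can be sketched in a few lines. The function name evaluate_gates and the result-row shape are assumptions made for illustration, as is the informational latency_ms gate; only the gte/lte comparison and the "required blocks, non-required reports" rule come from this lesson.

```python
from __future__ import annotations

import operator as op
from typing import Any

COMPARATORS = {"gte": op.ge, "lte": op.le}


def evaluate_gates(gates: list[dict[str, Any]], metrics: dict[str, float]) -> tuple[bool, list[dict[str, Any]]]:
    """Check every gate; the run is promotable only if all required gates are met."""
    results = []
    for gate in gates:
        value = metrics.get(gate["metric_id"])  # step 1: look up the metric
        compare = COMPARATORS[gate.get("operator", "gte")]
        # Steps 2-3: a missing metric is unmet; otherwise compare to the threshold.
        met = value is not None and compare(value, gate["threshold"])
        results.append({
            "gate_id": gate.get("gate_id", f"min_{gate['metric_id']}"),
            "required": bool(gate.get("required")),
            "value": value,
            "met": met,
        })
    # Steps 4-5: only required gates can block promotion.
    promotable = all(row["met"] for row in results if row["required"])
    return promotable, results


gates = [
    {"gate_id": "min_macro_f1", "metric_id": "macro_f1", "operator": "gte", "threshold": 0.90, "required": True},
    {"gate_id": "min_exact_match", "metric_id": "exact_match", "operator": "gte", "threshold": 0.85, "required": True},
    {"gate_id": "max_latency_ms", "metric_id": "latency_ms", "operator": "lte", "threshold": 200, "required": False},
]
promotable, results = evaluate_gates(gates, {"macro_f1": 0.93, "exact_match": 0.88})
blocked, _ = evaluate_gates(gates, {"macro_f1": 0.93, "exact_match": 0.80})
print(promotable, blocked)
```

In the first call the informational latency gate is unmet because its metric is missing, yet the run stays promotable; its row is still present in results so the failure remains visible.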

Honest reporting: required vs informational

The pack's report distinguishes PASS / FAIL / BELOW_THRESHOLD on every gate, regardless of required. A model that clears its required gates but fails an informational one still ships — and the eval result row carries the failing gate explicitly. This is the "show the gate that failed" rule the Academy keeps coming back to: don't hide informational failures inside the headline "promotable: true."

The metric_schema field

Each task spec carries a metric_schema dict mapping metric_id to a small descriptor. The descriptor's role is to make metrics self-describing in the audit:

{
  "macro_f1": {
    "aliases":     ["macro_f1", "macroF1", "f1_macro"],   // accepted variants in handler output
    "display_name":"Macro F1",
    "range":       [0.0, 1.0],
    "higher_is":   "better"
  }
}

If a handler emits the metric under one of the listed aliases, the eval pack normalises it to the canonical metric_id before checking gates. That's how a third-party score function whose code says "f1_macro" still passes a gate written against "macro_f1".
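A sketch of that alias normalisation, assuming a hypothetical normalise_metrics helper. Passing unrecognised metric names through unchanged is an assumption; the real pack may reject or drop them.

```python
from __future__ import annotations

from typing import Any


def normalise_metrics(raw_metrics: dict[str, float], metric_schema: dict[str, dict[str, Any]]) -> dict[str, float]:
    """Rename aliased metric keys to their canonical metric_id."""
    alias_to_canonical = {}
    for metric_id, descriptor in metric_schema.items():
        alias_to_canonical[metric_id] = metric_id
        for alias in descriptor.get("aliases", []):
            alias_to_canonical[alias] = metric_id
    # Names not covered by the schema are kept as-is (assumption).
    return {alias_to_canonical.get(name, name): value for name, value in raw_metrics.items()}


metric_schema = {
    "macro_f1": {
        "aliases": ["macro_f1", "macroF1", "f1_macro"],
        "display_name": "Macro F1",
        "range": [0.0, 1.0],
        "higher_is": "better",
    }
}
normalised = normalise_metrics({"f1_macro": 0.91, "exact_match": 0.88}, metric_schema)
print(normalised)
```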

FailureCluster — the row schema

Nightly clustering reads error / critical RunEvents (Lesson 3.12) and groups them. The result is a failure_clusters table with one row per distinct failure shape per project. The unique key is (project_id, stage, reason_code, signature):

id                  int                primary key
project_id          int                FK projects.id
stage               str(32)            from KNOWN_STAGES
reason_code         str(128)           from the taxonomy
signature           str(64)            hash of the canonical failure shape
failure_count       int                running count of events in this cluster
first_seen_at       datetime           when this signature first appeared
last_seen_at        datetime           updated on every new occurrence
exemplar_event_ids  json (list[int])   capped list of representative RunEvent ids
exemplar_summaries  json (list[str])   first-line summaries of those events
exemplar_run_ids    json (list[str])   the run_ids that emitted those exemplars
created_at          datetime
last_computed_at    datetime           refresh marker for clustering runs
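The unique key and the running counters behave like an upsert. The in-memory sketch below covers a subset of the columns; the event dict shape, the cap of 5 exemplars, the "deploy" stage name, and the placeholder signature are all assumptions, and the real table is presumably maintained by the nightly clustering job rather than by code like this.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Any

EXEMPLAR_CAP = 5  # hypothetical cap; the lesson only says the list is capped


@dataclass
class FailureCluster:
    project_id: int
    stage: str
    reason_code: str
    signature: str
    failure_count: int = 0
    first_seen_at: datetime | None = None
    last_seen_at: datetime | None = None
    exemplar_event_ids: list[int] = field(default_factory=list)


ClusterKey = tuple[int, str, str, str]


def record_failure(clusters: dict[ClusterKey, FailureCluster], event: dict[str, Any]) -> FailureCluster:
    """Insert or update the cluster identified by the four-part unique key."""
    key = (event["project_id"], event["stage"], event["reason_code"], event["signature"])
    cluster = clusters.get(key)
    if cluster is None:
        cluster = FailureCluster(*key, first_seen_at=event["created_at"])
        clusters[key] = cluster
    cluster.failure_count += 1
    cluster.last_seen_at = event["created_at"]
    if len(cluster.exemplar_event_ids) < EXEMPLAR_CAP:
        cluster.exemplar_event_ids.append(event["id"])
    return cluster


clusters: dict[ClusterKey, FailureCluster] = {}
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
for i in range(7):
    record_failure(clusters, {
        "id": i, "project_id": 1, "stage": "deploy",
        "reason_code": "deployment_promote_blocked", "signature": "sig-a",
        "created_at": start + timedelta(minutes=i),
    })
record_failure(clusters, {
    "id": 99, "project_id": 1, "stage": "deploy",
    "reason_code": "deployment_promote_blocked", "signature": "sig-b",
    "created_at": start,
})
main = clusters[(1, "deploy", "deployment_promote_blocked", "sig-a")]
print(len(clusters), main.failure_count, main.exemplar_event_ids)
```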

Remediation: the augment_from_cluster path

The remediation contract is: a cluster of eval failures becomes a generation source. Coach Mode's augment_from_cluster action (Lesson 3.12) takes the cluster id and a suggested top-up and routes the user to a generation flow seeded by the cluster's exemplars. The flow:

  1. User clicks an augment_from_cluster button on a cluster card.
  2. The frontend calls the synthetic-data playbook with cluster_id in params.
  3. The playbook reads the cluster's exemplar_summaries + the underlying RunEvent payloads to paraphrase / generalise / negate the failure shape.
  4. The generated rows land in the synthetic dataset with review_status="pending" (Lesson 3.3).
  5. The user reviews; approved rows roll into the next training round.
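Steps 4 and 5 amount to a review filter in front of the next training round. The sketch below is illustrative only: the row shape, the source_cluster_id field, and the "approved" and "rejected" status values are assumptions; only review_status="pending" appears in this lesson.

```python
from __future__ import annotations

from typing import Any


def rows_for_next_round(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Keep only the generated rows that a reviewer has approved."""
    return [row for row in rows if row.get("review_status") == "approved"]


generated = [
    {"id": 1, "source_cluster_id": 42, "review_status": "pending"},
    {"id": 2, "source_cluster_id": 42, "review_status": "approved"},
    {"id": 3, "source_cluster_id": 42, "review_status": "rejected"},
]
print([row["id"] for row in rows_for_next_round(generated)])
```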

That whole loop is the Track-3 version of "read the failures, add more data, retrain" — gated and audited rather than ad-hoc.

Key idea

Two halves of one decision loop. The eval pack declares the promotion contract — every gate is a scalar comparison with required deciding whether it blocks. The failure cluster closes the loop — distinct failure shapes are surfaced and routed through augment_from_cluster back into the data layer. Both are schemas you can grep for: gate dicts in the pack, (project_id, stage, reason_code, signature)-keyed rows in failure_clusters. That's the v2 contract.

That completes the With BrewSLM track's reference layer. Combined with the narrative lessons (3.1–3.10), you can drive the platform end-to-end and inspect any surface it presents.

Key terms

Evaluation Contract v2
slm.evaluation-pack/v2 — the eval-pack schema with per-task specs (each carrying its own gates and metric schema).
Gate
A scalar comparison: {gate_id, metric_id, operator: gte|lte, threshold, required, source?, weight?}.
Required gate
A gate with required: true; an unmet required gate blocks promotion (emits deployment_promote_blocked).
Task spec
Per-task-profile section of an eval pack carrying required_metric_ids, metric_schema, and the gate list.
FailureCluster
Row in failure_clusters, uniquely keyed on (project_id, stage, reason_code, signature); carries counts, timestamps, and capped exemplar lists.
augment_from_cluster
Coach Mode action that opens a failure cluster as a generation source for the synthetic-data playbook.
