How many canonical RunEvent stages are there?

Nine: ingestion, cleaning, adapter, training, eval, export, deployment, autopilot, system.

What are the three Coach Mode action kinds?

navigate (route the user to a panel), run_playbook (trigger a synthetic-data or transformation playbook), augment_from_cluster (open a failure cluster as a data-generation source).

Track 3 · With BrewSLM · Lesson 12

RunEvent taxonomy & Coach Mode catalogue

After this reference you can recognise any RunEvent on a project's timeline — stage, severity, reason code, payload — and you know the Coach Mode contract: which workflow stages it speaks at, what severity each suggestion can carry, and what the three action kinds do when clicked.

Level: intermediate · Read time: ~12 min · Type: Reference

BrewSLM's audit spine has two reference surfaces a developer reads daily: the RunEvent stream (what already happened) and Coach Mode (what to do next). Both are stable contracts the platform commits to. This catalogue documents them as they exist in the codebase — schema, taxonomy, action kinds — so the labels you see in the UI map straight to fields you can grep for.

The RunEvent row

One canonical row per "interesting thing happened" across every stage. The schema is append-only — corrections are new rows, never updates. Every row carries:

id              int    primary key (autoincrement)
project_id      int    FK projects.id
run_id          str    "<stage>-<id>" by convention (exp-42, deploy-7, autopilot-{hex})
parent_run_id   str?   parent op when nested (training under autopilot)
stage           str    one of nine canonical values (below)
severity        str    info | warning | error | critical
reason_code     str?   stable code from the taxonomy (required when error/critical)
actor           str    "system" by default; user-id when an operator triggered the event
summary         str?   one-line human readable; UI uses as the row label
payload         json   structured details, free-form per emitter
ts              datetime  wall-clock time of the event
created_at      datetime  insert time (differs from ts for back-dated replays)
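
The columns above can be mirrored in a small Python type. This is a minimal sketch, not BrewSLM's actual model: the class name RunEventRow and the helper append_correction are hypothetical, and the id assignment only stands in for the database autoincrement. It shows the append-only rule in practice: a correction becomes a new row while the original stays untouched.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace
from datetime import datetime, timezone
from typing import Any, Optional


@dataclass(frozen=True)  # frozen: a row is never mutated after it is written
class RunEventRow:
    """Illustrative mirror of the columns above; not BrewSLM's actual model."""

    id: int
    project_id: int
    run_id: str
    stage: str
    severity: str
    ts: datetime
    created_at: datetime
    parent_run_id: Optional[str] = None
    reason_code: Optional[str] = None
    actor: str = "system"
    summary: Optional[str] = None
    payload: dict[str, Any] = field(default_factory=dict)


def append_correction(
    log: list[RunEventRow], original: RunEventRow, **changes: Any
) -> RunEventRow:
    """Append a corrected copy as a NEW row; the original row stays in the log."""
    corrected = replace(
        original,
        id=max(row.id for row in log) + 1,  # stand-in for autoincrement
        created_at=datetime.now(timezone.utc),  # insert time of the correction
        **changes,
    )
    log.append(corrected)
    return corrected


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
log = [
    RunEventRow(id=1, project_id=7, run_id="exp-42", stage="training",
                severity="info", ts=t0, created_at=t0,
                summary="training completd"),
]
fixed = append_correction(log, log[0], summary="training completed")
print(len(log), log[0].summary, "->", fixed.summary)
```

The corrected row keeps the original ts but gets a fresh created_at, which is the same ts/created_at split the schema describes for back-dated replays.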

The nine canonical stages

Every emitter MUST tag its event with one of nine string constants. New stages are added by extending the lint-gated KNOWN_STAGES set — never by emitting an ad-hoc string.

ingestion     data import / annotation / teacher capture
cleaning      cleaning rules / PII scanning / outlier removal
adapter       Adapter Studio (data-transformation contracts)
training      training runs (SFT, KD)
eval          evaluation runs against eval packs
export        artifact bundling + quantization
deployment    deploy / promote / rollback / drift
autopilot     newbie / strict-mode autopilot orchestration
system        DB / config / extension load (catch-all)
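
A guard over the nine constants might look like the following. Treat it as a sketch under assumptions: the text names KNOWN_STAGES, but its real definition lives in the codebase, and both require_known_stage and the error text are hypothetical.

```python
# Assumption: mirrors the lint-gated KNOWN_STAGES set named in the text;
# the real definition in the BrewSLM codebase may take a different form.
KNOWN_STAGES = frozenset({
    "ingestion", "cleaning", "adapter", "training", "eval",
    "export", "deployment", "autopilot", "system",
})


def require_known_stage(stage: str) -> str:
    """Return the stage unchanged, or raise if it is an ad-hoc string."""
    if stage not in KNOWN_STAGES:
        # Illustrative error text, not a documented message.
        raise ValueError(f"unknown stage: {stage!r}")
    return stage


print(require_known_stage("eval"))
```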

The four severities

info — normal lifecycle ("training completed", "rows imported"). The vast majority of RunEvents.
warning — soft issue worth surfacing but not blocking ("plan profile fell back to safe", "judge degraded").
error — operation failed; must carry a reason code from the taxonomy.
critical — pager-worthy. Reserved; emit sparingly. Must carry a reason code.
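
When you filter a timeline it can help to treat the four severities as ordered. The sketch below assumes an ascending order of urgency (info, warning, error, critical), which is inferred from the descriptions above rather than stated by the platform; Severity, requires_reason_code, and at_least are hypothetical names.

```python
from __future__ import annotations

from enum import IntEnum


class Severity(IntEnum):
    # Assumption: ascending urgency, inferred from the descriptions above.
    INFO = 0
    WARNING = 1
    ERROR = 2
    CRITICAL = 3


def parse_severity(value: str) -> Severity:
    """Map the stored lowercase string to the enum; KeyError on unknown values."""
    return Severity[value.upper()]


def requires_reason_code(value: str) -> bool:
    """error and critical must carry a reason code; info and warning need not."""
    return parse_severity(value) >= Severity.ERROR


def at_least(events: list[dict], floor: str) -> list[dict]:
    """Keep events whose severity is at or above the given floor."""
    minimum = parse_severity(floor)
    return [e for e in events if parse_severity(e["severity"]) >= minimum]


timeline = [
    {"summary": "rows imported", "severity": "info"},
    {"summary": "judge degraded", "severity": "warning"},
    {"summary": "run failed", "severity": "error"},
]
print([e["summary"] for e in at_least(timeline, "warning")])
```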

The reason-code taxonomy, by stage

Every error or critical RunEvent must set a reason_code from the canonical taxonomy. An event that carries an unknown code is rejected at the service boundary (invalid_reason_code:<value>). The taxonomy is grouped by stage; entries marked INFO are codes for normal lifecycle events rather than failures:

ingestion

ingest_unsupported_format         file extension not in the supported set
ingest_io_error                   disk write/read failure during ingest
ingest_validation_failed          row failed schema or content validation
dataset_import_run                INFO — generic import pipeline wrote rows
                                  (payload: source, locator, mapper,
                                  accepted/rejected counts, written_path, config_id)
dataset_import_failed             import pipeline raised before any rows written
annotation_job_created            INFO — label job created (payload: job_id, name, label_type, target_rows)
annotation_label_submitted        INFO — reviewer submitted a single label
annotation_rows_promoted          INFO — labeled rows materialized to a downstream dataset
distillation_teacher_capture      INFO — Track 4 slice 1 — teacher top-k logprobs captured

cleaning

cleaning_outlier_threshold_exceeded   outlier removal removed more rows than the configured cap
cleaning_pii_block                    PII / safety scan blocked the dataset from advancing

adapter

adapter_schema_mismatch        declared input/output schema could not be matched to the data
adapter_field_resolution_failed adapter could not resolve a field mapping (missing column, wrong type)

training

training_dispatch_error    failure dispatching the run to the runtime backend
training_runtime_error     generic runtime failure inside the training loop
training_oom               GPU out-of-memory during training
training_timeout           run exceeded its wallclock budget
training_cancelled         operator (or upstream signal) cancelled the run

eval

eval_runtime_error      generic failure inside the evaluation runner
eval_dataset_missing    eval pack referenced a dataset that no longer exists
eval_judge_unavailable  LLM-as-judge call failed (provider down / quota / config)

export

export_run_failed             generic export failure (artifact build, manifest write, smoke check)
export_artifact_missing       required model / tokenizer artifact missing at export time
export_quantization_failed    quantization step exited non-zero or produced an invalid artifact

deployment

deployment_smoke_failed              post-deploy smoke check failed against the live endpoint
deployment_promote_blocked           promote refused (status not promotable / readiness gate failed)
deployment_rollback_no_predecessor   rollback refused because no superseded predecessor exists
deployment_drift_detected            drift check found pass-rate delta beyond tolerance vs baseline

autopilot

autopilot_repair_blocked        autopilot refused a repair (strict mode or unsafe action)
autopilot_strict_mode_refused   strict mode rejected an auto-repair that would otherwise have applied
autopilot_no_safe_plan          autopilot could not construct a safe plan from the current intent

system

system_db_error              unexpected database error from a service-internal write path
system_config_invalid        required configuration value missing or malformed at runtime
extension_load_failed        plugin module import / register hook raised
extension_contract_invalid   plugin module failed one or more contract checks
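
The rejection rule and the by-stage grouping can be sketched as a lookup. The mapping below is only an excerpt copied from the lists above, and the function names are hypothetical; the text gives the rejection string invalid_reason_code:<value> but not how the service raises it, so the ValueError is an assumption.

```python
from typing import Optional

# Excerpt only: a few codes per stage, copied from the taxonomy above.
REASON_CODES_BY_STAGE = {
    "ingestion": {"ingest_io_error", "dataset_import_failed"},
    "training": {"training_oom", "training_timeout"},
    "eval": {"eval_judge_unavailable"},
    "deployment": {"deployment_drift_detected"},
    "system": {"system_db_error", "extension_load_failed"},
}
KNOWN_CODES_EXCERPT = frozenset().union(*REASON_CODES_BY_STAGE.values())


def check_reason_code(code: str) -> str:
    """Return the code if known; otherwise reject it in the documented format."""
    if code not in KNOWN_CODES_EXCERPT:
        # Assumption: the real service may signal this differently.
        raise ValueError(f"invalid_reason_code:{code}")
    return code


def stage_for_code(code: str) -> Optional[str]:
    """Find the stage group a code belongs to, or None if it is not listed."""
    for stage, codes in REASON_CODES_BY_STAGE.items():
        if code in codes:
            return stage
    return None


print(stage_for_code("extension_load_failed"))
```

A lookup is safer than splitting on the prefix, because some codes (extension_load_failed, dataset_import_failed) do not start with their stage name.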

Stable vs evolving

The nine stages and four severities are stable: they change only with a major contract bump. The reason-code list grows as new failure modes surface (each addition is a one-line change to the taxonomy file), and codes are never silently retired. Treat this as the catalogue as of the current contract version; KNOWN_REASON_CODES in the codebase is the runtime source of truth.

Coach Mode — the contract

Coach Mode is a per-stage suggestion service. The UI mounts a CoachStrip per panel that calls GET /api/projects/{id}/coach/{stage} and renders the suggestions returned with click-to-execute action buttons.

The five workflow stages Coach speaks at

data         dataset import / Adapter Studio / synthetic playbook activity
cleaning     cleaning rules and review queue
gold_set     gold-set status and top-up suggestions
training     training config + preflight + active runs
eval         eval pack + failure clusters + remediation
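
A client that calls the Coach endpoint needs a valid stage in the path. The helper below is a minimal sketch: the route comes from the text, while COACH_STAGES as a Python tuple, the function name coach_path, and the error text are assumptions.

```python
from urllib.parse import quote

# The five Coach stages listed above. Note this is a different set from the
# nine RunEvent stages: "adapter" and "deployment" are not Coach stages.
COACH_STAGES = ("data", "cleaning", "gold_set", "training", "eval")


def coach_path(project_id: int, stage: str) -> str:
    """Build the path for GET /api/projects/{id}/coach/{stage}."""
    if stage not in COACH_STAGES:
        # Illustrative error text, not a documented message.
        raise ValueError(f"not a Coach stage: {stage!r}")
    return f"/api/projects/{int(project_id)}/coach/{quote(stage, safe='')}"


print(coach_path(3, "gold_set"))
```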

The three severity levels

Each suggestion carries a severity the UI uses to colour the strip consistently:

info — informational ("here's a good next step").
warning — soft ("forecast says this run is borderline; consider topping up the gold set").
critical — urgent ("your gold set has 12 rows — most useful first models need at least 100").

The canonical suggestion shape

{
  "severity": "critical",
  "title":    "Your gold set has 78 rows",
  "body":     "Most useful first models need at least 100 rows of labeled examples ...",
  "action":   { /* one of the three action kinds */ }
}

Generators are read-only — they never mutate project state. Side effects happen only when the user clicks the action button.
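
If you consume suggestions outside the UI, a light shape check catches drift early. This sketch assumes all four keys of the canonical shape are required, which the text does not state outright; suggestion_problems and its problem strings are hypothetical.

```python
from __future__ import annotations

from typing import Any

# Coach uses three severities; "error" exists only on RunEvents.
COACH_SEVERITIES = ("info", "warning", "critical")
# Assumption: every key in the canonical shape is required.
REQUIRED_KEYS = ("severity", "title", "body", "action")


def suggestion_problems(suggestion: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the shape looks right."""
    problems = [f"missing:{key}" for key in REQUIRED_KEYS if key not in suggestion]
    severity = suggestion.get("severity")
    if severity is not None and severity not in COACH_SEVERITIES:
        problems.append(f"bad_severity:{severity}")
    return problems


sample = {"severity": "error", "title": "Example title", "body": "Example body"}
print(suggestion_problems(sample))
```

The check only reads the dict it is given, in keeping with the read-only rule for generators.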

The three action kinds

Every action carries a kind, a label, and a params dict. The frontend routes the click on kind alone.

{
  "kind":   "navigate",
  "label":  "Pick a recipe first",
  "params": { "target": "recipe-picker" }
}

navigate — route the user to a panel. params.target identifies the panel (e.g. recipe-picker, training-config). Used when the next step is "go look at this."
run_playbook — trigger a recipe-aware synthetic-data playbook in the background. params carries the playbook mode (e.g. positives_paraphrase) and a target_count. Used for "I'll generate the rows for you" actions.
augment_from_cluster — open a failure cluster as a generation source. params carries the cluster id and a suggested top-up. Used after an eval reveals a cluster of similar failures (Lesson 3.13).
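
Routing on kind alone can be sketched as a handler table. The real frontend is not shown in this reference, so everything here is illustrative: the handlers just return a description string, and the params keys mode, cluster_id, and suggested_topup are assumed names (only target and target_count appear in the text).

```python
from __future__ import annotations

from typing import Any, Callable

Params = dict[str, Any]


def on_navigate(params: Params) -> str:
    return f"open panel {params['target']}"


def on_run_playbook(params: Params) -> str:
    # "mode" is an assumed key name for the playbook mode.
    return f"start playbook {params['mode']} for {params['target_count']} rows"


def on_augment_from_cluster(params: Params) -> str:
    # "cluster_id" and "suggested_topup" are assumed key names.
    return f"open cluster {params['cluster_id']} with top-up {params['suggested_topup']}"


HANDLERS: dict[str, Callable[[Params], str]] = {
    "navigate": on_navigate,
    "run_playbook": on_run_playbook,
    "augment_from_cluster": on_augment_from_cluster,
}


def dispatch(action: dict[str, Any]) -> str:
    """Pick the handler from action['kind'] only; label is display text."""
    handler = HANDLERS.get(action["kind"])
    if handler is None:
        raise ValueError(f"unknown action kind: {action['kind']!r}")
    return handler(action.get("params", {}))


print(dispatch({
    "kind": "navigate",
    "label": "Pick a recipe first",
    "params": {"target": "recipe-picker"},
}))
```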

Concrete thresholds you'll see in suggestions

Some Coach generators use named numeric thresholds. For example, the gold-set generator currently uses:

GOLD_ROW_THIN_MAX             99    # <= thin (severity bumps to "critical")
GOLD_ROW_COMFORTABLE_MIN      300   # target the suggested top-up tries to reach
SUGGESTED_TOPUP_FLOOR         20    # don't suggest below this
SUGGESTED_TOPUP_CEILING       500   # don't suggest above this

Treat these as representative, not part of the public contract — they tune over time as the platform learns better defaults. The stable surface is the severity transitions and the action kinds.
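
To see how the four constants could interact, here is one possible reading, offered as a sketch only. The constants are copied from above; the arithmetic is an assumption, since the text does not give the generator's formula. It assumes the top-up is the gap to GOLD_ROW_COMFORTABLE_MIN clamped between the floor and the ceiling, that no top-up is suggested once the target is met, and that counts between the two thresholds map to "warning".

```python
GOLD_ROW_THIN_MAX = 99
GOLD_ROW_COMFORTABLE_MIN = 300
SUGGESTED_TOPUP_FLOOR = 20
SUGGESTED_TOPUP_CEILING = 500


def gold_set_severity(rows: int) -> str:
    """Documented: thin sets are critical. The other two bands are assumed."""
    if rows <= GOLD_ROW_THIN_MAX:
        return "critical"
    if rows < GOLD_ROW_COMFORTABLE_MIN:
        return "warning"  # assumption
    return "info"  # assumption


def suggested_topup(rows: int) -> int:
    """Assumed formula: gap to the comfortable target, clamped to floor/ceiling."""
    gap = GOLD_ROW_COMFORTABLE_MIN - rows
    if gap <= 0:
        return 0  # assumption: nothing to suggest once the target is met
    return max(SUGGESTED_TOPUP_FLOOR, min(SUGGESTED_TOPUP_CEILING, gap))


for rows in (12, 100, 290, 300):
    print(rows, gold_set_severity(rows), suggested_topup(rows))
```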

Key idea

Two surfaces, one audit spine. RunEvent records what happened with a stable (stage, severity, reason_code) tuple and a JSON payload. Coach Mode tells you what to do next with a (stage, severity, action) shape and three action kinds: navigate, run_playbook, augment_from_cluster. Both are contracts: stages, severities, and action kinds are stable, while the reason-code list grows and threshold values are tuned as the platform evolves.

The last reference catalogues the consumer that closes the loop on these events: the eval pack's gate schema and the failure-cluster surface that augment_from_cluster opens.

Key terms

RunEvent: The canonical audit-row schema. Append-only; one row per "interesting thing happened" across all nine stages.
RunEvent stage: One of nine canonical strings (ingestion, cleaning, adapter, training, eval, export, deployment, autopilot, system).
RunEvent severity: info / warning / error / critical. Errors and criticals must carry a reason code.
Reason code: Lint-gated string from the canonical taxonomy that names a failure mode (e.g. training_oom, deployment_drift_detected).
Coach stage: One of five workflow surfaces Coach speaks at: data, cleaning, gold_set, training, eval.
Coach action kind: One of three: navigate (route), run_playbook (trigger generation), augment_from_cluster (open a failure cluster as a data source).
