RunEvent taxonomy & Coach Mode catalogue
After this reference you can recognise any RunEvent on a project's timeline — stage, severity, reason code, payload — and you know the Coach Mode contract: which workflow stages it speaks at, what severity each suggestion can carry, and what the three action kinds do when clicked.
BrewSLM's audit spine has two reference surfaces a developer reads daily: the RunEvent stream (what already happened) and Coach Mode (what to do next). Both are stable contracts the platform commits to. This catalogue documents them as they exist in the codebase — schema, taxonomy, action kinds — so the labels you see in the UI map straight to fields you can grep for.
The RunEvent row
One canonical row per "interesting thing happened" across every stage. The schema is append-only — corrections are new rows, never updates. Every row carries:
Field          Type      Description
id             int       primary key (autoincrement)
project_id     int       FK projects.id
run_id         str       "<stage>-<id>" by convention (exp-42, deploy-7, autopilot-{hex})
parent_run_id  str?      parent op when nested (training under autopilot)
stage          str       one of nine canonical values (below)
severity       str       info | warning | error | critical
reason_code    str?      stable code from the taxonomy (required when error/critical)
actor          str       "system" by default; user-id when an operator triggered the event
summary        str?      one-line human readable; UI uses as the row label
payload        json      structured details, free-form per emitter
ts             datetime  wall-clock time of the event
created_at     datetime  insert time (differs from ts for back-dated replays)
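As a reading aid, the row can be modelled as a small frozen dataclass with an append-only log around it. This is a minimal sketch, not BrewSLM's actual model: the names `RunEventRow` and `EventLog`, the in-memory list, and the `corrects_event_id` payload key are assumptions made for illustration. Only the field names and types come from the schema above.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)  # frozen: a row is never modified once written
class RunEventRow:
    id: int
    project_id: int
    run_id: str
    stage: str
    severity: str
    ts: datetime
    created_at: datetime
    parent_run_id: Optional[str] = None
    reason_code: Optional[str] = None
    actor: str = "system"
    summary: Optional[str] = None
    payload: Dict[str, Any] = field(default_factory=dict)


class EventLog:
    """Hypothetical in-memory stand-in for the append-only table."""

    def __init__(self) -> None:
        self._rows: List[RunEventRow] = []

    def append(self, *, ts: Optional[datetime] = None, **fields: Any) -> RunEventRow:
        now = datetime.now(timezone.utc)
        # ts may be back-dated by the caller; created_at is always the insert time.
        row = RunEventRow(id=len(self._rows) + 1, ts=ts or now, created_at=now, **fields)
        self._rows.append(row)
        return row

    def rows(self) -> Tuple[RunEventRow, ...]:
        return tuple(self._rows)


log = EventLog()
log.append(project_id=1, run_id="exp-42", stage="training", severity="info",
           summary="training completed")
# A correction is a new row pointing back at the earlier one, never an update.
# The "corrects_event_id" key is an assumed convention, not a documented field.
log.append(project_id=1, run_id="exp-42", stage="training", severity="info",
           summary="correction: final loss re-reported",
           payload={"corrects_event_id": 1})
```

The log exposes no update or delete method, which is the point: the only way to change the record is to add to it.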
The nine canonical stages
Every emitter MUST tag its event with one of nine string constants. New stages are added by extending the lint-gated KNOWN_STAGES set — never by emitting an ad-hoc string.
ingestion data import / annotation / teacher capture
cleaning cleaning rules / PII scanning / outlier removal
adapter Adapter Studio (data-transformation contracts)
training training runs (SFT, KD)
eval evaluation runs against eval packs
export artifact bundling + quantization
deployment deploy / promote / rollback / drift
autopilot newbie / strict-mode autopilot orchestration
system DB / config / extension load (catch-all)
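The rule that emitters may only use one of these nine strings can be illustrated with a membership check. `KNOWN_STAGES` is the name the text uses; the helper `require_known_stage` and its error message are hypothetical, and the real lint gate is not shown here.

```python
# The nine canonical stage strings from the table above.
KNOWN_STAGES = frozenset({
    "ingestion", "cleaning", "adapter", "training", "eval",
    "export", "deployment", "autopilot", "system",
})


def require_known_stage(stage: str) -> str:
    """Return the stage unchanged, or raise if it is not a canonical value.

    Hypothetical helper: the function name and message are illustrative only.
    """
    if stage not in KNOWN_STAGES:
        raise ValueError(f"unknown stage: {stage!r}")
    return stage


require_known_stage("eval")            # passes
# require_known_stage("evaluation")    # would raise ValueError: ad-hoc string
```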
The four severities
- info: normal lifecycle ("training completed", "rows imported"). The vast majority of RunEvents.
- warning: soft issue worth surfacing but not blocking ("plan profile fell back to safe", "judge degraded").
- error: operation failed; must carry a reason code from the taxonomy.
- critical: pager-worthy. Reserved; emit sparingly. Must carry a reason code.
The reason-code taxonomy, by stage
Every error or critical RunEvent must set a reason_code from the canonical taxonomy. Emitting an unknown code is rejected at the service boundary (invalid_reason_code:<value>). The taxonomy is grouped by stage:
ingestion
ingest_unsupported_format file extension not in the supported set
ingest_io_error disk write/read failure during ingest
ingest_validation_failed row failed schema or content validation
dataset_import_run INFO — generic import pipeline wrote rows
(payload: source, locator, mapper,
accepted/rejected counts, written_path, config_id)
dataset_import_failed import pipeline raised before any rows written
annotation_job_created INFO — label job created (payload: job_id, name, label_type, target_rows)
annotation_label_submitted INFO — reviewer submitted a single label
annotation_rows_promoted INFO — labeled rows materialized to a downstream dataset
distillation_teacher_capture INFO — Track 4 slice 1 — teacher top-k logprobs captured
cleaning
cleaning_outlier_threshold_exceeded outlier removal removed more rows than the configured cap
cleaning_pii_block PII / safety scan blocked the dataset from advancing
adapter
adapter_schema_mismatch declared input/output schema could not be matched to the data
adapter_field_resolution_failed adapter could not resolve a field mapping (missing column, wrong type)
training
training_dispatch_error failure dispatching the run to the runtime backend
training_runtime_error generic runtime failure inside the training loop
training_oom GPU out-of-memory during training
training_timeout run exceeded its wallclock budget
training_cancelled operator (or upstream signal) cancelled the run
eval
eval_runtime_error generic failure inside the evaluation runner
eval_dataset_missing eval pack referenced a dataset that no longer exists
eval_judge_unavailable LLM-as-judge call failed (provider down / quota / config)
export
export_run_failed generic export failure (artifact build, manifest write, smoke check)
export_artifact_missing required model / tokenizer artifact missing at export time
export_quantization_failed quantization step exited non-zero or produced an invalid artifact
deployment
deployment_smoke_failed post-deploy smoke check failed against the live endpoint
deployment_promote_blocked promote refused (status not promotable / readiness gate failed)
deployment_rollback_no_predecessor rollback refused because no superseded predecessor exists
deployment_drift_detected drift check found pass-rate delta beyond tolerance vs baseline
autopilot
autopilot_repair_blocked autopilot refused a repair (strict mode or unsafe action)
autopilot_strict_mode_refused strict mode rejected an auto-repair that would otherwise have applied
autopilot_no_safe_plan autopilot could not construct a safe plan from the current intent
system
system_db_error unexpected database error from a service-internal write path
system_config_invalid required configuration value missing or malformed at runtime
extension_load_failed plugin module import / register hook raised
extension_contract_invalid plugin module failed one or more contract checks
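The boundary rules stated at the top of this section (a reason code is required for error and critical events, and unknown codes are rejected) might look roughly like the check below. It is a sketch under assumptions: the function name `check_event`, the use of `ValueError`, and the first two messages are invented for the example, and the code set is abbreviated. Only the `invalid_reason_code:<value>` format comes from the text.

```python
from typing import Optional

# Abbreviated for the example; the full set lives in KNOWN_REASON_CODES in the codebase.
KNOWN_REASON_CODES = frozenset({
    "training_oom", "training_timeout", "eval_judge_unavailable", "dataset_import_run",
})
SEVERITIES = ("info", "warning", "error", "critical")


def check_event(severity: str, reason_code: Optional[str]) -> None:
    """Raise ValueError if the (severity, reason_code) pair breaks the rules."""
    if severity not in SEVERITIES:
        raise ValueError(f"invalid_severity:{severity}")  # hypothetical message
    if severity in ("error", "critical") and reason_code is None:
        raise ValueError("reason_code_required")  # hypothetical message
    if reason_code is not None and reason_code not in KNOWN_REASON_CODES:
        # Message format taken from the reference text.
        raise ValueError(f"invalid_reason_code:{reason_code}")


check_event("info", None)             # fine: info events need no reason code
check_event("error", "training_oom")  # fine: known code on an error event
```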
Stable vs evolving
The nine stages and four severities are stable: they change only with a major contract bump. The reason-code list grows as new failure modes are surfaced, but codes are never silently retired (each addition is a one-line change to the taxonomy file). Treat this as the catalogue as of the current contract version; KNOWN_REASON_CODES in the codebase is the runtime source of truth.
Coach Mode — the contract
Coach Mode is a per-stage suggestion service. The UI mounts a CoachStrip per panel that calls GET /api/projects/{id}/coach/{stage} and renders the suggestions returned with click-to-execute action buttons.
The five workflow stages Coach speaks at
data dataset import / Adapter Studio / synthetic playbook activity
cleaning cleaning rules and review queue
gold_set gold-set status and top-up suggestions
training training config + preflight + active runs
eval eval pack + failure clusters + remediation
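Note that these five Coach stages are a different set from the nine RunEvent stages (for instance, `adapter` is a RunEvent stage but not a Coach stage). A small sketch of building the request path for the endpoint named above, with the stage checked first; the helper name `coach_url` and the base URL are assumptions, and no request is actually sent.

```python
# The five workflow stages Coach speaks at, per the list above.
COACH_STAGES = ("data", "cleaning", "gold_set", "training", "eval")


def coach_url(base_url: str, project_id: int, stage: str) -> str:
    """Build the GET /api/projects/{id}/coach/{stage} URL for a Coach stage."""
    if stage not in COACH_STAGES:
        raise ValueError(f"Coach does not speak at stage {stage!r}")
    return f"{base_url.rstrip('/')}/api/projects/{project_id}/coach/{stage}"


# The host and port are placeholders for wherever the API is served.
print(coach_url("http://localhost:8000", 7, "gold_set"))
# prints http://localhost:8000/api/projects/7/coach/gold_set
```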
The three severity levels
Each suggestion carries a severity the UI uses to colour the strip consistently:
- info: informational ("here's a good next step").
- warning: soft ("forecast says this run is borderline; consider topping up the gold set").
- critical: urgent ("your gold set has 12 rows; most useful first models need at least 100").
The canonical suggestion shape
{
"severity": "warning",
"title": "Your gold set has 78 rows",
"body": "Most useful first models need at least 100 rows of labeled examples ...",
"action": { /* one of the three action kinds */ }
}
Generators are read-only — they never mutate project state. Side effects happen only when the user clicks the action button.
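A client consuming this shape could parse it into a typed value and check the severity against the three Coach levels (Coach has no `error` level, unlike RunEvent). The sketch below assumes all four keys are always present; the `Suggestion` class and `parse_suggestion` function are illustrative names, not part of the platform.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Any, Dict

COACH_SEVERITIES = ("info", "warning", "critical")


@dataclass(frozen=True)
class Suggestion:
    severity: str
    title: str
    body: str
    action: Dict[str, Any]


def parse_suggestion(raw: Dict[str, Any]) -> Suggestion:
    """Validate the severity and copy the four fields; raises on a bad shape."""
    severity = raw["severity"]
    if severity not in COACH_SEVERITIES:
        raise ValueError(f"unexpected Coach severity: {severity!r}")
    return Suggestion(severity, raw["title"], raw["body"], raw["action"])


raw = json.loads("""
{
  "severity": "warning",
  "title": "Your gold set has 78 rows",
  "body": "Most useful first models need at least 100 rows of labeled examples ...",
  "action": {"kind": "navigate", "label": "Pick a recipe first",
             "params": {"target": "recipe-picker"}}
}
""")
suggestion = parse_suggestion(raw)
print(suggestion.severity, "-", suggestion.title)
```

Parsing is as read-only as the generators themselves: nothing here triggers the action.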
The three action kinds
Every action carries a kind, a label, and a params dict. The frontend routes the click on kind alone.
{
"kind": "navigate",
"label": "Pick a recipe first",
"params": { "target": "recipe-picker" }
}
- navigate: route the user to a panel. params.target identifies the panel (e.g. recipe-picker, training-config). Used when the next step is "go look at this."
- run_playbook: trigger a recipe-aware synthetic-data playbook in the background. params carries the playbook mode (e.g. positives_paraphrase) and a target_count. Used for "I'll generate the rows for you" actions.
- augment_from_cluster: open a failure cluster as a generation source. params carries the cluster id and a suggested top-up. Used after an eval reveals a cluster of similar failures (Lesson 3.13).
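Routing on `kind` alone amounts to a lookup table from kind to handler. The sketch below is a Python stand-in for that frontend logic, not the real router: the handlers only return a description of what would happen, and the params keys `mode`, `cluster_id`, and `suggested_topup` are assumed names (the text confirms only `target` and `target_count`).

```python
from typing import Any, Callable, Dict

Params = Dict[str, Any]


def _navigate(params: Params) -> str:
    return f"navigate to {params['target']}"


def _run_playbook(params: Params) -> str:
    # "mode" is an assumed key name for the playbook mode.
    return f"run playbook {params['mode']} for {params['target_count']} rows"


def _augment_from_cluster(params: Params) -> str:
    # "cluster_id" and "suggested_topup" are assumed key names.
    return f"open cluster {params['cluster_id']} (top-up {params['suggested_topup']})"


HANDLERS: Dict[str, Callable[[Params], str]] = {
    "navigate": _navigate,
    "run_playbook": _run_playbook,
    "augment_from_cluster": _augment_from_cluster,
}


def route_click(action: Dict[str, Any]) -> str:
    """Dispatch on action["kind"] only; label and params do not affect routing."""
    handler = HANDLERS.get(action["kind"])
    if handler is None:
        raise ValueError(f"unknown action kind: {action['kind']!r}")
    return handler(action.get("params", {}))


print(route_click({"kind": "navigate", "label": "Pick a recipe first",
                   "params": {"target": "recipe-picker"}}))
# prints navigate to recipe-picker
```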
Concrete thresholds you'll see in suggestions
Some Coach generators use named numeric thresholds. For example, the gold-set generator currently uses:
GOLD_ROW_THIN_MAX 99 # row counts <= this are "thin" (severity bumps to "critical")
GOLD_ROW_COMFORTABLE_MIN 300 # target the suggested top-up tries to reach
SUGGESTED_TOPUP_FLOOR 20 # don't suggest below this
SUGGESTED_TOPUP_CEILING 500 # don't suggest above this
Treat these as representative, not part of the public contract — they tune over time as the platform learns better defaults. The stable surface is the severity transitions and the action kinds.
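To make the interplay of these four numbers concrete, here is one plausible reading of how a gold-set generator could combine them. It is an assumption-heavy sketch: the function `gold_set_hint`, the "warning" band between thin and comfortable, and the clamp-to-floor-and-ceiling behaviour are guesses at the intent of the comments, not the platform's actual logic.

```python
from typing import Optional, Tuple

GOLD_ROW_THIN_MAX = 99
GOLD_ROW_COMFORTABLE_MIN = 300
SUGGESTED_TOPUP_FLOOR = 20
SUGGESTED_TOPUP_CEILING = 500


def gold_set_hint(row_count: int) -> Optional[Tuple[str, int]]:
    """Return (severity, suggested_topup), or None when no top-up is needed."""
    if row_count >= GOLD_ROW_COMFORTABLE_MIN:
        return None
    # At or below the thin threshold the severity is "critical" (from the comment
    # on GOLD_ROW_THIN_MAX); "warning" for the band above it is an assumption.
    severity = "critical" if row_count <= GOLD_ROW_THIN_MAX else "warning"
    gap = GOLD_ROW_COMFORTABLE_MIN - row_count
    # Assumed: the suggestion is the gap to the comfortable minimum, clamped
    # into [SUGGESTED_TOPUP_FLOOR, SUGGESTED_TOPUP_CEILING].
    topup = max(SUGGESTED_TOPUP_FLOOR, min(SUGGESTED_TOPUP_CEILING, gap))
    return severity, topup


print(gold_set_hint(12))   # ('critical', 288)
print(gold_set_hint(290))  # ('warning', 20)
print(gold_set_hint(300))  # None
```

With these particular values the ceiling never binds, since the gap cannot exceed 300; it would only matter if the comfortable minimum were raised.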
Key idea
Two surfaces, one audit spine. RunEvent records what happened with a stable (stage, severity, reason_code) tuple and a JSON payload. Coach Mode tells you what to do next with a (stage, severity, action) shape and three action kinds: navigate, run_playbook, augment_from_cluster. Both are contracts: stages and severities are stable, while the reason-code list grows and threshold values are tuned as the platform evolves.
The last reference catalogues the consumer that closes the loop on these events: the eval pack's gate schema and the failure-cluster surface that augment_from_cluster opens.
Key terms
- RunEvent: The canonical audit-row schema. Append-only; one row per "interesting thing happened" across all nine stages.
- RunEvent stage: One of nine canonical strings (ingestion, cleaning, adapter, training, eval, export, deployment, autopilot, system).
- RunEvent severity: info / warning / error / critical. Errors and criticals must carry a reason code.
- Reason code: Lint-gated string from the canonical taxonomy that names a failure mode (e.g. training_oom, deployment_drift_detected).
- Coach stage: One of five workflow surfaces Coach speaks at: data, cleaning, gold_set, training, eval.
- Coach action kind: One of three: navigate (route), run_playbook (trigger generation), augment_from_cluster (open a failure cluster as a data source).