Observability and drift in production
After this lesson you can explain how the RunEvent spine makes a pipeline observable, what drift detection does, and how to decide when a production model needs retraining.
A deployed model meets inputs you never trained on, and the world it learned keeps moving. Observability is how you notice when it stops being good — before your users do.
The RunEvent spine, recapped
From Track 3: every stage emits a RunEvent via emit_event() — a reason code from a lint-gated taxonomy plus a severity (info / warning / error / critical). One stream feeds every observability surface: the timeline, the failure-cluster view, the audit explorer, and the support bundle. The emission is best-effort (wrapped in try/except at each call site), so a bug in observability can never break the actual data or training path. That single design choice is what makes the whole pipeline debuggable after the fact.
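The best-effort pattern described above can be sketched as follows. This is a minimal illustration, not BrewSLM's actual implementation: the RunEvent fields, the emit_event signature, the sink argument, and the prepare_completed reason code are all assumptions made for the example. Only the idea (wrap each emit call in try/except so a failing observability layer cannot break the stage) comes from the lesson.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum

logger = logging.getLogger("observability")


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"
    CRITICAL = "critical"


@dataclass
class RunEvent:
    # Hypothetical shape: the real RunEvent may carry different fields.
    stage: str
    reason_code: str
    severity: Severity
    details: dict = field(default_factory=dict)


def emit_event(event, sink):
    """Hypothetical emitter: hands the event to a sink and may raise if the sink fails."""
    sink(event)


def run_stage(records, sink):
    """Hypothetical stage: does its real work, then emits an event best-effort."""
    cleaned = [r.strip().lower() for r in records]  # the actual data path
    try:
        emit_event(
            RunEvent("prepare", "prepare_completed", Severity.INFO, {"rows": len(cleaned)}),
            sink,
        )
    except Exception:
        # Observability failed; log it and carry on with the real work.
        logger.exception("emit_event failed; continuing")
    return cleaned


def broken_sink(event):
    raise RuntimeError("event store unavailable")


collected = []
ok_result = run_stage(["  Hello ", "World"], collected.append)
broken_result = run_stage(["  Hello ", "World"], broken_sink)
```

Both calls return the same cleaned records. With the broken sink the event is lost, but the stage's output is unaffected, which is the trade-off best-effort emission accepts.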
Drift: quality decays quietly
A model that scored 0.93 at launch can be 0.80 six months later without anyone touching it — because the inputs drifted. New slang, new product names, a new kind of customer. This is drift, and it's invisible unless you measure for it.
Drift detection: re-run the gold set on the live endpoint
BrewSLM's deploy stage (Track 3) includes a scheduled drift check: it periodically re-runs your gold set against the live endpoint and compares the result to the launch baseline. A significant drop emits deployment_drift_detected. The gold set you built in Track 1 is now a production sensor.
# scheduled drift check (deploy stage)
re-run gold set vs live endpoint
launch: accuracy 0.93
today: accuracy 0.86 → drift! emit deployment_drift_detected
# also: deployment_smoke_failed (promote-time), deployment_promote_blocked
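A drift check of this kind might look like the sketch below. The function names, the (text, label) gold-set shape, the max_drop threshold of 0.05, and the warning severity are assumptions for illustration; the lesson only specifies that the gold set is re-run against the live endpoint, compared to the launch baseline, and that a significant drop emits deployment_drift_detected. The two endpoint functions are stand-ins for a call to the deployed model.

```python
from typing import Callable, List, Optional, Tuple

GoldSet = List[Tuple[str, str]]  # (input text, expected label) pairs


def accuracy(gold_set: GoldSet, predict: Callable[[str], str]) -> float:
    """Fraction of gold examples the predictor labels correctly."""
    if not gold_set:
        raise ValueError("gold set is empty")
    correct = sum(1 for text, expected in gold_set if predict(text) == expected)
    return correct / len(gold_set)


def check_drift(
    gold_set: GoldSet,
    predict: Callable[[str], str],
    baseline: float,
    max_drop: float = 0.05,  # assumed threshold, not a documented default
) -> Optional[dict]:
    """Return a drift event if accuracy fell more than max_drop below baseline."""
    current = accuracy(gold_set, predict)
    drop = baseline - current
    if drop > max_drop:
        return {
            "reason_code": "deployment_drift_detected",
            "severity": "warning",  # assumed severity
            "baseline": baseline,
            "current": current,
            "drop": drop,
        }
    return None


# Toy gold set and stand-in endpoints, for illustration only.
GOLD_SET: GoldSet = [
    (f"example {i}", "positive" if i % 2 == 0 else "negative") for i in range(10)
]


def healthy_endpoint(text: str) -> str:
    i = int(text.split()[-1])
    return "positive" if i % 2 == 0 else "negative"


def drifted_endpoint(text: str) -> str:
    # Gets the last two examples wrong, simulating inputs the model no longer handles.
    i = int(text.split()[-1])
    return "unknown" if i >= 8 else healthy_endpoint(text)


healthy_event = check_drift(GOLD_SET, healthy_endpoint, baseline=0.93)
drift_event = check_drift(GOLD_SET, drifted_endpoint, baseline=0.93)
```

The healthy endpoint produces no event. The drifted one scores 8 of 10, which is more than 0.05 below the 0.93 baseline, so a deployment_drift_detected event is returned. In a real deployment the threshold would need tuning against the gold set's size, since a small gold set makes accuracy noisy.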
Your gold set keeps working
The gold set you curated for evaluation doesn't retire at launch. It becomes the regression sensor that keeps watching the live model for as long as it stays deployed. That's a big payoff for the work you put into it in Track 1.
When to retrain
Drift detection tells you that quality dropped; deciding what to do is judgment:
- Drift detected → gather recent real inputs, add them (via the review queue), re-prepare, re-train, re-evaluate against the same gates. The Track 3 loop, triggered by a signal instead of a hunch.
- New failure clusters in production logs → a targeted data fix (augment_from_cluster), not a full retrain.
- Stable → leave it alone. Retraining a healthy model just risks regressions.
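The three cases above can be written down as a small decision function. This is a sketch of the default mapping only: the Action names and the decide_action signature are hypothetical, and checking drift before failure clusters is an assumption about precedence that the lesson does not state. The real decision remains a judgment call; the code just makes the starting point explicit.

```python
from enum import Enum


class Action(Enum):
    FULL_RETRAIN = "gather recent inputs, review, re-prepare, re-train, re-evaluate"
    TARGETED_DATA_FIX = "augment_from_cluster on the new failure clusters"
    LEAVE_ALONE = "no change"


def decide_action(drift_detected: bool, new_failure_clusters: int) -> Action:
    """Map production signals to a default next step (hypothetical helper)."""
    if drift_detected:
        # Assumed precedence: a quality drop on the gold set outranks cluster fixes.
        return Action.FULL_RETRAIN
    if new_failure_clusters > 0:
        return Action.TARGETED_DATA_FIX
    # Stable model: retraining would only risk regressions.
    return Action.LEAVE_ALONE


drift_case = decide_action(drift_detected=True, new_failure_clusters=0)
cluster_case = decide_action(drift_detected=False, new_failure_clusters=2)
stable_case = decide_action(drift_detected=False, new_failure_clusters=0)
```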
Key idea
The RunEvent spine makes every stage observable; best-effort emission means observability never breaks the data path. Drift detection re-runs your gold set against the live endpoint on a schedule, turning it into a production sensor — and a drop is the signal to gather fresh data and retrain through the same gates.
Key terms
- observability: The ability to see what the pipeline did and how the live model is performing.
- RunEvent spine: The one event stream every stage emits to, feeding all monitoring surfaces; best-effort so it can't break the data path.
- drift: Production quality decay caused by input distributions changing over time.
- drift detection: A scheduled re-run of the gold set against the live endpoint to catch quality drops.
- support bundle: A consumer of the RunEvent stream that packages the events and state needed to debug an issue.
- retrain trigger: Using a drift signal or new failure clusters to decide when to gather data and retrain.