Track 4 · Advanced · Lesson 9

Observability and drift in production

After this lesson you can explain how the RunEvent spine makes a pipeline observable, what drift detection does, and how to decide when a production model needs retraining.

Level: advanced · Read time: ~9 min · Prerequisites: Serving and inference optimization

A deployed model meets inputs you never trained on, and the world it learned keeps moving. Observability is how you notice when it stops being good — before your users do.

The RunEvent spine, recapped

From Track 3: every stage emits a RunEvent via emit_event() — a reason code from a lint-gated taxonomy plus a severity (info / warning / error / critical). One stream feeds every observability surface: the timeline, the failure-cluster view, the audit explorer, and the support bundle. The emission is best-effort (wrapped in try/except at each call site), so a bug in observability can never break the actual data or training path. That single design choice is what makes the whole pipeline debuggable after the fact.
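A minimal sketch of what best-effort emission might look like at a call site. The lesson confirms only the name emit_event(), the reason code, and the four severities; the function signature, the RunEvent fields, the sink argument, and the reason code "stage_started" are assumptions made for illustration, not BrewSLM's actual API.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Severity(Enum):
    INFO = "info"
    WARNING = "warning"
    ERROR = "error"
    CRITICAL = "critical"


@dataclass
class RunEvent:
    # Field names are assumptions; the lesson names only a reason code and a severity.
    stage: str
    reason_code: str
    severity: Severity
    timestamp: float = field(default_factory=time.time)


def emit_event(sink: Callable[[RunEvent], None], event: RunEvent) -> None:
    """Hypothetical emitter: hands the event to a sink, which may raise."""
    sink(event)


def broken_sink(event: RunEvent) -> None:
    """Simulates an observability bug or an unreachable event store."""
    raise RuntimeError("event store unavailable")


def run_stage(rows: List[int], sink: Callable[[RunEvent], None]) -> int:
    """A stand-in pipeline stage whose real work is summing rows."""
    try:
        # Best-effort: wrapped at the call site, as the lesson describes.
        emit_event(sink, RunEvent("train", "stage_started", Severity.INFO))
    except Exception:
        pass  # an emission failure must not stop the stage
    return sum(rows)


collected: List[RunEvent] = []
total_with_working_sink = run_stage([1, 2, 3], collected.append)
total_with_broken_sink = run_stage([1, 2, 3], broken_sink)
print(total_with_working_sink, total_with_broken_sink, len(collected))  # prints: 6 6 1
```

The stage returns the same result whether or not the event was recorded. That is the property the best-effort design is meant to protect: the only thing lost when observability fails is the event itself.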

Drift: quality decays quietly

A model that scored 0.93 at launch can be 0.80 six months later without anyone touching it — because the inputs drifted. New slang, new product names, a new kind of customer. This is drift, and it's invisible unless you measure for it.

Drift detection: re-run the gold set on the live endpoint

BrewSLM's deploy stage (Track 3) includes a scheduled drift check: it periodically re-runs your gold set against the live endpoint and compares the result to the launch baseline. A significant drop emits deployment_drift_detected. The gold set you built in Track 1 is now a production sensor.

# scheduled drift check (deploy stage)
re-run gold set vs live endpoint
  launch:  accuracy 0.93
  today:   accuracy 0.86   → drift!  emit deployment_drift_detected
# also: deployment_smoke_failed (promote-time), deployment_promote_blocked
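
The check above could be sketched roughly as follows. This is an illustration under stated assumptions, not BrewSLM's implementation: the check_drift function, the max_drop threshold of 0.05, the gold-set shape (input text paired with an expected label), and the toy live_endpoint are all hypothetical. The lesson says only that a "significant" drop emits deployment_drift_detected, so the threshold here is a placeholder for whatever rule your deployment uses.

```python
from typing import Callable, Optional, Sequence, Tuple

GoldExample = Tuple[str, str]  # (input text, expected label); shape is an assumption


def accuracy(predict: Callable[[str], str], gold_set: Sequence[GoldExample]) -> float:
    """Fraction of gold examples the endpoint labels correctly."""
    if not gold_set:
        raise ValueError("gold set is empty")
    correct = sum(1 for text, expected in gold_set if predict(text) == expected)
    return correct / len(gold_set)


def check_drift(
    predict: Callable[[str], str],
    gold_set: Sequence[GoldExample],
    baseline_accuracy: float,
    max_drop: float = 0.05,  # assumed threshold; tune to your gold-set size
) -> Tuple[float, Optional[str]]:
    """Re-run the gold set and return (current accuracy, reason code or None)."""
    current = accuracy(predict, gold_set)
    if baseline_accuracy - current > max_drop:
        return current, "deployment_drift_detected"
    return current, None


def live_endpoint(text: str) -> str:
    """Toy stand-in for the live model: it only knows the word 'good'."""
    return "positive" if "good" in text else "negative"


gold_set = [
    ("good coffee", "positive"),
    ("bad coffee", "negative"),
    ("good service", "positive"),
    ("slow service", "negative"),
    ("fire espresso", "positive"),  # new slang the toy model misses
]

current, reason = check_drift(live_endpoint, gold_set, baseline_accuracy=1.0)
print(current, reason)
```

With a small gold set, a single wrong answer moves accuracy a lot (here one miss out of five is 0.2), so a fixed threshold is only sensible once the gold set is large enough for the drop to be meaningful.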

Your gold set keeps working

The gold set you curated for evaluation doesn't retire at launch — it becomes the regression sensor that watches the live model forever. That's a big payoff for the work you put into it in Track 1.

When to retrain

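Drift detection tells you that quality dropped; deciding what to do about it is a judgment call. A drift signal or a set of new failure clusters is the cue to gather fresh data and retrain through the same gates.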

Key idea

The RunEvent spine makes every stage observable; best-effort emission means observability never breaks the data path. Drift detection re-runs your gold set against the live endpoint on a schedule, turning it into a production sensor — and a drop is the signal to gather fresh data and retrain through the same gates.

Key terms

observability
The ability to see what the pipeline did and how the live model is performing.
RunEvent spine
The one event stream every stage emits to, feeding all monitoring surfaces; best-effort so it can't break the data path.
drift
Production quality decay caused by input distributions changing over time.
drift detection
A scheduled re-run of the gold set against the live endpoint to catch quality drops.
support bundle
A consumer of the RunEvent stream packaging the events/state needed to debug an issue.
retrain trigger
Using a drift signal or new failure clusters to decide when to gather data and retrain.
