Train: jobs, the bell, and the delta-from-baseline curve
After this lesson you can launch training as a background Job, monitor it via the notification bell and the live loss curve, read the delta-from-baseline view, and recognize the named training failure events.
This is the stage you know best — it's trainer.train() from lesson 2.5. The difference is everything around the loop: it runs as a tracked Job, streams progress to a bell, and plots its loss against your baseline.
Launch — one click or one call
From the Training Config page you can launch with the chosen recipe in one click, or script the same launch from the command line:
$ curl -X POST localhost:8000/api/projects/1/training/run \
    -H 'Content-Type: application/json' \
    -d '{"autopilot": true, "one_click": true}'
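The same launch can be scripted from Python. This is a minimal sketch using only the standard library, assuming the server from the curl example is reachable at http://localhost:8000. The function name build_launch_request is hypothetical, and the response format is not specified in this lesson, so the sketch only builds the request and leaves sending it as a commented step.

```python
import json
import urllib.request


def build_launch_request(
    project_id: int, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build the POST request that mirrors the curl command above.

    base_url is an assumption; the path and payload come from the lesson.
    """
    payload = {"autopilot": True, "one_click": True}
    return urllib.request.Request(
        url=f"{base_url}/api/projects/{project_id}/training/run",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_launch_request(1)
# To actually launch the run (requires a running server):
# urllib.request.urlopen(request)
```

Separating request construction from sending keeps the launch call easy to inspect or log before anything runs.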
Training takes the prepared manifest + chosen recipe + checkpoint cadence, and produces checkpoints, per-step loss / eval traces, and a final adapter weights blob — the same outputs your Trainer wrote to sft-out/, now first-class records on the experiment.
It runs as a background Job
Long-running work in BrewSLM is a Job persisted to a table. The top-bar notification bell polls /api/jobs/active every ~4 seconds and surfaces progress and outcome, so you don't sit watching a terminal. A watcher Job mirrors the experiment's progress into the bell, and the Job framework holds a strong reference to the running task so it can't be silently garbage-collected mid-run.
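The bell's polling behaviour can be imitated from a script. The sketch below is an illustration, not BrewSLM's own client: the /api/jobs/active path and the roughly 4-second interval come from the lesson, while the host, the assumption that the endpoint returns a JSON list, and the names fetch_active_jobs and poll_until_idle are hypothetical.

```python
import json
import time
import urllib.request
from typing import Callable, Iterator

# Path from the lesson; host and port are assumed.
ACTIVE_JOBS_URL = "http://localhost:8000/api/jobs/active"


def fetch_active_jobs() -> list:
    """GET the active-jobs endpoint.

    Assumes the response body is a JSON list of jobs (not confirmed here).
    """
    with urllib.request.urlopen(ACTIVE_JOBS_URL) as response:
        return json.load(response)


def poll_until_idle(
    fetch: Callable[[], list] = fetch_active_jobs,
    interval_s: float = 4.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[list]:
    """Yield each non-empty snapshot of active jobs; stop when none remain."""
    while True:
        jobs = fetch()
        if not jobs:
            return
        yield jobs
        sleep(interval_s)
```

Usage would look like `for jobs in poll_until_idle(): print(len(jobs), "job(s) still running")`. Passing fetch and sleep as parameters lets you exercise the loop without a server.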
The delta-from-baseline curve
In Track 2 you read raw loss numbers scrolling past. BrewSLM plots the live loss curve as a delta from baseline — the untuned base model's loss on the same data is the zero line, and your run is drawn relative to it. That reframes the question from "is 0.58 good?" (which you can't judge in isolation) to "am I beating the baseline, and by how much?" — exactly the comparison Track 1's evaluation lesson insisted on, now live during training.
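The arithmetic behind the view is a subtraction. The sketch below assumes a single scalar baseline loss and the sign convention delta = run loss minus baseline loss, so negative values mean the run is beating the baseline; BrewSLM's exact convention is not stated in this lesson, and the loss values other than 0.58 are made up for illustration.

```python
def delta_from_baseline(run_losses, baseline_loss):
    """Return each run loss relative to the baseline (the zero line).

    Assumed convention: negative delta = lower loss than the untuned base model.
    """
    return [loss - baseline_loss for loss in run_losses]


def beats_baseline(delta):
    """True when the run's loss is below the baseline's."""
    return delta < 0


# Illustrative numbers only: an assumed baseline of 0.80 and three run losses.
deltas = delta_from_baseline([0.92, 0.71, 0.58], baseline_loss=0.80)
for step, delta in enumerate(deltas):
    status = "beating baseline" if beats_baseline(delta) else "behind baseline"
    print(f"step {step}: delta={delta:+.2f} ({status})")
```

With these numbers, 0.58 stops being an isolated value and becomes a gap of about 0.22 below the zero line.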
From Track 2
Everything you learned about reading curves (lesson 2.6) applies here — falling-and-flattening is healthy, a rising eval trace is overfitting. The delta-from-baseline view just makes "better than doing nothing" the explicit y-axis.
Completion and failure are events
When the run finishes, it emits a training event at info level with a payload of experiment_id, backend, and final_train_loss. When it fails, it emits a named event so you know exactly what happened:
training (info)          # completed: experiment_id, backend, final_train_loss
training_oom             # ran out of GPU memory
training_runtime_error   # crashed mid-run
training_timeout         # exceeded the time budget
training_cancelled       # you cancelled it
Those map directly onto the failure modes you learned to diagnose by hand — but now they're typed, logged, and visible in the failure-cluster surface rather than buried in a stack trace. With a trained adapter and its traces recorded, the next stage asks the real question: is it good enough to ship? That's evaluation.
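Because the events listed above are named, a script can branch on the name instead of parsing a stack trace. The sketch below assumes each event arrives as a dict with "name" and "payload" keys; that shape and the function summarize_event are hypothetical, while the event names and payload fields come from the list above.

```python
# Failure event names and meanings, taken from the lesson's list.
FAILURE_EVENTS = {
    "training_oom": "ran out of GPU memory",
    "training_runtime_error": "crashed mid-run",
    "training_timeout": "exceeded the time budget",
    "training_cancelled": "cancelled by the user",
}


def summarize_event(event: dict) -> str:
    """Turn one event into a one-line summary.

    Assumes the shape {"name": str, "payload": dict}, which is not confirmed.
    """
    name = event["name"]
    if name == "training":
        payload = event["payload"]
        return (
            f"experiment {payload['experiment_id']} completed on "
            f"{payload['backend']} with final_train_loss="
            f"{payload['final_train_loss']}"
        )
    if name in FAILURE_EVENTS:
        return f"failed: {FAILURE_EVENTS[name]}"
    return f"unrecognized event: {name}"


print(summarize_event({"name": "training_oom", "payload": {}}))
```

Unknown names fall through to an explicit "unrecognized" message rather than being treated as success.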
Key idea
Stage 08 is your training loop, watched: a background Job with bell progress, a live delta-from-baseline curve that makes 'beating the base' the y-axis, and named RunEvents for completion and every failure mode.
Key terms
- training Job: The background, persisted task that runs stage 08; tracked so it can't be GC'd mid-run.
- NotificationBell: Top-bar surface polling /api/jobs/active (~every 4s) for progress and outcome.
- delta-from-baseline curve: The live loss plotted relative to the untuned base model's loss as the zero line.
- training RunEvents: Named events: training (info) on completion; training_oom / _runtime_error / _timeout / _cancelled on failure.