Experiment tracking with MLflow & W&B
After this lesson you can pick an experiment-tracking tool that matches your situation (MLflow, W&B, or TensorBoard), log the minimum useful set of run metadata, point an HF Trainer at it with one flag, draw the reproducible-vs-replicable line so you know which one you're actually achieving, and answer "why did this regress?" three weeks later with two clicks instead of a forensic excavation.
Track 2 has spent the last several lessons making your evaluation honest: gold sets, real metrics, LLM-as-a-judge, public benchmarks. None of that helps if, three weeks from now, you can't answer the question "which run produced that checkpoint?" A training run is a vector of decisions: data split, learning rate, batch size, LoRA rank, evaluation cadence, code version, library version, GPU, seed. Forget any one of them and the next comparison is unreliable. Experiment tracking is the bookkeeping layer that makes the vector retrievable. It is also the cheapest insurance in the pipeline; teams that skip it are the ones who later say "the old model was better but we don't know why."
What ad-hoc loses to
The pattern that ends in tears: you tune a fine-tune until it passes a good gate, ship the checkpoint, and move on. A month later eval drops. Someone re-runs "the same recipe", but the dataset has been mutated, an HF library has been bumped, the random seed has shifted, and the LR schedule has subtly changed. The "regression" might be your code, your data, your tools, or all three. Without a record, the only honest answer is "I don't know."
Tracking turns that question into a join: previous run's config + current run's config → here are the deltas. Two clicks. The first time it saves you a day of forensic work, the setup cost has paid for itself.
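The "join" above can be sketched in a few lines of plain Python. This is a minimal illustration, not how MLflow or W&B implement their comparison views; it assumes each run's config is available as a (possibly nested) dict, and the function names and example values are hypothetical.

```python
from typing import Any

MISSING = "<missing>"  # marks a key that one run logged and the other did not


def flatten(config: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into dotted keys: {"lora": {"r": 16}} -> {"lora.r": 16}."""
    flat: dict[str, Any] = {}
    for key, value in config.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def config_deltas(previous: dict[str, Any], current: dict[str, Any]) -> dict[str, tuple]:
    """Return {key: (previous_value, current_value)} for every key that differs."""
    prev, curr = flatten(previous), flatten(current)
    deltas = {}
    for key in sorted(prev.keys() | curr.keys()):
        before, after = prev.get(key, MISSING), curr.get(key, MISSING)
        if before != after:
            deltas[key] = (before, after)
    return deltas


if __name__ == "__main__":
    # Hypothetical configs for two runs of "the same recipe".
    old_run = {"lr": 2e-4, "seed": 42, "lora": {"r": 16, "alpha": 32}}
    new_run = {"lr": 2e-4, "seed": 7, "lora": {"r": 16, "alpha": 64}}
    for key, (before, after) in config_deltas(old_run, new_run).items():
        print(f"{key}: {before} -> {after}")
```

A key that shows up as `<missing>` on one side is itself a finding: it means the two runs were not logged with the same schema.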
The three tools in practice
MLflow — open-source, self-hosted, free
MLflow is the Apache-2.0-licensed open-source standard. Self-host the tracking server (a single process backed by a database and an artefact store), point runs at it with the MLFLOW_TRACKING_URI env var, and you get a web UI for runs / metrics / artefacts / model registry. The right pick when you can't send run metadata to a third party (compliance, IP) or when you want a free tool that works at any scale. Slightly less polished UI than W&B; substantially more honest about local-first workflows.
Weights & Biases — hosted, opinionated, free for individuals
W&B is a hosted product with a free tier for personal use and academic projects, and a paid tier for teams. The dashboards, sweeps (hyperparameter search), and report-sharing features are the most polished in the space. The right pick when you want zero-ops tracking and your data can leave the network. Lock-in is real — moving away from W&B usually means re-running everything for the artefacts.
TensorBoard — offline, minimal
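TensorBoard is the original. It comes from the TensorFlow project, and PyTorch writes to it through torch.utils.tensorboard once the tensorboard package is installed. It writes event files to disk that any compatible viewer can render. No tracking server to operate, no account; the viewer is a local process you start only when you want to look. Good for single-machine work and for a permanent backup of scalar metrics. The right pick when you just want loss curves on disk and don't need the run-registry / artefact / sweep tooling.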
The combination most teams settle on: MLflow for the source of truth (config, artefacts, registry) + TensorBoard event files mirrored in the run directory as a portable backup. W&B becomes the right call if you specifically want sweeps or shared reports.
What to log on every run
The minimum useful set:
- Full config — every hyperparameter, not just the ones you changed today. The dataset path, the model id, the LR, the schedule, the batch size, the LoRA rank/alpha/dropout, the precision, the max seq length, the eval cadence, the seed. Log the same config schema for every run; missing fields make later joins lie.
- Code state — the git SHA at run time, and whether the working tree was dirty (uncommitted changes). A dirty run is forensically useless and you should know which runs were.
- Dataset hash — a hash of the actual data the model trained on (every row, in order). If the dataset gets mutated and you can't reproduce a run, the hash is the first thing you check.
- Per-step loss + eval — training loss and validation loss at the cadence the Trainer logs, plus any task metric you compute (Lessons 2.7, 2.12).
- System metrics — GPU utilisation, GPU memory, host RAM, disk I/O. Knowing whether a run was GPU-bound vs IO-bound vs underutilised matters when you tune for throughput.
- Output artefacts — the final checkpoint (or its path), the LoRA adapter, the eval results JSON, the lm-eval-harness output, the loss curve plot. Tracked artefacts beat "I uploaded it to a bucket somewhere."
- Decoding sweeps you ran — if you tried temperature ∈ {0, 0.3, 0.7} on the same checkpoint, log each as a child run, not as one merged number.
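The code-state and dataset-hash items above can be gathered with the standard library alone. The sketch below assumes the dataset is a single file on disk (for a sharded dataset you would hash the shards in a fixed order) and that the script runs inside a git checkout; the function names and the default package list are illustrative, not part of any tracker's API.

```python
import hashlib
import platform
import subprocess
from importlib import metadata
from typing import Optional


def git_output(*args: str) -> Optional[str]:
    """Run a git command and return its stdout, or None outside a git checkout."""
    try:
        result = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return result.stdout.strip()


def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file's bytes in chunks so a large dataset need not fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_version(name: str) -> Optional[str]:
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def collect_run_metadata(data_path: str, packages=("torch", "transformers", "peft")) -> dict:
    """Build the non-metric record of a run; log the result as params on your tracker."""
    status = git_output("status", "--porcelain")
    return {
        "git_sha": git_output("rev-parse", "HEAD"),
        "git_dirty": None if status is None else bool(status),  # True = uncommitted changes
        "dataset_hash": file_sha256(data_path),
        "python": platform.python_version(),
        "libraries": {name: package_version(name) for name in packages},
    }
```

Calling `collect_run_metadata` once at the top of the training script, before anything else touches the data, keeps the record tied to what the run actually saw.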
Integration with the HF Trainer
The good news: HF's TrainingArguments has a one-flag integration with all three tools. The boilerplate stays in one place.
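```python
import hashlib
import os
import subprocess

import mlflow
from transformers import TrainingArguments

# MLflow: point at a local or remote tracking URI
os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"  # or your hosted URI
os.environ["MLFLOW_EXPERIMENT_NAME"] = "smollm2-sentiment"

args = TrainingArguments(
    output_dir="runs/sft-v3",
    report_to="mlflow",           # or "wandb", "tensorboard", or a list
    run_name="sft-v3-lora-r16",   # what you see in the UI
    logging_steps=10,             # per-step loss cadence
    eval_strategy="steps",        # named evaluation_strategy in older transformers releases
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    # ... the rest of your training config
)

# `trainer` (built from `args`) and DATA_PATH (the training data file) are defined elsewhere.
# Log dataset hash + code SHA explicitly; Trainer doesn't do this for you.
# Because the run is opened here, the Trainer's MLflow callback logs into it,
# so the experiment and run name are set here as well.
mlflow.set_experiment(os.environ["MLFLOW_EXPERIMENT_NAME"])
with mlflow.start_run(run_name=args.run_name):
    mlflow.log_param("git_sha", subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip())
    mlflow.log_param("git_dirty", bool(subprocess.check_output(["git", "status", "--porcelain"])))
    with open(DATA_PATH, "rb") as f:
        mlflow.log_param("dataset_hash", hashlib.sha256(f.read()).hexdigest()[:12])
    trainer.train()
    # eval_results.json exists only after the metrics are saved to output_dir
    trainer.save_metrics("eval", trainer.evaluate())
    mlflow.log_artifact("runs/sft-v3/eval_results.json")
```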
For W&B, set report_to="wandb" and os.environ["WANDB_PROJECT"] = "smollm2-sentiment". For TensorBoard, report_to="tensorboard" writes event files under output_dir/runs by default (here, runs/sft-v3/runs), and tensorboard --logdir runs/ finds and renders them.
The reproducible-vs-replicable distinction
Two words often used interchangeably; tracking is honest only if you respect the difference.
- Reproducible: same inputs → bit-for-bit identical outputs. Same checkpoint file, same eval number to the last decimal. Achievable in ML only with strict pinning of the random seed, the library versions, the GPU architecture (CUDA non-determinism on some ops), the data ordering, and the floating-point precision. Even with all of that, distributed training adds non-determinism most teams don't fully control.
- Replicable: same recipe → equivalent outcome within noise. Re-run the recipe; the gold-set accuracy lands in the same confidence interval. This is what most real ML workflows can actually achieve.
Tracking should target replicability. The metadata you log makes it possible to re-run a recipe and verify the outcome is within noise. Chasing reproducibility — bit-equivalence — usually means pinning the seed for political reasons rather than practical ones, and it doesn't buy you what you want. If your evaluation has a meaningful confidence interval, equivalence-within-noise is the right bar.
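One way to make "within noise" concrete is an interval on the accuracy difference between the original run and the re-run. The sketch below uses a normal-approximation 95% interval and treats the two runs as independent samples scored on gold sets of the same size; both are simplifying assumptions (a paired test on the shared gold set would be tighter), and the function names and counts are hypothetical.

```python
import math


def accuracy_diff_interval(correct_a: int, correct_b: int, n: int, z: float = 1.96):
    """Approximate interval for accuracy(b) - accuracy(a), each scored on n examples."""
    p_a, p_b = correct_a / n, correct_b / n
    # Standard error of the difference of two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    diff = p_b - p_a
    return diff - z * se, diff + z * se


def replicates(correct_a: int, correct_b: int, n: int, z: float = 1.96) -> bool:
    """True when zero lies inside the interval, i.e. the gap is within noise."""
    low, high = accuracy_diff_interval(correct_a, correct_b, n, z)
    return low <= 0.0 <= high


if __name__ == "__main__":
    # Hypothetical counts of correct answers on a 500-example gold set.
    print(replicates(430, 436, 500))  # small gap between original and re-run
    print(replicates(430, 380, 500))  # large gap between original and re-run
```

With a small gold set the interval is wide, so a re-run can differ by a few points and still count as replicated; that is the gold set's resolution limit, not a flaw in the recipe.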
Honest beat — a dashboard isn't a workflow
It's easy to set up MLflow or W&B, get a pretty chart, and feel rigorous. The work that actually pays off is the boring part: making sure every team member logs the same config schema, the same dataset hash, the same code SHA — so that joins across runs are reliable. Inconsistent logging hurts more than no logging, because it makes the lookups look answered when they aren't.
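A shared schema is easier to enforce in code than by convention. The sketch below refuses to start a run whose config is missing a required field; the field names are hypothetical stand-ins for whatever your team agrees on, and the check is plain Python rather than a feature of any tracker.

```python
# Hypothetical schema: every run must carry all of these fields.
REQUIRED_FIELDS = (
    "dataset_path", "model_id", "learning_rate", "lr_schedule", "batch_size",
    "lora_rank", "lora_alpha", "lora_dropout", "precision", "max_seq_length",
    "eval_steps", "seed", "git_sha", "dataset_hash",
)


def missing_fields(run_config: dict, required=REQUIRED_FIELDS) -> list:
    """Return the required fields that are absent or set to None, in schema order."""
    return [name for name in required if run_config.get(name) is None]


def assert_loggable(run_config: dict) -> None:
    """Fail fast, before any GPU time is spent, if the config breaks the schema."""
    missing = missing_fields(run_config)
    if missing:
        raise ValueError(f"run config is missing required fields: {', '.join(missing)}")
```

Calling `assert_loggable(config)` as the first line of the training entry point makes an incomplete record a crash at second zero instead of a silent gap discovered weeks later.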
Key idea
Pick a tracker (MLflow if you can self-host, W&B if you want zero-ops, TensorBoard as a backup), log config + git SHA + dataset hash + per-step loss + per-step eval + system metrics + artefacts on every run, and target replicability, not reproducibility. The reward isn't the dashboard — it's answering "why did this regress?" three weeks later in two clicks instead of two days.
That closes Track 2 — Hands-on. You can now fine-tune SLMs in code, evaluate them honestly across deterministic, judge-based, and benchmark axes, and keep a record of what you did so the next-week-you can build on the this-week-you. Track 3 takes the same pipeline through BrewSLM and shows what the platform automates and what's worth the trade.
Key terms
- Experiment tracking
- The bookkeeping layer that logs the inputs, intermediate metrics, and outputs of every training run so they can be compared, joined, and re-run later.
- MLflow
- Open-source experiment-tracking platform; self-hosted; tracks runs, parameters, metrics, artefacts, and a model registry. The Apache-2.0-licensed standard.
- Weights & Biases (W&B)
- Hosted experiment-tracking product with the most polished UI, sweeps for hyperparameter search, and team report-sharing. Free tier for individuals; paid for teams.
- TensorBoard
- Offline, file-based experiment viewer from the TensorFlow project; PyTorch writes to it through torch.utils.tensorboard. Writes event files any compatible viewer can render. Minimal: no run registry or artefacts.
- Run metadata
- The non-metric record of a run: config, git SHA, dataset hash, library versions, system info. The piece that makes runs joinable later.
- Dataset hash
- A content-addressed hash of the actual dataset bytes used by a run. The fastest way to know whether two runs trained on identical data.
- Reproducible
- Same inputs → bit-for-bit identical outputs. Hard in ML; requires strict pinning of seed, libraries, hardware, and precision.
- Replicable
- Same recipe → equivalent outcome within noise. The practical bar most ML workflows can meet and what experiment tracking should target.
- Sweep
- A hyperparameter search across a configured space (grid, random, Bayesian), each point a child run of a parent. Native to W&B; manual or tooling-based on MLflow.