Track 2 · Hands-on · Lesson 18

Experiment tracking with MLflow & W&B

After this lesson you can pick an experiment-tracking tool that matches your situation (MLflow, W&B, or TensorBoard), log the minimum useful set of run metadata, point an HF Trainer at it with one flag, draw the reproducible-vs-replicable line so you know which one you're actually achieving, and answer "why did this regress?" three weeks later with two clicks instead of a forensic excavation.

Level: intermediate · Read time: ~10 min · Prerequisites: Public benchmarks & lm-eval-harness

Track 2 has spent the last several lessons making your evaluation honest: gold sets, real metrics, LLM-as-a-judge, public benchmarks. None of that helps if, three weeks from now, you can't answer the question "which run produced that checkpoint?" A training run is a vector of decisions: data split, learning rate, batch size, LoRA rank, evaluation cadence, code version, library version, GPU, seed. Forget any one of them and the next comparison is unreliable. Experiment tracking is the bookkeeping layer that makes the vector retrievable. It is also the cheapest insurance in the pipeline; teams that skip it are the ones who later say "the old model was better but we don't know why."

Where ad-hoc tracking fails

The pattern that ends in tears: you tune a fine-tune until it passes a good gate, ship the checkpoint, move on. A month later eval drops. Someone re-runs "the same recipe", but the dataset has been mutated, an HF library was bumped, the random seed shifted, the LR schedule subtly changed. The "regression" might be your code, your data, your tools, or all three. Without a record, the only honest answer is "I don't know."

Tracking turns that question into a join: previous run's config + current run's config → here are the deltas. Two clicks. The first time it saves you a day of forensic work, the setup cost has paid for itself.
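The "join" can be sketched in a few lines. This is a minimal illustration, not what any tracker does internally: `diff_configs`, the `<missing>` sentinel, and both example configs are hypothetical, and a real tracker UI gives you the same comparison without code.

```python
from typing import Any, Dict, Tuple

_MISSING = "<missing>"  # sentinel for keys logged by only one of the two runs


def diff_configs(prev: Dict[str, Any], curr: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Return {key: (previous_value, current_value)} for every key that differs."""
    deltas = {}
    for key in sorted(prev.keys() | curr.keys()):
        old = prev.get(key, _MISSING)
        new = curr.get(key, _MISSING)
        if old != new:
            deltas[key] = (old, new)
    return deltas


# Hypothetical configs for two runs of "the same recipe".
prev_run = {"lr": 2e-4, "lora_rank": 16, "seed": 42,
            "dataset_hash": "a1b2c3d4e5f6", "transformers": "4.44.0"}
curr_run = {"lr": 2e-4, "lora_rank": 16, "seed": 7,
            "dataset_hash": "0f9e8d7c6b5a", "transformers": "4.45.1"}

for key, (old, new) in diff_configs(prev_run, curr_run).items():
    print(f"{key}: {old} -> {new}")
```

The diff is only as good as what both runs logged: a key that one run never recorded shows up as `<missing>` rather than as a real difference.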

The three tools in practice

MLflow — open-source, self-hosted, free

MLflow is the Apache-2.0-licensed open-source standard. Self-host the tracking server (a single process backed by a database and an artefact store), point runs at it with the MLFLOW_TRACKING_URI env var, and get a web UI for runs, metrics, artefacts, and a model registry. The right pick when you can't send run metadata to a third party (compliance, IP) or when you want a free tool that works at any scale. The UI is slightly less polished than W&B's, but MLflow is substantially better suited to local-first workflows.

Weights & Biases — hosted, opinionated, free for individuals

W&B is a hosted product with a free tier for personal use and academic projects, and a paid tier for teams. The dashboards, sweeps (hyperparameter search), and report-sharing features are the most polished in the space. The right pick when you want zero-ops tracking and your data can leave the network. Lock-in is real — moving away from W&B usually means re-running everything for the artefacts.

TensorBoard — offline, minimal

TensorBoard is the original. It comes from the TensorFlow project, PyTorch writes to it through torch.utils.tensorboard (with the tensorboard package installed), and it stores event files on disk that any compatible viewer can render. No tracking server, no account. Good for single-machine work and for a permanent backup of scalar metrics. The right pick when you just want loss curves on disk and don't need the run-registry / artefact / sweep tooling.

The combination most teams settle on: MLflow for the source of truth (config, artefacts, registry) + TensorBoard event files mirrored in the run directory as a portable backup. W&B becomes the right call if you specifically want sweeps or shared reports.

What to log on every run

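The minimum useful set: the full config (hyperparameters and library versions), the git SHA, the dataset hash, per-step training loss, per-step eval metrics, system metrics, and artefacts.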
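The static part of a run's record (config, git SHA, dataset hash, library versions, system info) can be gathered by one helper and logged at the start of every run. The sketch below uses only the standard library; `collect_run_metadata`, its field names, and the default package list are assumptions for illustration, not a schema any tracker requires. Per-step loss, eval metrics, and system metrics are left to the tracker integration.

```python
import importlib.metadata
import json
import platform
import subprocess
import sys
from typing import Optional


def _git(*args: str) -> Optional[str]:
    """Return stdout of a git command, or None if git is absent or this is not a repo."""
    try:
        out = subprocess.run(["git", *args], capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def _version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


def collect_run_metadata(config: dict, dataset_hash: str,
                         packages=("torch", "transformers", "peft")) -> dict:
    """Bundle the non-metric record of a run into one JSON-serialisable dict."""
    status = _git("status", "--porcelain")
    return {
        "config": dict(config),
        "git_sha": _git("rev-parse", "HEAD"),
        "git_dirty": None if status is None else bool(status),
        "dataset_hash": dataset_hash,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "libraries": {name: _version(name) for name in packages},
    }


# Hypothetical config and hash, just to show the shape of the record.
meta = collect_run_metadata({"lr": 2e-4, "lora_rank": 16, "seed": 42}, "a1b2c3d4e5f6")
print(json.dumps(meta, indent=2))
```

Writing the same dict to the tracker and to a JSON file in the run directory keeps the record readable even without the tracking server.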

Integration with the HF Trainer

The good news: HF's TrainingArguments has a one-flag integration with all three tools. The boilerplate stays in one place.

# MLflow: point at a local or remote tracking URI
import os

from transformers import TrainingArguments

os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"  # or your hosted URI
os.environ["MLFLOW_EXPERIMENT_NAME"] = "smollm2-sentiment"

args = TrainingArguments(
    output_dir="runs/sft-v3",
    report_to="mlflow",                         # or "wandb", "tensorboard", or a list
    run_name="sft-v3-lora-r16",                 # what you see in the UI
    logging_steps=10,                           # per-step loss cadence
    eval_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    # ... the rest of your training config
)

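# Log the code SHA and dataset hash explicitly; Trainer doesn't do this for you.
# DATA_PATH and trainer are assumed to be defined earlier in your script.
import hashlib
import subprocess
import mlflow

with open(DATA_PATH, "rb") as f:  # close the file handle once hashed
    dataset_hash = hashlib.sha256(f.read()).hexdigest()[:12]

# The run is opened here, before Trainer starts, so name it here as well.
with mlflow.start_run(run_name=args.run_name):
    mlflow.log_param("git_sha", subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip())
    mlflow.log_param("git_dirty", bool(subprocess.check_output(["git", "status", "--porcelain"])))
    mlflow.log_param("dataset_hash", dataset_hash)
    trainer.train()
    mlflow.log_artifact("runs/sft-v3/eval_results.json")  # assumes your eval step wrote this file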
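The snippet above reads the whole dataset file into memory before hashing it, which is fine for small files. For larger datasets, a streamed variant gives the same digest with constant memory. This is a sketch under assumptions: `dataset_hash`, the 1 MiB chunk size, and the 12-character truncation are illustrative choices, and the temporary file stands in for your real data.

```python
import hashlib
import tempfile
from pathlib import Path


def dataset_hash(path, chunk_size: int = 1 << 20, length: int = 12) -> str:
    """SHA-256 of a file's bytes, read in chunks, truncated to `length` hex chars."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read until it returns b"" (end of file).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()[:length]


# Demo on a throwaway file; in a real run, pass the path of your training data.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "train.jsonl"
    sample.write_bytes(b'{"text": "great", "label": 1}\n')
    print(dataset_hash(sample))
```

A truncated hash is convenient in a UI column; log the full digest as well if you want to rule out collisions.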

For W&B, swap in report_to="wandb" and set os.environ["WANDB_PROJECT"] = "smollm2-sentiment". For TensorBoard, report_to="tensorboard" writes event files under output_dir/runs by default, and tensorboard --logdir runs/ renders them.

The reproducible-vs-replicable distinction

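Two words often used interchangeably; tracking is honest only if you respect the difference. Reproducible means the same inputs give bit-for-bit identical outputs, which is hard in ML and requires strict pinning of seed, libraries, hardware, and precision. Replicable means the same recipe gives an equivalent outcome within noise.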

Tracking should target replicability. The metadata you log makes it possible to re-run a recipe and verify the outcome is within noise. Chasing reproducibility — bit-equivalence — usually means pinning the seed for political reasons rather than practical ones, and it doesn't buy you what you want. If your evaluation has a meaningful confidence interval, equivalence-within-noise is the right bar.
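"Within noise" needs a number before it can be checked. One simple way to get one, assuming you have re-run the baseline recipe a few times with different seeds, is to compare a new score against the spread of those runs. The function name, the two-standard-deviation threshold, and the scores below are assumptions for illustration; this is not a substitute for a proper confidence interval on your eval.

```python
import statistics
from typing import Sequence


def within_noise(baseline_scores: Sequence[float], new_score: float, k: float = 2.0) -> bool:
    """True if new_score lies within k sample standard deviations of the baseline mean."""
    if len(baseline_scores) < 2:
        raise ValueError("need at least two baseline runs to estimate noise")
    mean = statistics.mean(baseline_scores)
    std = statistics.stdev(baseline_scores)
    return abs(new_score - mean) <= k * std


# Hypothetical eval accuracies from re-running one recipe with four different seeds.
baseline = [0.912, 0.907, 0.915, 0.910]

print(within_noise(baseline, 0.909))  # a re-run that lands close to the baseline mean
print(within_noise(baseline, 0.871))  # a run that dropped by about four points
```

With only a handful of seeds the standard deviation is itself a rough estimate, so treat a borderline result as a prompt to run more seeds rather than as a verdict.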

Honest beat — a dashboard isn't a workflow

It's easy to set up MLflow or W&B, get a pretty chart, and feel rigorous. The work that actually pays off is the boring part: making sure every team member logs the same config schema, the same dataset hash, the same code SHA — so that joins across runs are reliable. Inconsistent logging hurts more than no logging, because it makes the lookups look answered when they aren't.
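A shared schema is easier to keep when something checks it before the run starts. The sketch below is one hypothetical way to do that: `REQUIRED_KEYS`, its key names, and `validate_run_params` are assumptions for illustration, and your team's schema will differ.

```python
from typing import Dict, List

# Hypothetical team-wide schema: every run must log exactly these params.
REQUIRED_KEYS: Dict[str, type] = {
    "git_sha": str,
    "git_dirty": bool,
    "dataset_hash": str,
    "seed": int,
    "learning_rate": float,
}


def validate_run_params(params: dict) -> List[str]:
    """Return a list of schema problems; an empty list means the params are loggable."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in params:
            problems.append(f"missing: {key}")
        elif not isinstance(params[key], expected):
            got = type(params[key]).__name__
            problems.append(f"wrong type for {key}: expected {expected.__name__}, got {got}")
    # Unknown keys usually mean a renamed field, e.g. "lr" instead of "learning_rate".
    for key in sorted(set(params) - set(REQUIRED_KEYS)):
        problems.append(f"unexpected: {key}")
    return problems


good = {"git_sha": "3f2a9c1", "git_dirty": False, "dataset_hash": "a1b2c3d4e5f6",
        "seed": 42, "learning_rate": 2e-4}
renamed = {**{k: v for k, v in good.items() if k != "learning_rate"}, "lr": 2e-4}

print(validate_run_params(good))
print(validate_run_params(renamed))
```

Failing the run when the list is non-empty is stricter than logging a warning, but it is what keeps cross-run joins reliable.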

Key idea

Pick a tracker (MLflow if you can self-host, W&B if you want zero-ops, TensorBoard as a backup), log config + git SHA + dataset hash + per-step loss + per-step eval + system metrics + artefacts on every run, and target replicability, not reproducibility. The reward isn't the dashboard — it's answering "why did this regress?" three weeks later in two clicks instead of two days.

That closes Track 2 — Hands-on. You can now fine-tune SLMs in code, evaluate them honestly across deterministic, judge-based, and benchmark axes, and keep a record of what you did so the next-week-you can build on the this-week-you. Track 3 takes the same pipeline through BrewSLM and shows what the platform automates and what's worth the trade.

Key terms

Experiment tracking
The bookkeeping layer that logs the inputs, intermediate metrics, and outputs of every training run so they can be compared, joined, and replicated later.
MLflow
Open-source experiment-tracking platform; self-hosted; tracks runs, parameters, metrics, and artefacts, and includes a model registry. Apache-2.0-licensed.
Weights & Biases (W&B)
Hosted experiment-tracking product with the most polished UI, sweeps for hyperparameter search, and team report-sharing. Free tier for individuals; paid for teams.
TensorBoard
Offline, file-based experiment viewer that PyTorch supports through torch.utils.tensorboard. Writes event files any compatible viewer can render. Minimal: no run registry or artefacts.
Run metadata
The non-metric record of a run: config, git SHA, dataset hash, library versions, system info. The piece that makes runs joinable later.
Dataset hash
A content-addressed hash of the actual dataset bytes used by a run. The fastest way to know whether two runs trained on identical data.
Reproducible
Same inputs → bit-for-bit identical outputs. Hard in ML; requires strict pinning of seed, libraries, hardware, and precision.
Replicable
Same recipe → equivalent outcome within noise. The practical bar most ML workflows can meet and what experiment tracking should target.
Sweep
A hyperparameter search across a configured space (grid, random, Bayesian), each point a child run of a parent. Native to W&B; manual or tooling-based on MLflow.
