Why does ad-hoc training lose to bookkeeping every time?

Without tracking, every run is one-shot. The first time someone asks 'why did this regress?' three weeks after the fact, you can't reconstruct the config, the data split, or the code state — and you redo the comparison from scratch. With tracking, the question is two clicks.

What is the reproducible-vs-replicable distinction?

Reproducible means same input → bit-for-bit identical output (same model file, same eval number). Replicable means same recipe → equivalent outcome within noise. Most ML work is replicable, but it becomes reproducible only with strict seed, version, and library pinning. The distinction matters because tracking targets replicability (comparing runs), not bit-equivalence.

What's the minimum useful set of things to log on every run?

Full config (every hyperparameter), the git SHA of the code at run time, a hash of the dataset, per-step loss and eval, system metrics (GPU utilisation, memory), and the output artefacts (final checkpoint or its path). Anything less and you can't reliably compare two runs.

Track 2 · Hands-on · Lesson 18

Experiment tracking with MLflow & W&B

After this lesson you can pick an experiment-tracking tool that matches your situation (MLflow, W&B, or TensorBoard), log the minimum useful set of run metadata, point an HF Trainer at it with one flag, draw the reproducible-vs-replicable line so you know which one you're actually achieving, and answer "why did this regress?" three weeks later with two clicks instead of a forensic excavation.

Level: intermediate · Read time: ~10 min · Prerequisites: Public benchmarks & lm-eval-harness

Track 2 has spent the last several lessons making your evaluation honest — gold sets, real metrics, LLM-as-a-judge, public benchmarks. None of that helps if, three weeks from now, you can't answer the question "which run produced that checkpoint?" A training run is a vector of decisions: data split, learning rate, batch size, LoRA rank, evaluation cadence, code version, library version, GPU, seed. Forget any one of them and the next comparison is unreliable. Experiment tracking is the bookkeeping layer that makes the vector retrievable. It is also the cheapest insurance in the pipeline; teams that skip it are the ones who later say "the old model was better but we don't know why."

What ad-hoc loses to

The pattern that ends in tears: you tune a fine-tune to a good gate, ship the checkpoint, move on. A month later eval drops. Someone re-runs "the same recipe" — but the dataset has been mutated, a HF library bumped, the random seed shifted, the LR schedule subtly changed. The "regression" might be your code, your data, your tools, or all three. Without a record, the only honest answer is "I don't know."

Tracking turns that question into a join: previous run's config + current run's config → here are the deltas. Two clicks. The first time it saves you a day of forensic work, the setup cost has paid for itself.
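The "join" can be sketched in a few lines. This is a minimal illustration, not how MLflow or W&B implement their compare views: it assumes each run's config is available as a flat dict, and the function name, keys, and values are hypothetical.

```python
_MISSING = "<missing>"  # sentinel for keys that only one of the two runs logged


def diff_configs(prev: dict, curr: dict) -> dict:
    """Return {key: (previous_value, current_value)} for every key that differs."""
    deltas = {}
    for key in sorted(prev.keys() | curr.keys()):
        before = prev.get(key, _MISSING)
        after = curr.get(key, _MISSING)
        if before != after:
            deltas[key] = (before, after)
    return deltas


# Hypothetical configs for two runs of "the same recipe".
prev_run = {"lr": 2e-4, "lora_r": 16, "seed": 42, "dataset_hash": "a1b2c3", "transformers": "4.44.0"}
curr_run = {"lr": 2e-4, "lora_r": 16, "seed": 7, "dataset_hash": "d4e5f6", "transformers": "4.45.1"}

for key, (before, after) in diff_configs(prev_run, curr_run).items():
    print(f"{key}: {before} -> {after}")
```

Here the deltas are the dataset hash, the seed, and the library version: three suspects instead of an open-ended investigation. A key that one run never logged shows up as `<missing>`, which is itself a finding.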

The three tools in practice

MLflow — open-source, self-hosted, free

MLflow is the Apache-2.0-licensed open-source standard. Self-host the tracking server (a single process backed by a database and an artefact store), point runs at it with the MLFLOW_TRACKING_URI env var, and get a web UI for runs, metrics, artefacts, and the model registry. The right pick when you can't send run metadata to a third party (compliance, IP) or when you want a free tool that scales from a laptop to a shared team server. The UI is slightly less polished than W&B's, but MLflow is better suited to local-first workflows.

Weights & Biases — hosted, opinionated, free for individuals

W&B is a hosted product with a free tier for personal use and academic projects, and a paid tier for teams. The dashboards, sweeps (hyperparameter search), and report-sharing features are the most polished in the space. The right pick when you want zero-ops tracking and your data can leave the network. Lock-in is real: run history and artefacts live on W&B's servers, so moving away usually means exporting them or re-running everything.

TensorBoard — offline, minimal

TensorBoard is the original. It came out of TensorFlow, and PyTorch can write its format through torch.utils.tensorboard once the tensorboard package is installed. It writes event files to disk that any compatible viewer can render. No tracking server, no account; the viewer runs locally. Good for single-machine work and for a permanent backup of scalar metrics. The right pick when you just want loss curves on disk and don't need the run-registry / artefact / sweep tooling.

The combination most teams settle on: MLflow for the source of truth (config, artefacts, registry) + TensorBoard event files mirrored in the run directory as a portable backup. W&B becomes the right call if you specifically want sweeps or shared reports.

What to log on every run

The minimum useful set:

Full config — every hyperparameter, not just the ones you changed today. The dataset path, the model id, the LR, the schedule, the batch size, the LoRA rank/alpha/dropout, the precision, the max seq length, the eval cadence, the seed. Log the same config schema for every run; missing fields make later joins lie.
Code state — the git SHA at run time, and whether the working tree was dirty (uncommitted changes). A dirty run is forensically useless and you should know which runs were.
Dataset hash — a hash of the actual data the model trained on (every row, in order). If the dataset gets mutated and you can't reproduce a run, the hash is the first thing you check.
Per-step loss + eval — training loss and validation loss at the cadence the Trainer logs, plus any task metric you compute (Lessons 2.7, 2.12).
System metrics — GPU utilisation, GPU memory, host RAM, disk I/O. Knowing whether a run was GPU-bound vs IO-bound vs underutilised matters when you tune for throughput.
Output artefacts — the final checkpoint (or its path), the LoRA adapter, the eval results JSON, the lm-eval-harness output, the loss curve plot. Tracked artefacts beat "I uploaded it to a bucket somewhere."
Decoding sweeps you ran — if you tried temperature ∈ {0, 0.3, 0.7} on the same checkpoint, log each as a child run, not as one merged number.
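For the dataset hash in particular, "every row, in order" can be made concrete. The sketch below assumes the rows are JSON-serialisable dicts (for example, parsed JSONL lines); the function name and sample rows are hypothetical, and hashing the raw file bytes is an equally valid choice when the dataset is a single file.

```python
import hashlib
import json


def dataset_hash(rows, length=12):
    """Hash rows in order; editing, adding, dropping, or reordering a row changes the digest."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys makes the hash independent of key order inside a row
        encoded = json.dumps(row, sort_keys=True, ensure_ascii=False).encode("utf-8")
        # a length prefix keeps row boundaries unambiguous
        digest.update(len(encoded).to_bytes(8, "big"))
        digest.update(encoded)
    return digest.hexdigest()[:length]


rows = [
    {"text": "great coffee", "label": 1},
    {"text": "burnt and bitter", "label": 0},
]
print(dataset_hash(rows))                                        # short hex id to log as a run parameter
print(dataset_hash(rows) == dataset_hash(list(reversed(rows))))  # reordering changes the hash
```

Because the function consumes any iterable, it can stream a large file row by row instead of loading it into memory.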

Integration with the HF Trainer

The good news: HF's TrainingArguments has a one-flag integration with all three tools. The boilerplate stays in one place.

# MLflow: point at a local or remote tracking URI
import hashlib
import os
import subprocess

import mlflow
from transformers import TrainingArguments

os.environ["MLFLOW_TRACKING_URI"] = "http://localhost:5000"  # or your hosted URI
os.environ["MLFLOW_EXPERIMENT_NAME"] = "smollm2-sentiment"

args = TrainingArguments(
    output_dir="runs/sft-v3",
    report_to="mlflow",                         # or "wandb", "tensorboard", or a list
    run_name="sft-v3-lora-r16",                 # what you see in the UI
    logging_steps=10,                           # per-step loss cadence
    eval_strategy="steps",                      # named evaluation_strategy in older transformers releases
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    # ... the rest of your training config
)

# Log dataset hash + code SHA explicitly; Trainer doesn't do this for you.
# DATA_PATH and trainer are assumed to be defined earlier in your script.
with open(DATA_PATH, "rb") as f:
    dataset_hash = hashlib.sha256(f.read()).hexdigest()[:12]

# Starting the run yourself means run_name must be passed here;
# the Trainer's MLflow callback then logs into the already-active run.
with mlflow.start_run(run_name=args.run_name):
    mlflow.log_param("git_sha", subprocess.check_output(["git", "rev-parse", "HEAD"]).decode().strip())
    # non-empty `git status --porcelain` output means uncommitted changes
    mlflow.log_param("git_dirty", bool(subprocess.check_output(["git", "status", "--porcelain"])))
    mlflow.log_param("dataset_hash", dataset_hash)
    trainer.train()
    mlflow.log_artifact("runs/sft-v3/eval_results.json")

For W&B, set report_to="wandb" and os.environ["WANDB_PROJECT"] = "smollm2-sentiment". For TensorBoard, report_to="tensorboard" writes event files under output_dir/runs by default (here runs/sft-v3/runs), and tensorboard --logdir runs/ finds and renders them.

The reproducible-vs-replicable distinction

Two words often used interchangeably; tracking is honest only if you respect the difference.

Reproducible — same inputs → bit-for-bit identical outputs. Same checkpoint file, same eval number to the last decimal. Achievable in ML only with strict pinning of the random seed, the library versions, the GPU architecture (CUDA non-determinism on some ops), the data ordering, and the floating-point precision. Even with all of that, distributed training adds non-determinism most teams don't fully control.
Replicable — same recipe → equivalent outcome within noise. Re-run the recipe; the gold-set accuracy lands in the same confidence interval. This is what most real ML workflows can actually achieve.

Tracking should target replicability. The metadata you log makes it possible to re-run a recipe and verify the outcome is within noise. Chasing reproducibility (bit-equivalence) usually means pinning everything for appearances rather than for practical benefit, and it doesn't buy you what you want: confidence that the recipe, not one lucky run, produced the result. If your evaluation has a meaningful confidence interval, equivalence-within-noise is the right bar.
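One simple reading of "within noise" is to check whether the re-run's gold-set accuracy falls inside the original run's confidence interval. The sketch below uses a 95% Wilson score interval. It is an illustrative heuristic under stated assumptions (a single accuracy metric, the same gold set for both runs, hypothetical counts), not a substitute for a proper significance test.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for an accuracy of correct/total (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def replicates(orig_correct, rerun_correct, total):
    """True if the re-run's accuracy lies inside the original run's interval."""
    low, high = wilson_interval(orig_correct, total)
    return low <= rerun_correct / total <= high


# Hypothetical numbers: original run got 431/500 on the gold set.
low, high = wilson_interval(431, 500)
print(f"original: 0.862, 95% interval [{low:.3f}, {high:.3f}]")
print(replicates(431, 425, 500))  # re-run at 0.850
print(replicates(431, 400, 500))  # re-run at 0.800
```

With a 500-item gold set the interval is roughly six points wide, so a re-run at 0.850 counts as a replication and one at 0.800 does not. A smaller gold set widens the interval and makes the check less discriminating.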

Honest beat — a dashboard isn't a workflow

It's easy to set up MLflow or W&B, get a pretty chart, and feel rigorous. The work that actually pays off is the boring part: making sure every team member logs the same config schema, the same dataset hash, the same code SHA — so that joins across runs are reliable. Inconsistent logging hurts more than no logging, because it makes the lookups look answered when they aren't.
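A shared schema is easier to enforce in code than in a team agreement. As a minimal sketch, a check like the following could run before training starts; the field names are hypothetical and should be replaced with whatever your team standardises on.

```python
# Hypothetical team-wide schema: every run must log exactly these keys.
REQUIRED_FIELDS = frozenset({
    "model_id", "dataset_path", "dataset_hash", "git_sha", "git_dirty",
    "learning_rate", "lr_schedule", "batch_size",
    "lora_r", "lora_alpha", "lora_dropout",
    "precision", "max_seq_length", "eval_steps", "seed",
})


def schema_problems(run_config, required=REQUIRED_FIELDS):
    """Report keys the run failed to log and keys outside the agreed schema."""
    logged = set(run_config)
    return {
        "missing": sorted(required - logged),
        "unexpected": sorted(logged - required),
    }


# A config that forgot the seed and used a non-standard name for the learning rate.
config = {name: None for name in REQUIRED_FIELDS if name not in {"seed", "learning_rate"}}
config["lr"] = 2e-4
print(schema_problems(config))
```

Failing the run when "missing" is non-empty is cheap. The "unexpected" list catches the subtler problem of two people logging the same value under different names.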

Key idea

Pick a tracker (MLflow if you can self-host, W&B if you want zero-ops, TensorBoard as a backup), log config + git SHA + dataset hash + per-step loss + per-step eval + system metrics + artefacts on every run, and target replicability, not reproducibility. The reward isn't the dashboard — it's answering "why did this regress?" three weeks later in two clicks instead of two days.

That closes Track 2 — Hands-on. You can now fine-tune SLMs in code, evaluate them honestly across deterministic, judge-based, and benchmark axes, and keep a record of what you did so the next-week-you can build on the this-week-you. Track 3 takes the same pipeline through BrewSLM and shows what the platform automates and what's worth the trade.

Key terms

Experiment tracking: The bookkeeping layer that logs the inputs, intermediate metrics, and outputs of every training run so they can be compared, joined, and re-run later.
MLflow: Open-source experiment-tracking platform; self-hosted; tracks runs, parameters, metrics, artefacts, and a model registry. The Apache-2.0-licensed standard.
Weights & Biases (W&B): Hosted experiment-tracking product with the most polished UI, sweeps for hyperparameter search, and team report-sharing. Free tier for individuals; paid for teams.
TensorBoard: Offline, file-based experiment viewer that originated with TensorFlow; PyTorch writes its format through torch.utils.tensorboard. Writes event files any compatible viewer can render. Minimal, with no run registry or artefacts.
Run metadata: The non-metric record of a run: config, git SHA, dataset hash, library versions, system info. The piece that makes runs joinable later.
Dataset hash: A content-addressed hash of the actual dataset bytes used by a run. The fastest way to know whether two runs trained on identical data.
Reproducible: Same inputs → bit-for-bit identical outputs. Hard in ML; requires strict pinning of seed, libraries, hardware, and precision.
Replicable: Same recipe → equivalent outcome within noise. The practical bar most ML workflows can meet and what experiment tracking should target.
Sweep: A hyperparameter search across a configured space (grid, random, Bayesian), each point a child run of a parent. Native to W&B; manual or tooling-based on MLflow.
