Track 1 · SFT fundamentals · Lesson 8

Data quality II: gold sets

After this lesson you can explain what a gold set is, how it differs from the test split, how to build a trustworthy one, and why every quality decision should be checked against it.

Level: beginner · Read time: ~8 min · Prerequisites: Data quality I: dedup, balance, leakage, and splits

The previous lesson split your data so you could measure generalization. This lesson is about the most important measurement artifact of all: the gold set. If your project has one thing that is beyond reproach, it should be this.

What a gold set is

A gold set is a small, carefully curated collection of examples with known-correct answers (ground truth) that the model never trains on. Its only job is to answer the question "is the model good, and is this version better than the last?" Where a test split is often sampled automatically, a gold set is deliberately constructed and trusted: every label is checked, and every edge case you care about is included on purpose.
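To make the "never trains on" rule concrete, here is a minimal sketch of a leak check. It assumes each example is a dict with hypothetical `prompt` and `answer` fields, and that two prompts count as the same when they match after lowercasing and whitespace collapsing. Real projects often need fuzzier matching than this.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def find_leaks(gold: list[dict], train: list[dict]) -> list[dict]:
    """Return gold examples whose normalized prompt also appears in the training data."""
    train_prompts = {normalize(ex["prompt"]) for ex in train}
    return [ex for ex in gold if normalize(ex["prompt"]) in train_prompts]


# Hypothetical toy data, for illustration only.
gold_set = [
    {"prompt": "What is the capital of France?", "answer": "Paris"},
    {"prompt": "Convert 3 km to meters.", "answer": "3000 m"},
]
train_set = [
    {"prompt": "what is the capital of  France?", "answer": "Paris"},
    {"prompt": "Name a prime number greater than 10.", "answer": "11"},
]

# Any gold example listed here should be removed from the training data.
leaks = find_leaks(gold_set, train_set)
```

Running a check like this before every training run is one way to catch a gold example that slipped into the training data through a later data refresh.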

Why it's your north star

Every meaningful question in a fine-tuning project is answered against the gold set: Did this data change help? Is checkpoint B better than A? Is it good enough to ship? Is the larger model worth its cost? Without a fixed, trustworthy gold set, "better" becomes a vibe. With one, it becomes a number you can defend.
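As an illustration of "a number you can defend", the sketch below scores two checkpoints against the same gold answers. Exact string match is an assumed metric chosen for simplicity, and the prediction lists are hypothetical stand-ins for real model outputs. Your task may need a different scoring rule.

```python
def accuracy(predictions: list[str], gold_answers: list[str]) -> float:
    """Fraction of predictions that exactly match the gold answer (after trimming)."""
    if not gold_answers:
        raise ValueError("gold set is empty")
    if len(predictions) != len(gold_answers):
        raise ValueError("need exactly one prediction per gold example")
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)


# Hypothetical gold answers and checkpoint outputs, in the same order.
gold_answers = ["Paris", "3000 m", "11", "blue"]
preds_checkpoint_a = ["Paris", "300 m", "11", "red"]
preds_checkpoint_b = ["Paris", "3000 m", "11", "red"]

score_a = accuracy(preds_checkpoint_a, gold_answers)
score_b = accuracy(preds_checkpoint_b, gold_answers)
b_is_better = score_b > score_a
```

Because a gold set is small, a difference of one or two examples can be noise, so treat close scores with caution.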

Key idea

The gold set is the project's source of truth. Keep it fixed across experiments so results are comparable, keep it clean so the numbers mean something, and never, ever train on it — the moment you do, your metrics become fiction.
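One way to enforce "keep it fixed" is to record a fingerprint of the gold set next to every result, so that two scores are only compared when the fingerprints match. This is a sketch under the assumption that examples are JSON-serializable dicts. The function name `gold_fingerprint` is hypothetical.

```python
import hashlib
import json


def gold_fingerprint(gold: list[dict]) -> str:
    """Return a SHA-256 hex digest that changes if any example or label changes.

    Each example is serialized with sorted keys, and the rows are sorted,
    so the digest does not depend on key order or example order.
    """
    rows = sorted(json.dumps(ex, sort_keys=True, ensure_ascii=False) for ex in gold)
    return hashlib.sha256("\n".join(rows).encode("utf-8")).hexdigest()


# Hypothetical toy data, for illustration only.
gold_v1 = [
    {"prompt": "What is the capital of France?", "answer": "Paris"},
    {"prompt": "Convert 3 km to meters.", "answer": "3000 m"},
]
gold_reordered = list(reversed(gold_v1))
gold_edited = [dict(gold_v1[0], answer="Lyon"), gold_v1[1]]

same_after_reorder = gold_fingerprint(gold_v1) == gold_fingerprint(gold_reordered)
same_after_edit = gold_fingerprint(gold_v1) == gold_fingerprint(gold_edited)
```

If you do need to correct a label, the fingerprint changes, which is a useful signal that earlier scores should be recomputed before they are compared with new ones.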

Building a trustworthy one

Gold set vs test split

They overlap in spirit but differ in trust and use. The test split is a held-out sample for a final, broad estimate. The gold set is a hand-curated instrument you consult repeatedly to steer the project — the thing you stare at when deciding what to fix next. In practice the gold set is where data-centric iteration (Track 0, Lesson 9) gets its signal: you read the gold-set failures, fix the training data accordingly, and retrain.
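Reading the gold-set failures is easier when they are grouped. The sketch below assumes each gold example carries a hypothetical `tag` field naming the case it probes, and again uses exact match as an assumed scoring rule.

```python
from collections import defaultdict


def failures_by_tag(gold: list[dict], predictions: list[str]) -> dict[str, list[dict]]:
    """Group the examples the model got wrong by their tag."""
    if len(predictions) != len(gold):
        raise ValueError("need exactly one prediction per gold example")
    grouped = defaultdict(list)
    for ex, pred in zip(gold, predictions):
        if pred.strip() != ex["answer"].strip():
            grouped[ex["tag"]].append(
                {"prompt": ex["prompt"], "expected": ex["answer"], "got": pred}
            )
    return dict(grouped)


# Hypothetical toy data, for illustration only.
gold = [
    {"prompt": "What is the capital of France?", "answer": "Paris", "tag": "geography"},
    {"prompt": "Convert 3 km to meters.", "answer": "3000 m", "tag": "units"},
    {"prompt": "Convert 2 h to minutes.", "answer": "120 min", "tag": "units"},
]
predictions = ["Paris", "300 m", "2 min"]

failures = failures_by_tag(gold, predictions)
for tag, items in failures.items():
    print(f"{tag}: {len(items)} failure(s)")
```

A cluster of failures under one tag suggests where the training data may need more or better examples, while the gold set itself stays unchanged.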

With clean data and a trustworthy gold set in hand, you're ready to actually train. The next lessons open up the training loop, the loss, and the knobs that govern it.

Key terms

Gold set
A small, curated, never-trained-on set with verified answers; the source of truth for quality.
Ground truth
The known-correct answer for an example.
Held-out evaluation
Measuring on data the model didn't train on, to estimate generalization.
Coverage
Including common and hard/edge cases so the gold set probes real weaknesses.
Label quality
The correctness of the gold answers; wrong labels give confidently wrong guidance.
