What defines a gold set?

A curated, verified set the model never trains on, used to measure quality

Why must the gold set stay fixed across experiments?

So results are comparable run-to-run

Track 1 · SFT fundamentals · Lesson 8

Data quality II: gold sets

After this lesson you can explain what a gold set is, how it differs from the test split, how to build a trustworthy one, and why every quality decision should be checked against it.

Level: beginner · Read time: ~8 min · Prerequisites: Data quality I: dedup, balance, leakage, and splits

The previous lesson split your data so you could measure generalization. This lesson is about the most important measurement artifact of all: the gold set. If your project has one thing that is beyond reproach, it should be this.

What a gold set is

A gold set is a small, carefully curated collection of examples with known-correct answers (ground truth) that the model never trains on. Its only job is to answer one question: is the model good, and is this version better than the last? Where a test split is often sampled automatically, a gold set is constructed by hand and trusted: every label is checked, and every edge case you care about is included on purpose.

Why it's your north star

Every meaningful question in a fine-tuning project is answered against the gold set: Did this data change help? Is checkpoint B better than A? Is it good enough to ship? Is the larger model worth its cost? Without a fixed, trustworthy gold set, "better" becomes a vibe. With one, it becomes a number you can defend.
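To make "a number you can defend" concrete, here is a minimal sketch of comparing two checkpoints on the same gold set. It assumes exact-match scoring, which only suits tasks with a single correct string; the names accuracy, standard_error, checkpoint_a, and checkpoint_b are hypothetical, and the tiny lists stand in for real model outputs.

```python
import math


def accuracy(predictions, gold_answers):
    """Fraction of gold examples answered exactly right (exact match is an assumption)."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold answers must align one-to-one")
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)


def standard_error(acc, n):
    """Binomial standard error of an accuracy measured on n examples."""
    return math.sqrt(acc * (1 - acc) / n)


# Toy stand-ins for the gold answers and two checkpoints' outputs.
gold_answers = ["4", "Paris", "blue", "7"]
checkpoint_a = ["4", "Paris", "red", "8"]
checkpoint_b = ["4", "Paris", "blue", "8"]

acc_a = accuracy(checkpoint_a, gold_answers)
acc_b = accuracy(checkpoint_b, gold_answers)
print(f"A: {acc_a:.2f}  B: {acc_b:.2f}  difference: {acc_b - acc_a:+.2f}")
```

The standard error gives a rough sense of how much of a gap could be noise. At 80% accuracy on 200 examples it is about 0.028, so a one-point difference between checkpoints would not be convincing on its own.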

Key idea

The gold set is the project's source of truth. Keep it fixed across experiments so results are comparable, keep it clean so the numbers mean something, and never, ever train on it — the moment you do, your metrics become fiction.
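One way to enforce "keep it fixed" is to record a fingerprint of the gold set and check it before every evaluation. The sketch below is one possible approach, not a prescribed one: it assumes each example is a JSON-serializable dict, and the function name gold_fingerprint and the prompt/answer fields are hypothetical.

```python
import hashlib
import json


def gold_fingerprint(examples):
    """Return a SHA-256 hex digest that changes if any example changes."""
    # Serialize with sorted keys, then sort the lines, so the digest
    # depends on the content of the set and not on example order.
    canonical = sorted(
        json.dumps(ex, sort_keys=True, ensure_ascii=False) for ex in examples
    )
    digest = hashlib.sha256()
    for line in canonical:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical gold examples.
gold = [
    {"prompt": "2 + 2 = ?", "answer": "4"},
    {"prompt": "Capital of France?", "answer": "Paris"},
]

# Record this once, next to your experiment results.
recorded = gold_fingerprint(gold)

# Before each evaluation, refuse to run if the gold set has drifted.
if gold_fingerprint(gold) != recorded:
    raise RuntimeError("gold set changed; results are no longer comparable")
```

If you do need to revise the gold set, for example to fix a wrong label, treat it as a new version with a new fingerprint and re-evaluate the checkpoints you want to compare.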

Building a trustworthy one

Coverage — include the common cases and the hard/edge cases that matter; the gold set should probe where the model is likely to fail, not just where it's easy.
Label quality — verify every answer; a gold set with wrong labels gives confidently wrong guidance. It's better to have 100 impeccable examples than 1,000 sloppy ones.
Size — big enough that the metric is stable run-to-run, small enough to curate by hand and inspect failures individually. A few dozen to a few hundred is common for a narrow task.
Isolation — store it separately and guard it from ever entering training (and from leakage via near-duplicates).
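The isolation check in the last point can be partly automated. The following is a minimal sketch that flags training prompts that are exact or near duplicates of gold prompts, using the standard library's difflib. The function names, the 0.9 threshold, and the sample prompts are assumptions for illustration; the pairwise comparison is quadratic, so it only suits small gold sets and modest training sets.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def find_leaks(train_prompts, gold_prompts, threshold=0.9):
    """Return (gold_prompt, train_prompt, similarity) for suspiciously close pairs."""
    leaks = []
    normalized_train = [(t, normalize(t)) for t in train_prompts]
    for g in gold_prompts:
        g_norm = normalize(g)
        for t, t_norm in normalized_train:
            similarity = SequenceMatcher(None, g_norm, t_norm).ratio()
            if similarity >= threshold:
                leaks.append((g, t, similarity))
    return leaks


# Hypothetical data: the first gold prompt is a reworded copy of a training prompt.
train_prompts = [
    "What is the capital of France?",
    "Summarize the following paragraph.",
]
gold_prompts = [
    "what is the capital of france",
    "Translate 'hello' into Spanish.",
]

leaks = find_leaks(train_prompts, gold_prompts)
for gold_prompt, train_prompt, similarity in leaks:
    print(f"{similarity:.2f}  gold={gold_prompt!r}  train={train_prompt!r}")
```

A string-similarity check like this catches copies and light rewordings, not paraphrases, so it complements manual review rather than replacing it.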

Gold set vs test split

They overlap in spirit but differ in trust and use. The test split is a held-out sample for a final, broad estimate. The gold set is a hand-curated instrument you consult repeatedly to steer the project — the thing you stare at when deciding what to fix next. In practice the gold set is where data-centric iteration (Track 0, Lesson 9) gets its signal: you read the gold-set failures, fix the training data accordingly, and retrain.
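Reading gold-set failures is easier when each example carries a tag for the case it probes. As a rough sketch, assuming you tag gold examples yourself and record a per-example correct flag (the field names id, tag, and correct are hypothetical), you could count failures per tag to decide what to fix first:

```python
from collections import Counter


def failures_by_tag(results):
    """Count failed gold examples per tag, most frequent first."""
    counts = Counter(r["tag"] for r in results if not r["correct"])
    return counts.most_common()


# Hypothetical per-example evaluation results on the gold set.
results = [
    {"id": "g1", "tag": "dates", "correct": False},
    {"id": "g2", "tag": "dates", "correct": False},
    {"id": "g3", "tag": "units", "correct": True},
    {"id": "g4", "tag": "negation", "correct": False},
]

summary = failures_by_tag(results)
for tag, count in summary:
    print(f"{tag}: {count} failed")
```

The counts only point you to a cluster. You still read the individual failures in that cluster to work out what the training data is missing.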

With clean data and a trustworthy gold set in hand, you're ready to actually train. The next lessons open up the training loop, the loss, and the knobs that govern it.

Key terms

Gold set: A small, curated, never-trained-on set with verified answers; the source of truth for quality.
Ground truth: The known-correct answer for an example.
Held-out evaluation: Measuring on data the model didn't train on, to estimate generalization.
Coverage: Including common and hard/edge cases so the gold set probes real weaknesses.
Label quality: The correctness of the gold answers; wrong labels give confidently wrong guidance.
