Track 1 · SFT fundamentals · Lesson 7

Data quality I: dedup, balance, leakage, and splits

After this lesson you can prepare a clean SFT dataset: deduplicate, balance classes, split into train/validation/test correctly, and prevent leakage that would make your metrics lie.

Level: beginner · Read time: ~10 min · Prerequisites: Tokenization in practice: padding, truncation, packing

If you remember one thing from this track, make it this: data quality beats hyperparameter tuning, almost every time. A clean, representative dataset on default settings usually outperforms a messy dataset with perfectly tuned knobs. This lesson is the unglamorous, decisive work.

Deduplication

Duplicate or near-duplicate examples skew training toward whatever is repeated and inflate your evaluation if the duplicate lands in both train and test. Deduplicate exact and near-exact rows before anything else. It's cheap and it routinely removes a surprising fraction of scraped or merged datasets.
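A minimal sketch of that first pass, assuming each row is a dict with a text field (the `prompt` key and the sample rows are hypothetical). It hashes a normalized form of the text, so it catches exact duplicates and rows that differ only in case, punctuation, or spacing. It does not catch paraphrases; those need a similarity method such as MinHash or embedding distance.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def dedup(rows: list[dict], key: str = "prompt") -> list[dict]:
    """Keep the first row seen for each normalized value of row[key]."""
    seen: set[str] = set()
    kept = []
    for row in rows:
        digest = hashlib.sha256(normalize(row[key]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept


# Hypothetical rows: the first two differ only in case, spacing, punctuation.
rows = [
    {"prompt": "Reset my password", "label": "account"},
    {"prompt": "reset  my password!", "label": "account"},
    {"prompt": "Where is my order?", "label": "shipping"},
]
clean = dedup(rows)
print(len(rows), "->", len(clean))  # 3 -> 2
```

Run this on the full dataset before splitting, so a duplicate cannot end up on both sides of a split.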

Balance

For classification especially, class balance matters. If 95% of your examples carry one label, the model can score 95% accuracy by always guessing that label while learning nothing useful. Aim for a reasonable balance, or be deliberate about the imbalance you keep, and measure per-class metrics, not just overall accuracy. Imbalance isn't always wrong, but unexamined imbalance is a trap.
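The 95% trap is easy to see in numbers. The sketch below uses hypothetical labels (`ok`, `fraud`) and a degenerate model that always predicts the majority class; overall accuracy looks fine while recall on the minority class is zero.

```python
from collections import Counter


def majority_baseline(labels: list[str]) -> float:
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def per_class_recall(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Fraction of each true class that was predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / n for label, n in total.items()}


y_true = ["ok"] * 95 + ["fraud"] * 5
y_pred = ["ok"] * 100  # a model that always guesses "ok"

print(majority_baseline(y_true))         # 0.95
print(per_class_recall(y_true, y_pred))  # {'ok': 1.0, 'fraud': 0.0}
```

Comparing a trained model against the majority baseline is a quick check on whether its accuracy reflects anything learned.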

The train / validation / test split

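Partition your data into three disjoint sets:

Train: the examples the model learns from.
Validation: held-out examples you check during training.
Test: held-out examples used for a final, one-time estimate.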

The validation and test sets exist so you measure generalization, not memorization. (The most trusted slice of all — the curated gold set — gets its own lesson next.)

The silent killer: leakage

Data leakage is when information from your test set sneaks into training — the same example (or a near-duplicate, or a row from the same source document) appears in both. The result: gorgeous eval numbers that evaporate in production, because the model was tested on what it had already seen. Deduplicate across splits, and split by the right unit (e.g. by document or user, not by row) so related rows can't straddle the boundary.
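One way to split by the right unit is to hash a group identifier and derive the split from the hash, so every row from the same document or user lands in the same split. This is a sketch under assumptions: the `doc_id` field, the 80/10/10 proportions, and the helper names are all hypothetical, and with few groups the realized proportions will only be approximate.

```python
import hashlib


def assign_split(group_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a group (document, user, ...) to a split, deterministically."""
    digest = hashlib.sha256(group_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_by_group(rows: list[dict], group_key: str = "doc_id") -> dict[str, list[dict]]:
    """Route each row to the split chosen for its group."""
    splits: dict[str, list[dict]] = {"train": [], "validation": [], "test": []}
    for row in rows:
        splits[assign_split(str(row[group_key]))].append(row)
    return splits


def leaked_groups(splits: dict[str, list[dict]], group_key: str = "doc_id") -> set:
    """Return groups that appear in more than one split."""
    seen: dict = {}
    for name, part in splits.items():
        for row in part:
            seen.setdefault(row[group_key], set()).add(name)
    return {group for group, names in seen.items() if len(names) > 1}


# Hypothetical data: 500 chunks drawn from 50 source documents.
rows = [{"doc_id": f"doc-{i % 50}", "text": f"chunk {i}"} for i in range(500)]
splits = split_by_group(rows)

assert not leaked_groups(splits)  # no document straddles a boundary
print({name: len(part) for name, part in splits.items()})
```

Because the assignment depends only on the group identifier, rows added later from an existing document stay in that document's split. The `leaked_groups` check is also useful on splits you inherited rather than built.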

Representativeness

Finally, your data should look like the inputs the model will actually face. A dataset of tidy, well-formed examples produces a model that's great on tidy inputs and brittle on the messy reality. Include the hard cases, the edge cases, and the formats real users will send. Representativeness is what makes the metric you compute later actually predict production behavior.

Key idea

Clean data, honest splits, no leakage. Get these right and ordinary training settings will carry you a long way; get them wrong and no amount of tuning will save the run.

Building the harder data

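"Representative" data is necessary but not sufficient. A dataset of normal examples teaches the model the average case; what makes a fine-tune actually robust is including the examples that pressure the model's decisions. Four sub-categories worth deliberate effort:

Hard negatives: examples that look like the other class on the surface.
Ambiguous cases: examples a thoughtful human would pause on, labelled per a stated rule.
Refusal data: inputs the model should not answer the usual way, paired with the desired refusal.
OOD (out-of-distribution) inputs: inputs outside the trained scope, labelled with the desired "I don't know".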

None of these need to be huge. Twenty hard negatives plus twenty ambiguous cases plus a handful of refusal/OOD examples is enough to move the decision boundary measurably on a small dataset. Skipping them is the most common silent quality cap on SFT projects.
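A small audit can confirm the hard examples are actually present. The sketch assumes each row carries a hypothetical `kind` tag. The targets echo the rough sizes mentioned above, with "a handful" read as five; treat them as placeholders to adjust per project.

```python
from collections import Counter

# Hypothetical targets; "a handful" is interpreted here as 5.
TARGETS = {"hard_negative": 20, "ambiguous": 20, "refusal": 5, "ood": 5}


def coverage_gaps(rows: list[dict], targets: dict[str, int] = TARGETS) -> dict[str, int]:
    """Return how many more examples each under-filled category needs."""
    counts = Counter(row.get("kind", "normal") for row in rows)
    return {
        kind: need - counts[kind]
        for kind, need in targets.items()
        if counts[kind] < need
    }


# Hypothetical dataset: untagged rows count as "normal".
rows = (
    [{"kind": "hard_negative"}] * 20
    + [{"kind": "ambiguous"}] * 12
    + [{"kind": "refusal"}] * 5
    + [{"text": "an ordinary example"}] * 200
)
print(coverage_gaps(rows))  # {'ambiguous': 8, 'ood': 5}
```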

Key terms

Deduplication
Removing exact/near-duplicate examples before training and across splits.
Class balance
Avoiding one label dominating, so accuracy reflects real skill; measure per-class metrics.
Train/validation/test split
Disjoint sets for learning, in-training checking, and a final one-time estimate.
Data leakage
Test information appearing in training (duplicates, same-source rows), inflating metrics dishonestly.
Representativeness
Training data resembling the real inputs the model will face, including hard/edge cases.
Hard negative
An example that resembles the other class on the surface; placed on the correct side of the decision boundary to teach the model where the line is.
Ambiguous case
An example a thoughtful human would have to pause on; labelled per a stated rule so the model learns the rule, not only the obvious cases.
Refusal data
Examples the model should not answer the usual way — out-of-scope, harmful, or needing clarification — paired with the desired refusal phrasing.
OOD (out of distribution)
Inputs outside the trained scope, labelled with the desired "I don't know" response so the model learns its limits.
