Data quality I: dedup, balance, leakage, and splits
After this lesson you can prepare a clean SFT dataset: deduplicate, balance classes, split into train/validation/test correctly, and prevent leakage that would make your metrics lie.
If you remember one thing from this track, make it this: data quality beats hyperparameter tuning, almost every time. A clean, representative dataset on default settings usually outperforms a messy dataset with perfectly tuned knobs. This lesson is the unglamorous, decisive work.
Deduplication
Duplicate or near-duplicate examples skew training toward whatever is repeated and inflate your evaluation if the duplicate lands in both train and test. Deduplicate exact and near-exact rows before anything else. It's cheap and it routinely removes a surprising fraction of scraped or merged datasets.
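As a rough illustration of the deduplication step, here is a minimal sketch using only the Python standard library. It assumes each row is a dict with hypothetical `prompt` and `response` fields, and it treats two rows as duplicates when they match after lowercasing, stripping punctuation, and collapsing whitespace. That catches exact and trivially reformatted copies only; fuzzier near-duplicates (paraphrases, small edits) would need a similarity method such as MinHash, which is not shown here.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Fold Unicode form, case, punctuation, and whitespace so trivial variants match."""
    text = unicodedata.normalize("NFKC", text).lower()
    text = re.sub(r"[^\w\s]", "", text)  # drop punctuation
    return re.sub(r"\s+", " ", text).strip()


def dedupe(rows: list[dict]) -> list[dict]:
    """Keep the first row seen for each normalized (prompt, response) pair."""
    seen: set[str] = set()
    kept = []
    for row in rows:
        # "prompt" and "response" are assumed field names, not a required schema.
        joined = normalize(row["prompt"]) + "\x1f" + normalize(row["response"])
        key = hashlib.sha256(joined.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


rows = [
    {"prompt": "What is the refund window?", "response": "30 days."},
    {"prompt": "what is the refund window", "response": "30 days"},
    {"prompt": "How do I reset my password?", "response": "Use the reset link."},
]
clean = dedupe(rows)
```

Here the second row collapses into the first, leaving two rows. Running this before splitting means a repeated example cannot end up on both sides of a split.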
Balance
For classification especially, class balance matters. If 95% of your examples carry one label, the model can reach 95% accuracy by always guessing that label while learning nothing useful. Aim for a reasonable balance, or be deliberate about the imbalance you keep, and measure per-class metrics, not just overall accuracy. Imbalance isn't always wrong, but unexamined imbalance is a trap.
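The 95% trap can be made concrete with a small sketch. The labels `ok` and `fraud` and the helper names are illustrative assumptions; the point is only that overall accuracy and per-class recall can tell very different stories on the same predictions.

```python
from collections import Counter


def majority_baseline_accuracy(labels: list[str]) -> float:
    """Accuracy of a model that always predicts the most frequent label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)


def per_class_recall(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Fraction of each true class that was predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in total}


# 95 "ok" rows and 5 "fraud" rows, scored against a model that always says "ok".
y_true = ["ok"] * 95 + ["fraud"] * 5
y_pred = ["ok"] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
baseline = majority_baseline_accuracy(y_true)
recall = per_class_recall(y_true, y_pred)
```

Accuracy comes out at 0.95, identical to the majority baseline, while recall on `fraud` is 0.0. Comparing against the majority baseline first is a cheap way to tell whether a headline accuracy number means anything.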
The train / validation / test split
Partition your data into three disjoint sets:
- Train — what the model learns from.
- Validation — checked during training to watch for overfitting and pick the best checkpoint; the model never trains on it.
- Test — touched only at the very end, to estimate real-world performance once.
The validation and test sets exist so you measure generalization, not memorization. (The most trusted slice of all — the curated gold set — gets its own lesson next.)
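A minimal sketch of a three-way partition, assuming rows are independent of one another (when they are not, split by a grouping unit such as document or user). The 80/10/10 proportions, the seed, and the function name are arbitrary choices for illustration.

```python
import random


def three_way_split(rows, val_frac=0.1, test_frac=0.1, seed=13):
    """Shuffle once with a fixed seed, then slice into disjoint train/validation/test lists."""
    shuffled = list(rows)  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    n_val = int(len(shuffled) * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


rows = [{"id": i} for i in range(100)]
train, val, test = three_way_split(rows)
```

Fixing the seed keeps the split reproducible, so the test set stays the same set of rows from one run to the next.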
The silent killer: leakage
Data leakage is when information from your test set (or validation set) sneaks into training: the same example, a near-duplicate, or a row from the same source document appears on both sides. The result is gorgeous eval numbers that evaporate in production, because the model was tested on what it had already seen. Deduplicate across splits, and split by the right unit (for example by document or user, not by row) so related rows can't straddle the boundary.
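One way to split by unit is to assign each group, not each row, to a split by hashing its identifier. The sketch below assumes a hypothetical `doc_id` field and 10% validation and test shares; the helper names are illustrative. It also includes a check that reports any group found in more than one split.

```python
import hashlib


def assign_split(group_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Map a group id to a split deterministically via a hash bucket in [0, 100)."""
    bucket = int(hashlib.sha256(group_id.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "validation"
    return "train"


def split_by_group(rows, group_key="doc_id"):
    """Send every row of a group to the same split."""
    splits = {"train": [], "validation": [], "test": []}
    for row in rows:
        splits[assign_split(str(row[group_key]))].append(row)
    return splits


def leaked_groups(splits, group_key="doc_id"):
    """Return the group ids that appear in more than one split."""
    first_seen = {}
    leaked = set()
    for name, rows in splits.items():
        for row in rows:
            group = row[group_key]
            if first_seen.setdefault(group, name) != name:
                leaked.add(group)
    return leaked


# 500 chunks drawn from 50 source documents (doc_id is an assumed field name).
rows = [{"doc_id": f"doc-{i % 50}", "text": f"chunk {i}"} for i in range(500)]
splits = split_by_group(rows)
leaks = leaked_groups(splits)
```

Because assignment depends only on the group id, newly added rows from an existing document land in the same split as the old ones. The actual split sizes only approximate the target percentages, since whole groups move together.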
Representativeness
Finally, your data should look like the inputs the model will actually face. A dataset of tidy, well-formed examples produces a model that's great on tidy inputs and brittle on the messy reality. Include the hard cases, the edge cases, and the formats real users will send. Representativeness is what makes the metric you compute later actually predict production behavior.
Key idea
Clean data, honest splits, no leakage. Get these right and ordinary training settings will carry you a long way; get them wrong and no amount of tuning will save the run.
Building the harder data
"Representative" data is necessary but not sufficient. A dataset of normal examples teaches the model the average case; what makes a fine-tune actually robust is including the examples that pressure the model's decisions. Four sub-categories worth deliberate effort:
- Hard negatives. Examples that look like a positive but aren't, and vice versa. For a sentiment classifier, "the service was so slow I had time to read a book; loved it" (positive with negative surface markers); for a PII extractor, strings that look like PII but should not be extracted. Hard negatives are where the decision boundary lives; the model only learns where it is if you put examples on both sides.
- Ambiguous cases. Examples a thoughtful human would have to pause on. Label them according to the rule you want enforced ("when ambiguous, prefer X"); the model then learns the rule, not just the obvious cases. Without ambiguous data, the model picks up whichever rule the easy cases happened to imply, which may not be the rule you wanted.
- Refusal data. Examples the model should not answer in the usual way — out-of-scope requests, harmful prompts, requests that need a clarifying question instead. Pair each with the desired refusal phrasing. Without refusal data, an SFT model will try to answer everything, including things it shouldn't.
- OOD (out of distribution) examples. Inputs deliberately outside the trained scope, labelled with the desired "I don't know" or "this isn't a question I handle" response. These teach the model to recognise its own limits, which production inevitably tests.
None of these need to be huge. Twenty hard negatives, twenty ambiguous cases, and a handful of refusal and OOD examples are often enough to move the decision boundary measurably on a small dataset. Skipping them is one of the most common silent quality caps on SFT projects.
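A lightweight way to keep these categories from being skipped is to tag each example and audit the counts before training. The sketch below assumes a hypothetical `kind` field on each row and a minimum of one example per category; both are placeholders to adapt, not a standard.

```python
from collections import Counter

# Category tags are assumed names for the four sub-categories above.
REQUIRED_KINDS = ("hard_negative", "ambiguous", "refusal", "ood")


def undercovered(rows, minimum=1):
    """Return {kind: count} for every required kind with fewer than `minimum` examples."""
    counts = Counter(row.get("kind", "normal") for row in rows)
    return {kind: counts[kind] for kind in REQUIRED_KINDS if counts[kind] < minimum}


rows = [
    {"text": "Great product, works as described.", "kind": "normal"},
    {"text": "So slow I had time to read a book; loved it.", "kind": "hard_negative"},
    {"text": "It's fine, I guess.", "kind": "ambiguous"},
    {"text": "Write my tax return for me.", "kind": "refusal"},
]
gaps = undercovered(rows)
```

In this toy set the audit reports that no OOD examples are present. Raising `minimum` turns the same check into a target, such as twenty examples per category.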
Key terms
- Deduplication: Removing exact/near-duplicate examples before training and across splits.
- Class balance: Avoiding one label dominating, so accuracy reflects real skill; measure per-class metrics.
- Train/validation/test split: Disjoint sets for learning, in-training checking, and a final one-time estimate.
- Data leakage: Test information appearing in training (duplicates, same-source rows), inflating metrics dishonestly.
- Representativeness: Training data resembling the real inputs the model will face, including hard/edge cases.
- Hard negative: An example that resembles the other class on the surface; placed on the correct side of the decision boundary to teach the model where the line is.
- Ambiguous case: An example a thoughtful human would have to pause on; labelled per a stated rule so the model learns the rule, not the obvious cases only.
- Refusal data: Examples the model should not answer the usual way (out-of-scope, harmful, or needing clarification), paired with the desired refusal phrasing.
- OOD (out of distribution): Inputs outside the trained scope, labelled with the desired "I don't know" so the model learns its limits.