Build a tiny SFT dataset
After this lesson you can build an SFT dataset as (prompt, completion) pairs, load it into a datasets.Dataset, check class balance, and make a train/validation split.
The model is loaded; now it needs examples. We'll use a tiny, hand-built sentiment classification dataset so the whole pipeline is fast and inspectable. In a real project you'd have hundreds to thousands of rows; the mechanics are identical.
The shape: (prompt, completion)
From Track 1: an SFT example is a prompt paired with the target completion. For classification the completion is just the label. We keep the prompt instruction identical across examples so the model learns the task, not the wording.
raw = [
    {"prompt": "Classify the sentiment as positive or negative: I loved this movie.", "completion": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: A complete waste of time.", "completion": "negative"},
    {"prompt": "Classify the sentiment as positive or negative: Best purchase I've made all year.", "completion": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: It broke after one use.", "completion": "negative"},
    {"prompt": "Classify the sentiment as positive or negative: Absolutely delightful from start to finish.", "completion": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: I want a refund.", "completion": "negative"},
    # ... in practice, hundreds more, covering the hard and edge cases
]
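One way to keep the instruction identical across rows is to build every prompt from a single template constant instead of retyping it. The sketch below is a minimal illustration of that idea; the names INSTRUCTION, ALLOWED_LABELS, make_example, and labeled_texts are hypothetical and not part of the lesson's own code.

```python
# Hypothetical helper: build every prompt from one shared instruction string
# so the wording cannot drift between examples.
INSTRUCTION = "Classify the sentiment as positive or negative: "
ALLOWED_LABELS = {"positive", "negative"}


def make_example(text: str, label: str) -> dict:
    """Return one (prompt, completion) row for the sentiment task."""
    if label not in ALLOWED_LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    return {"prompt": INSTRUCTION + text.strip(), "completion": label}


# Raw (text, label) pairs; the instruction is added in exactly one place.
labeled_texts = [
    ("I loved this movie.", "positive"),
    ("A complete waste of time.", "negative"),
]
rows = [make_example(text, label) for text, label in labeled_texts]
print(rows[0]["prompt"])
```

Building rows this way also rejects a mistyped label at construction time, before it can reach training.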
Wrap it in a datasets.Dataset
The datasets library gives us a fast, memory-mapped table with .map(), .filter(), and splitting — the standard container the Trainer expects.
from datasets import Dataset
ds = Dataset.from_list(raw)
print(ds)
print(ds[0])
Check class balance
Track 1 warned that imbalance lets a model cheat by always predicting the majority label. Counting the labels takes one call:
from collections import Counter
print(Counter(ds["completion"])) # e.g. Counter({'positive': 3, 'negative': 3})
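The raw counts can be turned into a simple pass/fail check. If 90% of rows were positive, a model that always answers "positive" would score 90% accuracy without learning anything. The sketch below works on a plain list of labels (such as the completion column); the function names and the 0.6 threshold are assumptions chosen for illustration, not values from the lesson.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's share of the dataset as a fraction in [0, 1]."""
    if not labels:
        raise ValueError("no labels to count")
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def is_balanced(labels, max_share=0.6):
    """True if no single label exceeds max_share of the rows (assumed threshold)."""
    return max(label_shares(labels).values()) <= max_share


labels = ["positive", "negative", "positive", "negative", "positive", "negative"]
print(label_shares(labels))
print(is_balanced(labels))
```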
Apply what you learned
Before training, also deduplicate (no identical prompts), confirm labels are correct, and make sure the examples resemble the inputs you'll really see. Garbage in, garbage out — the data lessons of Track 1 are where most of your quality comes from.
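Two of these checks can be partly automated. The sketch below drops rows whose prompt repeats an earlier one (after normalizing case and whitespace) and rows whose label is outside the allowed set. It is a plain-Python illustration with hypothetical names and toy rows. It can only catch labels that are invalid, not labels that are valid but wrong, so reading the examples by hand is still needed.

```python
def normalize(prompt: str) -> str:
    """Collapse whitespace and case so trivially different prompts compare equal."""
    return " ".join(prompt.lower().split())


def clean_rows(rows, allowed_labels):
    """Drop rows with duplicate prompts or labels outside allowed_labels."""
    seen = set()
    kept = []
    for row in rows:
        key = normalize(row["prompt"])
        if key in seen or row["completion"] not in allowed_labels:
            continue
        seen.add(key)
        kept.append(row)
    return kept


# Toy rows for illustration only.
rows = [
    {"prompt": "Classify: I loved it.", "completion": "positive"},
    {"prompt": "classify:  I loved it.", "completion": "positive"},  # duplicate after normalizing
    {"prompt": "Classify: It broke.", "completion": "negativ"},  # misspelled label
    {"prompt": "Classify: I want a refund.", "completion": "negative"},
]
cleaned = clean_rows(rows, allowed_labels={"positive", "negative"})
print(len(cleaned), "of", len(rows), "rows kept")
```

In a real project it is usually better to log the dropped rows than to discard them silently, since a misspelled label is often a fixable example.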
Split into train and validation
We hold out a slice to watch for overfitting during training (Track 1). datasets makes this one call:
split = ds.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = split["train"], split["test"]
print(len(train_ds), "train /", len(val_ds), "val")
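With so few rows, a purely random split can leave the validation slice with only one label. A stratified split avoids that by sampling from each label separately. The sketch below is a plain-Python version of the idea, not the datasets API; the function name, the toy rows, and the rule of holding out at least one row per label are assumptions made for illustration. (train_test_split also accepts a stratify_by_column argument, but it expects a ClassLabel column rather than plain strings.)

```python
import random
from collections import defaultdict


def stratified_split(rows, test_fraction=0.2, seed=42):
    """Split rows so that every label contributes at least one validation row."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["completion"]].append(row)

    rng = random.Random(seed)  # seeded so the split is reproducible
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        # At least one held-out row per label; a label with a single row
        # therefore ends up in validation only.
        n_val = max(1, round(len(group) * test_fraction))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


# Toy rows: three per label.
rows = [{"prompt": f"example {i}", "completion": label}
        for i, label in enumerate(["positive", "negative"] * 3)]
train_rows, val_rows = stratified_split(rows)
print(len(train_rows), "train /", len(val_rows), "val")
```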
Keep a separate, hand-checked gold set too (Track 1, Lesson 8) — a handful of examples you will never train on, used in Lesson 2.7 to judge whether fine-tuning actually helped. With data in hand, the next step is turning these strings into the token tensors the model consumes.
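A gold set is only useful if none of its prompts leak into training. The sketch below checks for that overlap using plain lists of dicts; the function names and the example rows are hypothetical and only illustrate the check.

```python
def normalize(prompt: str) -> str:
    """Collapse whitespace and case so near-identical prompts compare equal."""
    return " ".join(prompt.lower().split())


def find_leaks(train_rows, gold_rows):
    """Return gold prompts that also appear (after normalizing) in the training rows."""
    train_keys = {normalize(row["prompt"]) for row in train_rows}
    return [row["prompt"] for row in gold_rows
            if normalize(row["prompt"]) in train_keys]


# Hypothetical rows for illustration only.
train_rows = [
    {"prompt": "Classify the sentiment as positive or negative: I loved this movie.", "completion": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: I want a refund.", "completion": "negative"},
]
gold_rows = [
    {"prompt": "Classify the sentiment as positive or negative: The battery died in a day.", "completion": "negative"},
    {"prompt": "Classify the sentiment as positive or negative: i loved this movie.", "completion": "positive"},
]
leaks = find_leaks(train_rows, gold_rows)
print(leaks)
```

Any prompt this reports should be removed from one of the two sets before training, otherwise the gold-set score will overstate how well the model generalizes.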
Key idea
Data work is the real work. A Dataset of clean, balanced, representative (prompt, completion) pairs — plus an untouched gold set — is what makes the rest of the pipeline meaningful.
Key terms
- datasets.Dataset: A fast, memory-mapped table with map/filter/split that the Trainer consumes.
- from_list: Builds a Dataset from a list of dicts.
- class balance: Keeping labels roughly even so accuracy reflects real skill.
- train_test_split: Splits a Dataset into train and held-out portions.
- gold set: A small, never-trained-on set with verified answers, used to judge quality.