Check yourself

Q: Why keep the instruction wording identical across examples?
A: So the model learns the task, not incidental wording.

Q: What does ds.train_test_split(test_size=0.2) give you?
A: A train portion and a held-out portion to watch for overfitting.

Q: The gold set should be…
A: A small, verified set you never train on.

Track 2 · Hands-on · Lesson 3

Build a tiny SFT dataset

After this lesson you can build an SFT dataset as (prompt, completion) pairs, load it into a datasets.Dataset, check class balance, and make a train/validation split.

Level: beginner · Read time: ~9 min · Prerequisites: Load a base model and tokenizer

The model is loaded; now it needs examples. We'll use a tiny, hand-built sentiment classification dataset so the whole pipeline is fast and inspectable. In a real project you'd have hundreds to thousands of rows; the mechanics are identical.

The shape: (prompt, completion)

From Track 1: an SFT example is a prompt paired with the target completion. For classification the completion is just the label. We keep the prompt instruction identical across examples so the model learns the task, not the wording.

raw = [
    {"prompt": "Classify the sentiment as positive or negative: I loved this movie.", "completion": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: A complete waste of time.", "completion": "negative"},
    {"prompt": "Classify the sentiment as positive or negative: Best purchase I've made all year.", "completion": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: It broke after one use.", "completion": "negative"},
    {"prompt": "Classify the sentiment as positive or negative: Absolutely delightful from start to finish.", "completion": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: I want a refund.", "completion": "negative"},
    # ... in practice, hundreds more, covering the hard and edge cases
]
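One way to guarantee that the instruction wording stays identical is to build every row from a single template instead of typing the prompt by hand. The sketch below is an illustration only: INSTRUCTION, LABELS, and make_example are hypothetical names, not part of the lesson or of any library.

```python
# Hypothetical helper: the names here are illustrative assumptions.
INSTRUCTION = "Classify the sentiment as positive or negative: "
LABELS = {"positive", "negative"}


def make_example(text: str, label: str) -> dict:
    """Build one (prompt, completion) row using the shared instruction."""
    if label not in LABELS:
        # Catch typos such as "postive" before they reach training.
        raise ValueError(f"unknown label: {label!r}")
    return {"prompt": INSTRUCTION + text.strip(), "completion": label}


rows = [
    make_example("I loved this movie.", "positive"),
    make_example("It broke after one use.", "negative"),
]
print(rows[0]["prompt"])
```

Because the instruction lives in one constant, changing the wording later changes it for every example at once.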

Wrap it in a datasets.Dataset

The datasets library gives us a fast, Arrow-backed table (memory-mapped when loaded from disk) with .map(), .filter(), and splitting. It is the standard container the Trainer expects.

from datasets import Dataset

ds = Dataset.from_list(raw)
print(ds)
print(ds[0])

Check class balance

Track 1 warned that imbalance lets a model cheat, for instance by always predicting the majority label. Check the label counts with a Counter:

from collections import Counter
print(Counter(ds["completion"]))   # e.g. Counter({'positive': 3, 'negative': 3})
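On a larger dataset it can help to turn the raw counts into shares and flag a lopsided split automatically. The following is a minimal sketch in plain Python; label_shares, is_balanced, and the max_share threshold of 0.6 are assumptions chosen for illustration, not values from the lesson.

```python
from collections import Counter


def label_shares(labels):
    """Return each label's fraction of the dataset."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels given")
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}


def is_balanced(labels, max_share=0.6):
    """True if no single label exceeds max_share (an assumed threshold)."""
    return max(label_shares(labels).values()) <= max_share


balanced = ["positive", "negative"] * 3
skewed = ["positive"] * 9 + ["negative"]
print(label_shares(balanced))
print(is_balanced(balanced), is_balanced(skewed))
```

The functions accept any iterable of labels, so the completion column of the dataset can be passed in directly.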

Apply what you learned

Before training, also deduplicate (no identical prompts), confirm labels are correct, and make sure the examples resemble the inputs you'll really see. Garbage in, garbage out — the data lessons of Track 1 are where most of your quality comes from.
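The deduplication and label checks above can be partly automated. This sketch, which works on the plain list of dicts before it is wrapped in a Dataset, drops repeated prompts and rows whose label is outside the expected set. The helper names and the normalization rule (lowercase, collapsed whitespace) are assumptions. Note that code can only verify a label is valid, not that it is the right label for the text; that still needs a human read.

```python
def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so near-identical prompts match."""
    return " ".join(prompt.lower().split())


def clean_rows(rows, allowed_labels=("positive", "negative")):
    """Split rows into (kept, dropped): drop duplicates and invalid labels."""
    seen = set()
    kept, dropped = [], []
    for row in rows:
        key = normalize(row["prompt"])
        if key in seen or row["completion"] not in allowed_labels:
            dropped.append(row)
            continue
        seen.add(key)
        kept.append(row)
    return kept, dropped


sample = [
    {"prompt": "I loved this movie.", "completion": "positive"},
    {"prompt": "i loved  this movie.", "completion": "positive"},  # duplicate
    {"prompt": "I want a refund.", "completion": "negative"},
    {"prompt": "It was fine.", "completion": "neutral"},  # label not allowed
]
kept, dropped = clean_rows(sample)
print(len(kept), "kept /", len(dropped), "dropped")
```

Returning the dropped rows rather than discarding them silently makes it easy to inspect what was removed and why.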

Split into train and validation

We hold out a slice to watch for overfitting during training (Track 1). The datasets library does this in one call; the held-out portion comes back under the key "test", which we use as validation:

split = ds.train_test_split(test_size=0.2, seed=42)
train_ds, val_ds = split["train"], split["test"]
print(len(train_ds), "train /", len(val_ds), "val")
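A random split of a very small dataset can leave the validation slice with only one label. train_test_split also accepts a stratify_by_column argument for this (it expects the column to be a ClassLabel feature); the sketch below shows the underlying idea in plain Python instead, splitting each label group separately. The function stratified_split is a hypothetical helper, and its rounding rule (at least one validation row per label, never a whole group) is an assumption made for tiny datasets.

```python
import random
from collections import defaultdict


def stratified_split(rows, test_size=0.2, seed=42, label_key="completion"):
    """Split rows so each label appears in both train and validation."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    train, val = [], []
    for label in sorted(by_label):  # sorted for reproducibility
        group = by_label[label][:]
        rng.shuffle(group)
        if len(group) < 2:
            n_val = 0  # a single row stays in train
        else:
            n_val = min(len(group) - 1, max(1, round(len(group) * test_size)))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val


rows = [{"prompt": f"example {i}", "completion": label}
        for i, label in enumerate(["positive", "negative"] * 3)]
train_rows, val_rows = stratified_split(rows)
print(len(train_rows), "train /", len(val_rows), "val")
```

With three rows per label, this holds out one row of each label, so validation accuracy is measured on both classes.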

Keep a separate, hand-checked gold set too (Track 1, Lesson 8) — a handful of examples you will never train on, used in Lesson 2.7 to judge whether fine-tuning actually helped. With data in hand, the next step is turning these strings into the token tensors the model consumes.
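The "never train on" rule for the gold set can be checked mechanically before each training run. This is a minimal sketch assuming both sets are lists of dicts with a "prompt" key; find_leaks is a hypothetical helper, and matching on lowercased, whitespace-collapsed prompts is an assumed definition of "the same example".

```python
def find_leaks(train_rows, gold_rows):
    """Return gold rows whose prompt also appears in the training rows."""
    def key(row):
        return " ".join(row["prompt"].lower().split())

    train_keys = {key(r) for r in train_rows}
    return [g for g in gold_rows if key(g) in train_keys]


train_rows = [
    {"prompt": "I loved this movie.", "completion": "positive"},
    {"prompt": "It broke after one use.", "completion": "negative"},
]
gold_rows = [
    {"prompt": "The plot dragged badly.", "completion": "negative"},
    {"prompt": "I loved this movie.", "completion": "positive"},  # leaked
]
leaks = find_leaks(train_rows, gold_rows)
print(len(leaks), "gold example(s) also found in training data")
```

An empty result is the goal; any leaked row should be removed from the training data, not from the gold set.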

Key idea

Data work is the real work. A Dataset of clean, balanced, representative (prompt, completion) pairs — plus an untouched gold set — is what makes the rest of the pipeline meaningful.

Key terms

datasets.Dataset: A fast, Arrow-backed table (memory-mapped when loaded from disk) with map/filter/split that the Trainer consumes.
from_list: Builds a Dataset from a list of dicts.
class balance: Keeping labels roughly even so accuracy reflects real skill.
train_test_split: Splits a Dataset into train and held-out portions.
gold set: A small, never-trained-on set with verified answers, used to judge quality.
