Dataset formats in the wild
After this lesson you can recognise the common SFT dataset shapes — JSONL, completion, chat messages, Alpaca, ShareGPT, classification, extraction — pick the right one for a task, and convert between them so the trainer sees what it expects.
The dataset you receive is almost never the format the trainer wants. You'll meet a Hugging Face dataset in Alpaca format and need chat messages; you'll meet a JSONL of {prompt, completion} rows and need to wrap them in the chat template; you'll meet a tab-separated classification CSV and have to decide whether to convert at all. The conversions are mechanical — and silently wrong if you get them slightly off. This lesson is the field guide.
Why formats matter
SFT trainers (raw Trainer, TRL's SFTTrainer, BrewSLM's task handlers) all eat the same fundamental object — tokenised input_ids with a loss mask — but they accept their source data in different shapes. Picking the right shape means the trainer's built-in chat-template + loss-mask code works correctly without bespoke glue. Picking the wrong shape means you'll fight the trainer or build your own preprocessing — and the training silently optimises the wrong tokens.
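To make "input_ids with a loss mask" concrete, here is a minimal sketch. It assumes the common convention in which label positions set to -100 are skipped by the loss (the default ignore index of PyTorch's cross-entropy loss). The token ids are made up for illustration and do not come from a real tokenizer; build_example is a hypothetical helper name.

```python
# Toy illustration: token ids are invented, not produced by a real tokenizer.
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def build_example(prompt_ids, completion_ids):
    """Concatenate prompt and completion; mask the prompt out of the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    # Only completion positions keep real labels, so only they contribute to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}


example = build_example(prompt_ids=[101, 7, 8, 9], completion_ids=[42, 43, 2])
print(example["input_ids"])  # [101, 7, 8, 9, 42, 43, 2]
print(example["labels"])     # [-100, -100, -100, -100, 42, 43, 2]
```

If the mask is shifted by even one token, training still runs and the loss still falls, which is why a wrong data shape tends to fail silently.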
JSONL: the lingua franca
JSONL is one JSON object per line. It's not a schema — it's a container. Inside each line, you can put any of the shapes below. JSONL exists because it's streamable (the trainer can read one line at a time), line-recoverable (a corrupt row only kills its line), and easy to inspect with head.
{"prompt": "Classify: I loved it.", "completion": "positive"}
{"prompt": "Classify: Broken on arrival.", "completion": "negative"}
{"prompt": "Classify: Average.", "completion": "neutral"}
When in doubt, write your dataset as JSONL. The format question becomes "what fields inside each line?"
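The "line-recoverable" property can be sketched as a reader that parses each line independently and records the ones that fail. This is a minimal illustration using only the standard library; parse_jsonl is a hypothetical helper name, not part of any trainer.

```python
import json


def parse_jsonl(lines):
    """Parse an iterable of JSONL lines, skipping blanks and recording corrupt lines."""
    rows, bad_lines = [], []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            bad_lines.append(lineno)  # a corrupt row only costs its own line
    return rows, bad_lines


sample = [
    '{"prompt": "Classify: I loved it.", "completion": "positive"}',
    '{"prompt": "Classify: Broken on arrival.", "completion": ',  # truncated row
    '{"prompt": "Classify: Average.", "completion": "neutral"}',
]
rows, bad_lines = parse_jsonl(sample)
print(len(rows), bad_lines)  # 2 [2]
```

Because the function accepts any iterable of strings, an open file handle can be passed in directly and the file is read one line at a time.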
Completion format: simplest
Two fields per row: prompt and completion. Used in Track 2's by-hand SFT and many tutorials. Good for: classification, extraction, anything where the assistant's output is a single block of text and there's no multi-turn structure. Bad for: chat data, multi-turn conversations, role-based prompts.
{"prompt": "Summarise: The flight was delayed three hours...", "completion": "Three-hour delay; complaint resolved with voucher."}
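When a chat model has to train on completion-format rows, one plausible conversion is to wrap each row as a single user/assistant turn. The sketch below assumes that mapping; completion_to_chat and its optional system_prompt parameter are hypothetical names introduced for illustration.

```python
def completion_to_chat(row, system_prompt=None):
    """Wrap a {prompt, completion} row as a single-turn chat-messages row."""
    messages = []
    if system_prompt:
        # Optional: prepend a system turn if the task needs one.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": row["prompt"]})
    messages.append({"role": "assistant", "content": row["completion"]})
    return {"messages": messages}


row = {"prompt": "Classify: I loved it.", "completion": "positive"}
chat_row = completion_to_chat(row)
print(chat_row["messages"][-1])  # {'role': 'assistant', 'content': 'positive'}
```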
Chat messages format: the standard for chat models
A list of {role, content} objects. This is what an instruct model's apply_chat_template wants. The tokenizer renders the list to a string using the model's own chat template (for ChatML-style models, role markers such as <|im_start|>user), so you don't hand-format the special tokens: you describe the conversation and let the tokenizer do the rest.
{
  "messages": [
    {"role": "system", "content": "You are a customer-support agent. Be concise."},
    {"role": "user", "content": "My package never arrived."},
    {"role": "assistant", "content": "I'm sorry. Can you share the tracking number so I can investigate?"}
  ]
}
Use this for chat models, multi-turn tasks, anything where role matters. TRL's SFTTrainer accepts it directly.
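To show roughly what "rendering" means, here is a hand-rolled sketch of a ChatML-style layout. It is an illustration only: render_chatml is a hypothetical function, real templates differ between model families, and training code should call the tokenizer's apply_chat_template rather than anything like this.

```python
def render_chatml(messages):
    """Illustrative only: lay out messages as a ChatML-style string."""
    # Each turn is wrapped in start/end markers, with the role on the first line.
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


messages = [
    {"role": "user", "content": "My package never arrived."},
    {"role": "assistant", "content": "Can you share the tracking number?"},
]
print(render_chatml(messages))
```

The point of the sketch is the division of labour: the dataset stores only roles and content, and the model-specific markers are added at render time.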
Alpaca format: legacy, but you'll see it
From Stanford's original instruction-tuning dataset. Three fields: instruction, optional input, output. Lots of older HF datasets ship in this shape.
{
  "instruction": "Translate to French.",
  "input": "Good morning.",
  "output": "Bonjour."
}
Convert it to chat messages: instruction + "\n\n" + input becomes the user content, output becomes the assistant content. The conversion is mechanical; the easy bug is concatenating without the blank line and losing the boundary between instruction and input.
ShareGPT format: messy multi-turn
From early dumps of shared ChatGPT conversations. Multi-turn, with a quirky from/value field naming (instead of role/content).
{
  "conversations": [
    {"from": "human", "value": "What's the capital of France?"},
    {"from": "gpt", "value": "Paris."},
    {"from": "human", "value": "And of Germany?"},
    {"from": "gpt", "value": "Berlin."}
  ]
}
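Convert to chat messages: rename from: "human" → role: "user", from: "gpt" → role: "assistant", and value → content. Some dumps also carry from: "system" turns, which map to role: "system". Any other speaker name needs an explicit mapping decision rather than a silent default.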
Classification format: not a chat shape at all
For pure classification tasks the simplest shape is two fields: text and label. It maps onto the inputs-plus-labels pairs that scikit-learn and the HF evaluate library work with, and it's what the BrewSLM Classification handler ingests.
{"text": "Loved the service.", "label": "positive"}
{"text": "Never again.", "label": "negative"}
When you train a chat-style SLM on classification data, you (or the trainer) wrap each row into a chat-messages turn: "Classify this: {text}" / "{label}". When you train a small classification head separately (no SFT), you don't — the classification format is the input directly.
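The wrapping described above can be sketched as a small function. The "Classify this: " prefix comes from the text; the label check and the names classification_to_chat, labels and instruction are assumptions added for illustration.

```python
def classification_to_chat(row, labels, instruction="Classify this: "):
    """Wrap a {text, label} row as a one-turn chat-messages row."""
    if row["label"] not in labels:
        # Catch typos and stray labels before they become training targets.
        raise ValueError(f"unexpected label: {row['label']!r}")
    return {"messages": [
        {"role": "user", "content": f"{instruction}{row['text']}"},
        {"role": "assistant", "content": row["label"]},
    ]}


labels = {"positive", "negative", "neutral"}
chat_row = classification_to_chat({"text": "Loved the service.", "label": "positive"}, labels)
print(chat_row["messages"][0]["content"])  # Classify this: Loved the service.
```

Validating against a fixed label set is a design choice: a misspelled label would otherwise train the model to emit the misspelling.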
Extraction format: structured outputs
For span extraction (PII, named entities, structured data pulled from text), there are two common shapes. The span set, {text, spans: [{start, end, label}, ...]}, encodes character positions; in the example below start is inclusive and end is exclusive, so text[start:end] is the span. The field map, {text, fields: {invoice_number: "...", date: "..."}}, records only the extracted values. BrewSLM's StructuredExtraction handler supports both via scoring_mode.
{
  "text": "Contact jane.doe@example.com or 555-0123",
  "spans": [
    {"start": 8, "end": 28, "label": "EMAIL"},
    {"start": 32, "end": 40, "label": "PHONE"}
  ]
}
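Span offsets are easy to get wrong by one character, so it is worth slicing the text and checking what comes back. This is a minimal sketch, assuming character offsets with an inclusive start and an exclusive end (which is what the example row above uses); extract_spans is a hypothetical helper name.

```python
def extract_spans(row):
    """Return (label, substring) pairs; start is inclusive, end is exclusive."""
    text = row["text"]
    found = []
    for span in row["spans"]:
        start, end = span["start"], span["end"]
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {span}")
        found.append((span["label"], text[start:end]))
    return found


row = {
    "text": "Contact jane.doe@example.com or 555-0123",
    "spans": [
        {"start": 8, "end": 28, "label": "EMAIL"},
        {"start": 32, "end": 40, "label": "PHONE"},
    ],
}
print(extract_spans(row))  # [('EMAIL', 'jane.doe@example.com'), ('PHONE', '555-0123')]
```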
Converting between formats
The mechanical conversion is usually 5–15 lines. The whole point of using a standard format is that the trainer's preprocessing handles tokenisation + chat-template + loss-mask for you. If you find yourself writing those by hand, step back and ask whether a small conversion would let the trainer's built-in path do it correctly.
def alpaca_to_chat(row):
    # Keep the blank line between instruction and input so the boundary survives.
    user = row["instruction"] + (f"\n\n{row['input']}" if row.get("input") else "")
    return {"messages": [
        {"role": "user", "content": user},
        {"role": "assistant", "content": row["output"]},
    ]}

def sharegpt_to_chat(row):
    # Unknown speaker names raise KeyError instead of being mapped silently.
    role_map = {"human": "user", "gpt": "assistant", "system": "system"}
    return {"messages": [
        {"role": role_map[t["from"]], "content": t["value"]}
        for t in row["conversations"]
    ]}
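Before converting, it helps to know which shape a row is actually in. The sketch below guesses the format from a row's top-level keys, using the field names described in this lesson. It is a heuristic under that assumption, not a guarantee; detect_format is a hypothetical helper name.

```python
def detect_format(row):
    """Guess a row's format from its top-level keys (a heuristic, not a guarantee)."""
    keys = set(row)
    if "messages" in keys:
        return "chat"
    if "conversations" in keys:
        return "sharegpt"
    if {"instruction", "output"} <= keys:  # "input" is optional in Alpaca rows
        return "alpaca"
    if {"prompt", "completion"} <= keys:
        return "completion"
    if {"text", "spans"} <= keys or {"text", "fields"} <= keys:
        return "extraction"
    if {"text", "label"} <= keys:
        return "classification"
    return "unknown"


print(detect_format({"instruction": "Translate to French.", "output": "Bonjour."}))  # alpaca
print(detect_format({"text": "Never again.", "label": "negative"}))  # classification
```

Running this over the first few hundred rows and checking that every row gives the same answer is a cheap way to catch mixed-format files before they reach the trainer.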
Key idea
The format question is not "what's correct?" — it's "what does my trainer expect, and does my data shape match it?" For chat models, the answer is almost always the chat messages format (a list of {role, content} dicts). For everything else — Alpaca, ShareGPT, completion, CSVs — write a tiny conversion and let the trainer's built-in chat-template + loss-mask code do the rest.
You can now read any dataset you're handed. The next lesson covers a failure mode that bites SFT especially hard once you start training on narrow data: catastrophic forgetting.
Key terms
- JSONL: One JSON object per line; streamable, line-recoverable, the standard container for SFT datasets.
- Completion format: {prompt, completion}. Simplest shape; what Track 2's by-hand SFT uses. Good for single-turn tasks.
- Chat messages format: A list of {role, content} dicts that the tokenizer's apply_chat_template renders. The standard for chat models and multi-turn data.
- Alpaca format: {instruction, input, output}. From Stanford's original instruction-tuning set; many older HF datasets use it.
- ShareGPT format: {conversations: [{from, value}, ...]}. Multi-turn, quirky field names; rename from → role, value → content to convert.
- Classification format: {text, label}. The native shape for classification tasks; what scikit-learn and HF evaluate ingest directly.
- Extraction format: Span set ({text, spans: [...]}) or field map ({text, fields: {...}}) for structured extraction tasks.