Why does the task shape matter so much?

It determines the data format, the loss target, and the evaluation metric

How should named-entity extraction (NER) be scored?

Span-set matching on (type, start, end)

A generated summary scored only by exact-match against one reference will…

Look terrible even when good, because wording differs

Track 1 · SFT fundamentals · Lesson 5

Task shapes: classification, QA, extraction, summarization, chat

After this lesson you can identify the common SFT task shapes, describe how each frames its data and completion, and pick an evaluation metric that actually fits the shape.

Level: beginner · Read time: ~9 min · Prerequisites: Chat templates & special tokens

Before you collect a single example, decide your task shape. The shape determines three things at once: how the data is formatted, what the completion looks like, and which metric tells you whether the model is any good. BrewSLM organizes its whole pipeline around shapes for exactly this reason.

The common shapes

Classification — input → one of N fixed labels. Completion is short (the label). Metric: accuracy, per-class precision/recall/F1, confusion matrix.
QA / instruction following — a question (often with context) → a free-text answer. Metric: exact match / F1 against a reference, or an LLM judge for open answers.
Extraction / NER — input → a set of typed spans (e.g. names, dates, PII). Completion is a structured list of (type, start, end) or the extracted strings. Metric: span-set matching, not classification F1.
Structured output — input → a JSON object with specific fields. Metric: valid-JSON rate plus per-field correctness.
Summarization — long input → short, faithful summary. Metric: ROUGE for overlap and a faithfulness check (did it invent anything?).
Chat — multi-turn dialogue with a persona/behavior. Metric: task-specific or an LLM judge.
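To make the QA metric concrete, here is a minimal sketch of exact match and token-level F1 against a single reference. The function names and the normalization (lowercasing and stripping punctuation) are assumptions for illustration; published QA scorers often normalize more aggressively, for example by dropping articles.

```python
import re
from collections import Counter


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized token sequences are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of bag-of-words token precision and recall."""
    pred, ref = normalize(prediction), normalize(reference)
    if not pred or not ref:
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


reference = "Paris"
short_answer = "paris."
long_answer = "The capital of France is Paris."

em_short = exact_match(short_answer, reference)  # same answer after normalization
em_long = exact_match(long_answer, reference)    # correct, but worded as a sentence
f1_long = token_f1(long_answer, reference)       # partial credit for the overlap

print(em_short, em_long, round(f1_long, 3))
```

The long answer is correct but scores zero on exact match and low on token F1, which is why open-ended answers usually need a more forgiving metric or an LLM judge.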

(There's also the preference shape, a (prompt, chosen, rejected) triple used by DPO, which we covered as an objective and revisit in Track 4.)
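For the structured-output shape in the list above, the two-part metric (valid-JSON rate plus per-field correctness) can be sketched as follows. The function name `score_structured`, the invoice-style fields, and the rule that an unparseable output earns no field credit are all assumptions made for this example, not a fixed standard.

```python
import json


def score_structured(outputs: list[str], references: list[dict]) -> dict:
    """Return the valid-JSON rate and per-field accuracy over all examples."""
    valid = 0
    field_hits: dict[str, int] = {}
    field_totals: dict[str, int] = {}
    for raw, ref in zip(outputs, references):
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError:
            parsed = None
        if isinstance(parsed, dict):
            valid += 1
        else:
            parsed = {}  # assumption: unparseable output gets no field credit
        for field, expected in ref.items():
            field_totals[field] = field_totals.get(field, 0) + 1
            if field in parsed and parsed[field] == expected:
                field_hits[field] = field_hits.get(field, 0) + 1
    return {
        "valid_json_rate": valid / len(outputs) if outputs else 0.0,
        "field_accuracy": {
            name: field_hits.get(name, 0) / total
            for name, total in field_totals.items()
        },
    }


# Hypothetical model outputs and references for an invoice-parsing task.
outputs = [
    '{"vendor": "Acme", "total": 42.5}',
    '{"vendor": "Acme Corp", "total": 10}',
    "vendor: Acme, total: 7",  # not JSON at all
]
references = [
    {"vendor": "Acme", "total": 42.5},
    {"vendor": "Globex", "total": 10},
    {"vendor": "Acme", "total": 7},
]

report = score_structured(outputs, references)
print(report)
```

Reporting the two numbers separately shows whether failures come from malformed output or from wrong values in otherwise well-formed output.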

The shape determines the completion

Notice how the completion differs: a single word for classification, a structured list (often JSON) for extraction, a paragraph for summarization. That shape flows straight into the loss mask (what you train the model to produce) and the chat template (how it's framed). Picking the shape is the first design decision of a fine-tuning project, not an afterthought.
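As an illustration, here is what one training record might look like for three of the shapes. The `prompt` and `completion` field names, the label set, and the entity types are assumptions for this sketch, not BrewSLM's actual schema. Character offsets are end-exclusive and refer to `text`, not to the full prompt.

```python
import json

# Source sentence for the extraction example; offsets below index into it.
text = "Ada Lovelace was born in 1815."

examples = {
    "classification": {
        "prompt": "Label the ticket as billing, bug, or other:\n"
                  "I was charged twice this month.",
        "completion": "billing",
    },
    "extraction": {
        "prompt": f"Extract the entities:\n{text}",
        "completion": json.dumps([
            {"type": "PERSON", "start": 0, "end": 12},
            {"type": "DATE", "start": 25, "end": 29},
        ]),
    },
    "summarization": {
        "prompt": "Summarize:\n<long article text>",
        "completion": "A one-paragraph summary of the article's main points.",
    },
}

# The completion is the part the loss is computed on, and its form
# changes with the shape: one label, a JSON list, or free text.
for shape, record in examples.items():
    print(f"{shape:15s} -> {record['completion']}")
```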

Key idea

The metric must match the shape. Scoring NER with classification F1, or a generated summary with exact-match, produces numbers that look precise and mean nothing. "Your F1 is 4% because the reference was one word" is a measurement bug, not a model failure.
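A minimal sketch of span-set matching for NER, assuming each span is a (type, start, end) tuple and that only exact matches count. The function name `span_prf` and the sample spans are hypothetical. Some evaluations also give credit for partial overlap, which this sketch does not.

```python
def span_prf(
    predicted: set[tuple[str, int, int]],
    gold: set[tuple[str, int, int]],
) -> tuple[float, float, float]:
    """Precision, recall, and F1 over exact (type, start, end) matches."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {("PERSON", 0, 12), ("DATE", 25, 29)}
predicted = {
    ("PERSON", 0, 12),  # exact match
    ("DATE", 22, 29),   # right type, wrong boundary: counts as a miss
    ("ORG", 4, 12),     # spurious span
}

precision, recall, f1 = span_prf(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Because the score is computed over whole spans, a boundary error costs both a false positive and a false negative, which is exactly the behavior a per-label classification metric would hide.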

Choosing your shape

Map your real problem onto the closest shape, and if it doesn't fit cleanly, reframe it until it does. A well-chosen shape makes data collection, training, and evaluation straightforward, while a forced one fights you at every stage. Most business tasks reduce to classification, extraction, or QA. Once the shape is fixed, the next thing that determines success is the quality of the data itself, which is the subject of the next two lessons.

Key terms

Task shape: The structural form of a task (classification, QA, extraction, summarization, chat) that sets the data format, loss, and metric.
Classification: Input → one of N fixed labels.
Extraction / NER: Input → a set of typed spans; scored by span-set matching.
Structured output: Input → a JSON object with specific fields; scored by valid-JSON rate + field correctness.
Summarization: Long input → short faithful summary; scored by overlap (ROUGE) + faithfulness.
Metric-shape fit: Choosing an evaluation metric that matches the task shape so the numbers are meaningful.

Check yourself
