For a multi-class intent classifier with class imbalance, which metric should you lead with?

Macro F1 — equal weight per class, so a heavily imbalanced eval doesn't flatter the model on the majority class.

For an FAQ assistant where the answer must come from a knowledge base, the right architecture is…

Retrieval + a small fine-tuned model (RAG + SFT). Fine-tuning teaches behaviour (concise, polite, refuses out-of-scope); retrieval supplies the facts.

Track 2 · Hands-on · Lesson 15

Project gallery: 6 SLM use cases as recipes

After this lesson you can pick a concrete SLM project from six common use cases (spanning classification, structured extraction, generation with knowledge, and tool-call generation), express it as a recipe using the patterns from Track 2, and ship it. The next three lessons add the evaluation rigour the gallery's recipes assume: LLM-as-a-judge for free-form outputs, public benchmarks as smoke checks, and experiment tracking so today's "works" doesn't become tomorrow's "why did this regress?"

Level: intermediate · Read time: ~10 min · Prerequisites: the rest of Track 2

You've done sentiment classification end-to-end (Capstone A). The same pipeline — load model, build dataset, SFT loop with mask, evaluate against a baseline, ship — applies to a wide range of practical SLM tasks. The differences are surprisingly small: the dataset shape, the scoring mode, the right metric, and one twist per task. This lesson is six projects with those differences made explicit, so you can pick one and start.

Project 1 — Sentiment classifier (done)

You already shipped this in Capstone A. For reference:

Data shape: {prompt, completion} with the completion being a class label.
SFT loop: Lesson 2.10's SFTTrainer + LoRA.
Eval: macro-F1 + per-class report (Lesson 2.12); confusion matrix to see which classes confuse.
Twist: class balance — the trivial-majority-class trap.
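
The trivial-majority-class trap can be seen in a few lines. The sketch below is a dependency-free stand-in for the sklearn report from Lesson 2.12, run on a made-up 90/10 eval set; the function names and the data are illustrative assumptions, not part of the capstone code.

```python
def per_class_f1(gold, pred):
    """Return {label: F1}, computed one-vs-rest for every label seen in gold or pred."""
    scores = {}
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(gold, pred)
    return sum(scores.values()) / len(scores)


def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


# Made-up eval set: 90 positive examples, 10 negative.
gold = ["positive"] * 90 + ["negative"] * 10
# A "model" that always answers with the majority class.
majority_pred = ["positive"] * 100

print(f"accuracy: {accuracy(gold, majority_pred):.3f}")  # accuracy: 0.900
print(f"macro-F1: {macro_f1(gold, majority_pred):.3f}")  # macro-F1: 0.474
```

Accuracy rewards the always-positive model with 0.9, while macro-F1 averages in the zero it scores on the minority class.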

Project 2 — Intent classifier (multi-class, real imbalance)

A router for a chatbot: 10–30 intents, often with one or two intents being far more common than the others (e.g. "track_order" outnumbers "talk_to_human" 20:1).

Data shape: same as sentiment ({prompt, completion=intent_name}) but with more classes.
SFT loop: identical to Project 1.
Eval: macro-F1 is the headline (Lesson 2.12). Per-class report; pay attention to recall on the rare intents — that's what an imbalanced classifier silently breaks.
Twist — handling the imbalance:
1. Upsample the minority intents in training (or downsample the majority).
2. Add the hard rare cases specifically (Lesson 1.7): variations of the user phrasing for each minority intent.
3. Set a confidence threshold below which the router routes to "talk_to_human" — a refusal as a routing destination.
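
Steps 1 and 3 above can be sketched with the standard library alone. This is a minimal illustration, not the lesson's implementation: the `upsample` and `route` helpers, the 0.6 threshold, and the assumption that per-intent probabilities are already available (for example from the scores of the label tokens) are all hypothetical.

```python
import random
from collections import Counter


def upsample(examples, seed=0):
    """Randomly duplicate minority-intent examples until every intent matches the largest one."""
    rng = random.Random(seed)
    by_intent = {}
    for ex in examples:
        by_intent.setdefault(ex["completion"], []).append(ex)
    target = max(len(group) for group in by_intent.values())
    balanced = []
    for group in by_intent.values():
        balanced.extend(group)
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    rng.shuffle(balanced)
    return balanced


def route(intent_probs, threshold=0.6, fallback="talk_to_human"):
    """Return the top intent, or the fallback when its probability is below the threshold."""
    intent, prob = max(intent_probs.items(), key=lambda kv: kv[1])
    return intent if prob >= threshold else fallback


# Made-up training set with a 20:1 imbalance.
train = [{"prompt": f"where is order {i}?", "completion": "track_order"} for i in range(20)]
train.append({"prompt": "let me speak to a person", "completion": "talk_to_human"})

balanced = upsample(train)
print(Counter(ex["completion"] for ex in balanced))

print(route({"track_order": 0.91, "cancel_order": 0.09}))                  # track_order
print(route({"track_order": 0.45, "cancel_order": 0.40, "refund": 0.15}))  # talk_to_human
```

Upsample only the training split; the eval set should keep the real distribution so the rare-intent recall you report is honest.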

Project 3 — JSON extractor with Pydantic validation

Pull structured fields out of free text: invoices, support tickets, transcripts. In many companies this is the dominant practical use of SLMs.

Data shape: {prompt: "Extract ...: <text>", completion: <JSON object string>}.
SFT loop: Lesson 2.10. Low temperature at inference (Lesson 1.19).
Eval: Lesson 2.13's two-number report: valid-JSON rate + per-field accuracy on the parses that succeed.
Twist — schema enforcement:
1. Pydantic schema (Lesson 2.13) — the validator and the documentation of what the model should emit.
2. Optional: constrained decoding (e.g. outlines, jsonformer) at inference, which forces valid JSON at the cost of some quality.
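
The two-number report can be sketched as follows. To keep the example dependency-free, a hand-rolled type check stands in for the Pydantic model from Lesson 2.13; the invoice schema, the helper names, and the sample outputs are all made up for illustration.

```python
import json

# Hypothetical invoice schema: field name -> required Python type.
SCHEMA = {"vendor": str, "total": float, "currency": str}


def parse_and_validate(raw):
    """Return the parsed dict if raw is JSON that matches SCHEMA exactly, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict) or set(obj) != set(SCHEMA):
        return None
    for field, expected in SCHEMA.items():
        value = obj[field]
        if isinstance(value, bool):
            return None  # bool is a subclass of int in Python; never accept it here
        if expected is float and isinstance(value, int):
            continue  # accept 80 where 80.0 is expected
        if not isinstance(value, expected):
            return None
    return obj


def two_number_report(outputs, golds):
    """Return (valid-JSON rate, per-field accuracy over the outputs that validated)."""
    parsed = [parse_and_validate(raw) for raw in outputs]
    valid_rate = sum(p is not None for p in parsed) / len(outputs)
    field_acc = {}
    for field in SCHEMA:
        pairs = [(p[field], g[field]) for p, g in zip(parsed, golds) if p is not None]
        field_acc[field] = sum(a == b for a, b in pairs) / len(pairs) if pairs else 0.0
    return valid_rate, field_acc


outputs = [
    '{"vendor": "Acme", "total": 120.5, "currency": "EUR"}',
    '{"vendor": "Globex", "total": 80, "currency": "USD"}',    # wrong currency
    '{"vendor": "Initech", "total": ',                         # truncated JSON
    '{"vendor": "Umbrella", "total": "12", "currency": "USD"}' # total has the wrong type
]
golds = [
    {"vendor": "Acme", "total": 120.5, "currency": "EUR"},
    {"vendor": "Globex", "total": 80.0, "currency": "EUR"},
    {"vendor": "Initech", "total": 42.0, "currency": "USD"},
    {"vendor": "Umbrella", "total": 12.0, "currency": "USD"},
]

valid_rate, field_acc = two_number_report(outputs, golds)
print(valid_rate)  # 0.5
print(field_acc)   # {'vendor': 1.0, 'total': 1.0, 'currency': 0.5}
```

Reporting the two numbers separately matters: per-field accuracy is computed only over outputs that validated, so it looks flattering whenever the valid rate is low.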

Project 4 — PII detector (span extraction)

Find personal information in a string: names, emails, phone numbers, addresses. Differs from classification: there can be zero or many spans per input.

Data shape: {text, spans: [{start, end, label}, ...]} (Lesson 1.20's extraction format).
SFT loop: reframe as JSON extraction — the assistant's reply is the JSON array of spans. Same SFTTrainer pipeline.
Eval: entity-level F1 with seqeval (available through HF evaluate), after converting the spans to token-level BIO tags, plus per-type precision/recall (a model that gets phone numbers right and emails wrong is a different model from one that gets both at 80%).
Twist — false negatives are the costly direction: for safety-sensitive PII work, recall is the metric to defend. Missing a phone number is far worse than over-flagging some innocent strings. Tune the threshold or the data accordingly (more hard negatives that aren't PII; more positives covering rare formats).
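
Per-type precision and recall can be computed directly on the span format above. The sketch below uses exact-match span comparison in plain Python as a stand-in for seqeval (which scores tag sequences rather than character spans); the helper names and the three toy documents are assumptions for illustration.

```python
from collections import defaultdict


def span_set(spans):
    """Turn a list of {start, end, label} dicts into a set of comparable tuples."""
    return {(s["start"], s["end"], s["label"]) for s in spans}


def span_prf(gold_docs, pred_docs):
    """Exact-match precision/recall/F1 per label over parallel lists of span lists."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for gold, pred in zip(gold_docs, pred_docs):
        g, p = span_set(gold), span_set(pred)
        for _, _, label in g & p:
            counts[label]["tp"] += 1
        for _, _, label in p - g:
            counts[label]["fp"] += 1
        for _, _, label in g - p:
            counts[label]["fn"] += 1
    report = {}
    for label, c in counts.items():
        precision = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        recall = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1}
    return report


gold_docs = [
    [{"start": 0, "end": 8, "label": "NAME"}, {"start": 20, "end": 33, "label": "PHONE"}],
    [{"start": 5, "end": 22, "label": "EMAIL"}],
    [],
]
pred_docs = [
    [{"start": 0, "end": 8, "label": "NAME"}, {"start": 20, "end": 33, "label": "PHONE"}],
    [],                                          # missed the email: a false negative
    [{"start": 3, "end": 9, "label": "PHONE"}],  # flagged an innocent string: a false positive
]

for label, scores in sorted(span_prf(gold_docs, pred_docs).items()):
    print(label, scores)
```

In this toy run PHONE keeps recall 1.0 despite the over-flag, while EMAIL recall drops to 0.0, which is the failure the twist says to defend against.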

Project 5 — FAQ assistant (RAG + small fine-tune)

"Given our docs, answer customer support questions." The first instinct is "fine-tune on the FAQ." Wrong — fine-tuning teaches behaviour, not facts (Foundations 0.7). A small fine-tune for tone + format, plus retrieval for the facts, is the right architecture.

Data shape: {prompt: question + retrieved_passages, completion: grounded_answer}. The completion only references the supplied passages.
SFT loop: Lesson 2.10. Multi-turn (Lesson 2.14) when the assistant should ask clarifying questions before answering.
Eval: faithfulness ("does the answer use only the passages?") + refusal accuracy ("does it refuse when the docs don't cover the question?") + format. BrewSLM's eval pack handles this taxonomy.
Twist — refusals are training data: include explicit examples where the right answer is "I don't have that information." Lesson 1.7's refusal data, made concrete.
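
A minimal sketch of the data shape and the refusal check, under stated assumptions: the prompt template, the `build_example` and `refusal_accuracy` helpers, and the exact-string test for a refusal are all illustrative choices, and real evals would need a more tolerant refusal detector.

```python
REFUSAL = "I don't have that information."


def build_example(question, passages, answer=None):
    """Build one {prompt, completion} record; leaving answer as None makes it a refusal example."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    prompt = f"Passages:\n{context}\n\nQuestion: {question}"
    return {"prompt": prompt, "completion": answer if answer is not None else REFUSAL}


def refusal_accuracy(examples, predictions):
    """Fraction of examples where the model refused exactly when the gold completion refuses."""
    hits = sum(
        (ex["completion"] == REFUSAL) == (pred.strip() == REFUSAL)
        for ex, pred in zip(examples, predictions)
    )
    return hits / len(examples)


passages = ["Returns are accepted within 30 days of delivery."]
examples = [
    # Covered by the passages: a grounded answer.
    build_example("How long is the return window?", passages,
                  "You can return items within 30 days of delivery."),
    # Not covered by the passages: the gold completion is the refusal.
    build_example("Do you ship to Mars?", passages),
]

predictions = [
    "You can return items within 30 days of delivery.",
    "Yes, we ship to Mars in 3 days.",  # fabricated answer where a refusal was expected
]

print(examples[1]["completion"])                # I don't have that information.
print(refusal_accuracy(examples, predictions))  # 0.5
```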

Project 6 — Tool-call generator (structured function calls)

Given a user request and a list of available tools (functions with JSON schemas), the model emits a JSON object describing which tool to call and what arguments to pass. The agent / function-calling backbone.

Data shape: system message lists the available tools (with their schemas); user message is the request; assistant emits {"tool": "name", "arguments": {...}} or a refusal.
SFT loop: Lesson 2.10 + Lesson 2.14 (multi-turn, because tool calls often chain).
Eval: tool-call accuracy (right tool name) + argument-set match (the arguments match the gold) + valid-call rate (the JSON parses and names a real tool). Lesson 2.13's two-number report, tool-flavoured, with tool-name accuracy as a third number.
Twist — calling no tool is also a valid output: include negative examples where the right thing is "no tool needed, answer directly" or "request needs clarification." A model that calls a tool every time is broken.
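
The three eval numbers, plus the no-tool case, can be sketched like this. The tool registry, the helper names, and the rule that a reply not starting with `{` counts as "no tool called" are assumptions made for the example, not a fixed convention.

```python
import json

# Hypothetical tool registry: tool name -> set of allowed argument names.
TOOLS = {"get_order_status": {"order_id"}, "issue_refund": {"order_id", "amount"}}


def parse_call(raw):
    """Return the call dict if raw is a well-formed call to a known tool, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    tool, args = obj.get("tool"), obj.get("arguments")
    if not isinstance(tool, str) or tool not in TOOLS:
        return None
    if not isinstance(args, dict) or not set(args) <= TOOLS[tool]:
        return None
    return {"tool": tool, "arguments": args}


def score(golds, outputs):
    """golds holds a call dict, or None when the right output is a plain-text reply."""
    with_call = [(g, parse_call(o)) for g, o in zip(golds, outputs) if g is not None]
    no_call = [o for g, o in zip(golds, outputs) if g is None]
    n = len(with_call) or 1  # avoid dividing by zero on an empty slice
    return {
        "valid_call_rate": sum(p is not None for _, p in with_call) / n,
        "tool_accuracy": sum(p is not None and p["tool"] == g["tool"] for g, p in with_call) / n,
        "argument_match": sum(p == g for g, p in with_call) / n,
        # Assumption: any reply that does not open a JSON object counts as "no tool called".
        "no_tool_accuracy": sum(not o.lstrip().startswith("{") for o in no_call) / (len(no_call) or 1),
    }


golds = [
    {"tool": "get_order_status", "arguments": {"order_id": "A17"}},
    {"tool": "issue_refund", "arguments": {"order_id": "A17", "amount": 20}},
    {"tool": "get_order_status", "arguments": {"order_id": "B2"}},
    None,  # the right behaviour is to answer directly
]
outputs = [
    '{"tool": "get_order_status", "arguments": {"order_id": "A17"}}',
    '{"tool": "issue_refund", "arguments": {"order_id": "A17", "amount": 25}}',  # wrong amount
    '{"tool": "lookup_weather", "arguments": {}}',                               # not a real tool
    "Our support hours are 9 to 5.",
]

print(score(golds, outputs))
```

Keeping `no_tool_accuracy` as its own number makes the "calls a tool every time" failure visible instead of letting it hide inside the averages.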

Pattern-spotting across the six

Now the common skeleton is clear:

1. Pick the task shape (Lesson 1.5): single-label classification, multi-label, span extraction, structured generation, grounded generation, tool-call.
2. Pick the data format matching that shape (Lesson 1.20).
3. Run the SFT loop (Lesson 2.10) on a small base (or QLoRA on a bigger one, Lesson 2.11).
4. Pick the eval matching the task: sklearn classification report for labels (Lesson 2.12), Pydantic + per-field for JSON (Lesson 2.13), seqeval for spans, faithfulness + refusal for grounded generation.
5. Iterate on data: hard negatives, ambiguous cases, refusals, OOD (Lesson 1.7's extended taxonomy).
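
One way to make the skeleton concrete is to write each project down as a small record. The `Recipe` dataclass below is a hypothetical sketch: its field names are an assumption, the values are copied from the six project descriptions in this lesson, and LoRA knobs are left out because the lesson does not specify them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    name: str
    task_shape: str
    data_shape: str
    evals: tuple
    twist: str


GALLERY = [
    Recipe("sentiment classifier", "single-label classification",
           "{prompt, completion=label}",
           ("macro-F1", "per-class report", "confusion matrix"),
           "class balance"),
    Recipe("intent classifier", "single-label classification",
           "{prompt, completion=intent_name}",
           ("macro-F1", "per-class report", "recall on rare intents"),
           "handling the imbalance"),
    Recipe("JSON extractor", "structured generation",
           "{prompt, completion=JSON object string}",
           ("valid-JSON rate", "per-field accuracy"),
           "schema enforcement"),
    Recipe("PII detector", "span extraction",
           "{text, spans: [{start, end, label}, ...]}",
           ("entity-level F1", "per-type precision/recall"),
           "false negatives are the costly direction"),
    Recipe("FAQ assistant", "grounded generation",
           "{prompt=question + retrieved_passages, completion=grounded_answer}",
           ("faithfulness", "refusal accuracy", "format"),
           "refusals are training data"),
    Recipe("tool-call generator", "tool-call",
           "system lists tools; user request; assistant emits a call or a refusal",
           ("tool-call accuracy", "argument-set match", "valid-call rate"),
           "calling no tool is also a valid output"),
]


def by_shape(shape):
    """Return the names of the gallery projects that share a task shape."""
    return [r.name for r in GALLERY if r.task_shape == shape]


print(by_shape("single-label classification"))  # ['sentiment classifier', 'intent classifier']
```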

Honest beat — pick one, ship one

The single biggest mistake at this stage is picking three projects "to compare" and shipping none. Pick one, ship it end-to-end (data → train → eval → deploy), measure the lift against a base, then pick the next. The skills transfer; what doesn't transfer is the experience of getting all the way through a pipeline including the boring parts. Ship one before you scope two.

Key idea

The Track 2 pipeline applies to every common SLM use case with small, predictable changes — data shape, scoring mode, the one twist. Pick a project, write the recipe in the shape above, and use the rest of the track as the implementation. The next track shows how to run the same recipes through BrewSLM, where data import, eval packs, and deployment are platform surfaces.

The gallery shows what to build; the next three lessons of Track 2 sharpen how to know it works: LLM-as-a-judge for free-form outputs, public benchmarks (lm-eval-harness) as smoke checks against the base, and experiment tracking (MLflow / W&B) so you can still compare runs three weeks from now. After that, Track 3 takes the same pipeline through BrewSLM.

Key terms

Task shape: The structural form of a task — single-label, multi-label, span extraction, structured generation, grounded generation, tool-call — that determines data format and scoring.
Scoring mode: The kind of metric matching the task shape (classification → F1; extraction → entity-level F1 via seqeval; structured → valid-rate + per-field; grounded → faithfulness).
Recipe: The minimum description of how to fine-tune for a task: data shape + LoRA knobs + scoring mode + the twist.
Faithfulness: Whether a generated answer is grounded in the supplied context (the RAG passages) rather than hallucinated.
Refusal accuracy: Whether the model correctly declines on out-of-scope / under-documented questions instead of fabricating an answer.
seqeval: The standard span-extraction metric library (HF evaluate exposes it); produces entity-level precision/recall/F1.
