Track 2 · Hands-on · Lesson 15

Project gallery: 6 SLM use cases as recipes

After this lesson you can pick a concrete SLM project from six common projects (covering classification, structured and span extraction, generation with knowledge, and tool-call generation), express it as a recipe using the patterns from Track 2, and ship it. The next three lessons add the evaluation rigour the gallery's recipes assume: LLM-as-a-judge for free-form outputs, public benchmarks as smoke checks, and experiment tracking so today's "works" doesn't become tomorrow's "why did this regress?"

Level: intermediate · Read time: ~10 min · Prerequisites: the rest of Track 2

You've done sentiment classification end-to-end (Capstone A). The same pipeline — load model, build dataset, SFT loop with mask, evaluate against a baseline, ship — applies to a wide range of practical SLM tasks. The differences are surprisingly small: the dataset shape, the scoring mode, the right metric, and one twist per task. This lesson is six projects with those differences made explicit, so you can pick one and start.

Project 1 — Sentiment classifier (done)

You already shipped this in Capstone A; it stays in the gallery as the reference point for the other five projects.

Project 2 — Intent classifier (multi-class, real imbalance)

A router for a chatbot: 10–30 intents, often with one or two intents being far more common than the others (e.g. "track_order" outnumbers "talk_to_human" 20:1).
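
To see why imbalance changes the metric choice, here is a minimal sketch in plain Python (no sklearn) that compares accuracy with macro-F1 on a toy 20:1 split. The intent names come from the example above; the counts and the always-predict-the-majority "model" are made up for illustration.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was something else
            fn[t] += 1  # missed the true label t
    scores = {}
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores[label] = 2 * tp[label] / denom if denom else 0.0
    return scores

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare intents count as much as common ones."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)

def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy 20:1 imbalance (made-up counts).
y_true = ["track_order"] * 20 + ["talk_to_human"] * 1
y_pred = ["track_order"] * 21  # degenerate model: always predicts the majority intent

acc = accuracy(y_true, y_pred)
macro = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f}  macro_f1={macro:.3f}")
```

Accuracy stays above 0.95 while macro-F1 drops below 0.5, because the rare intent scores zero. For a router, that rare intent is often the one that matters most, so report per-class numbers alongside any single summary.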

Project 3 — JSON extractor with Pydantic validation

Pull structured fields out of free text — invoices, support tickets, transcripts. The dominant practical use of SLMs in many companies.
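
A minimal sketch of the scoring side of this project: valid-rate plus per-field accuracy. To keep it dependency-free, a hand-written type check stands in for the Pydantic model; in a real pipeline a Pydantic class would replace `parse_record`. The schema, field names, and sample outputs are hypothetical.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
SCHEMA = {"invoice_id": str, "total": (int, float), "currency": str}

def parse_record(raw):
    """Return the parsed dict if it matches SCHEMA exactly, else None."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or set(data) != set(SCHEMA):
        return None
    for name, types in SCHEMA.items():
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            return None
    return data

def score(outputs, golds):
    """Valid-rate plus per-field exact match; an invalid output counts as wrong."""
    parsed = [parse_record(o) for o in outputs]
    valid_rate = sum(p is not None for p in parsed) / len(outputs)
    per_field = {}
    for name in SCHEMA:
        hits = sum(p is not None and p[name] == g[name] for p, g in zip(parsed, golds))
        per_field[name] = hits / len(golds)
    return valid_rate, per_field

outputs = [
    '{"invoice_id": "A-1", "total": 120.5, "currency": "EUR"}',
    '{"invoice_id": "A-2", "total": "99", "currency": "USD"}',  # total is a string: invalid
    '{"invoice_id": "A-3", "total": 40, "currency": "GBP"}',    # valid, but wrong currency
    'Sure! Here is the JSON you asked for.',                     # not JSON at all
]
golds = [
    {"invoice_id": "A-1", "total": 120.5, "currency": "EUR"},
    {"invoice_id": "A-2", "total": 99.0, "currency": "USD"},
    {"invoice_id": "A-3", "total": 40.0, "currency": "USD"},
    {"invoice_id": "A-4", "total": 10.0, "currency": "USD"},
]
valid_rate, per_field = score(outputs, golds)
print(valid_rate, per_field)
```

Reporting the two numbers separately is the point: valid-rate tells you whether the model has learned the format, and per-field accuracy tells you which fields still need more data.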

Project 4 — PII detector (span extraction)

Find personal information in a string: names, emails, phone numbers, addresses. Differs from classification: there can be zero or many spans per input.
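
The "zero or many spans" property is easiest to see in the scorer. Below is a hand-rolled sketch of entity-level precision, recall, and F1 with exact span matching. It is a simplification for illustration, not the seqeval library itself, and the texts, offsets, and labels are made up.

```python
def span_prf(gold, pred):
    """Micro-averaged entity-level precision/recall/F1.

    gold and pred hold one set per input; each set contains
    (start, end, label) tuples and may be empty.
    """
    tp = sum(len(g & p) for g, p in zip(gold, pred))  # exact-match spans only
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

texts = [
    "Email jane@example.com or call 555-0100.",
    "The order shipped on Tuesday.",  # contains no PII: zero gold spans
]
gold = [
    {(6, 22, "EMAIL"), (31, 39, "PHONE")},
    set(),
]
pred = [
    {(6, 22, "EMAIL"), (31, 38, "PHONE")},  # phone span is one character short
    {(4, 9, "NAME")},                        # false positive on a clean input
]

precision, recall, f1 = span_prf(gold, pred)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Exact matching is strict: the phone span that is off by one character counts as both a false positive and a false negative. Clean inputs with no spans only ever affect precision, which is why the dataset needs plenty of them.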

Project 5 — FAQ assistant (RAG + small fine-tune)

"Given our docs, answer customer support questions." The first instinct is "fine-tune on the FAQ." Wrong — fine-tuning teaches behaviour, not facts (Foundations 0.7). A small fine-tune for tone + format, plus retrieval for the facts, is the right architecture.

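The split between retrieval (facts) and fine-tuning (tone and format) can be sketched as a prompt builder. This is a toy under stated assumptions: the passages are invented, the retriever is plain word overlap rather than an embedding index, and `REFUSAL`, `retrieve`, and `build_prompt` are hypothetical names.

```python
import re

DOCS = [  # hypothetical FAQ passages
    "Refunds are issued within 5 business days of receiving the returned item.",
    "Standard shipping takes 3 to 7 business days within the EU.",
]

REFUSAL = "I don't have that in our documentation."

def tokens(text):
    """Lowercased word set; a stand-in for a real embedding."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, docs, k=1, min_overlap=2):
    """Rank docs by shared-word count and drop those below min_overlap."""
    q = tokens(question)
    scored = sorted(((len(q & tokens(d)), d) for d in docs), reverse=True)
    return [d for s, d in scored[:k] if s >= min_overlap]

def build_prompt(question, docs):
    """Return a grounded prompt, or None when retrieval finds no support."""
    passages = retrieve(question, docs)
    if not passages:
        return None  # caller answers with REFUSAL instead of calling the model
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context. If the context does not contain "
        f"the answer, reply: {REFUSAL}\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

in_scope = build_prompt("How long do refunds take to be issued?", DOCS)
out_of_scope = build_prompt("Do you sell gift cards?", DOCS)
print(in_scope)
print(out_of_scope if out_of_scope is not None else REFUSAL)
```

The facts live in `DOCS` and can change without retraining. The fine-tune only has to teach the model to stay inside the context and to produce the refusal line in the house tone, and those are the two behaviours the faithfulness and refusal metrics check.
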
Project 6 — Tool-call generator (structured function calls)

Given a user request and a list of available tools (functions with JSON schemas), the model emits a JSON object describing which tool to call and what arguments to pass. The agent / function-calling backbone.
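
Scoring a tool-call generator starts with a structural check on each emitted call. The sketch below assumes the model outputs `{"name": ..., "arguments": {...}}` and validates it against a small, hypothetical tool list; the schema handling covers only `required` and primitive types, not full JSON Schema.

```python
import json

# Hypothetical tool list; each entry mimics a small subset of JSON Schema.
TOOLS = {
    "get_order_status": {
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
    "set_quantity": {
        "properties": {"order_id": {"type": "string"}, "quantity": {"type": "integer"}},
        "required": ["order_id", "quantity"],
    },
}

PY_TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def check_tool_call(raw, tools=TOOLS):
    """Return (ok, reason) for one model output."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return False, "not valid JSON"
    if not isinstance(call, dict):
        return False, "not a JSON object"
    name, args = call.get("name"), call.get("arguments")
    if not isinstance(name, str) or name not in tools:
        return False, "unknown tool"
    if not isinstance(args, dict):
        return False, "arguments must be an object"
    spec = tools[name]
    missing = [k for k in spec["required"] if k not in args]
    if missing:
        return False, f"missing argument: {missing[0]}"
    for key, value in args.items():
        if key not in spec["properties"]:
            return False, f"unexpected argument: {key}"
        expected = spec["properties"][key]["type"]
        # bool is a subclass of int, so only accept it for "boolean".
        if isinstance(value, bool) and expected != "boolean":
            return False, f"wrong type for {key}"
        if not isinstance(value, PY_TYPES[expected]):
            return False, f"wrong type for {key}"
    return True, "ok"

print(check_tool_call('{"name": "get_order_status", "arguments": {"order_id": "A-17"}}'))
print(check_tool_call('{"name": "refund_order", "arguments": {}}'))
```

A structurally valid call can still pick the wrong tool or the wrong argument value, so this check is a floor. Tool-name accuracy and argument exact-match against gold calls would sit on top of it.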

Pattern-spotting across the six

Now the common skeleton is clear:

  1. Pick the task shape (Lesson 1.5): single-label classification, multi-label, span extraction, structured generation, grounded generation, tool-call.
  2. Pick the data format matching that shape (Lesson 1.20).
  3. Run the SFT loop (Lesson 2.10) on a small base (or QLoRA on a bigger one, Lesson 2.11).
  4. Pick the eval matching the task: sklearn classification report for labels (Lesson 2.12), Pydantic + per-field for JSON (Lesson 2.13), seqeval for spans, faithfulness + refusal for grounded generation.
  5. Iterate on data — hard negatives, ambiguous cases, refusals, OOD (Lesson 1.7's extended taxonomy).
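
The skeleton above can be written down as a small data structure. This is only a sketch: the `Recipe` class and its field names are hypothetical, and the shape-to-eval table copies step 4, leaving out the shapes that step 4 does not pair with an eval rather than guessing them.

```python
from dataclasses import dataclass, field

# Task shape -> eval, as listed in step 4.
SCORING_FOR_SHAPE = {
    "single-label classification": "sklearn classification report",
    "structured generation": "Pydantic valid-rate + per-field accuracy",
    "span extraction": "seqeval entity-level precision/recall/F1",
    "grounded generation": "faithfulness + refusal accuracy",
}

@dataclass
class Recipe:
    name: str
    task_shape: str
    twist: str
    lora: dict = field(default_factory=dict)  # LoRA knobs; left empty in this sketch
    scoring_mode: str = field(init=False, default="")

    def __post_init__(self):
        # Derive the scoring mode from the task shape so the two cannot drift apart.
        if self.task_shape not in SCORING_FOR_SHAPE:
            raise ValueError(f"no scoring mode listed for task shape {self.task_shape!r}")
        self.scoring_mode = SCORING_FOR_SHAPE[self.task_shape]

intent = Recipe(
    name="intent classifier",
    task_shape="single-label classification",
    twist="real class imbalance",
)
print(intent.scoring_mode)
```

Writing the recipe as data before writing any training code forces the choice of task shape and metric up front, which is where most of the differences between the six projects live.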

Honest beat — pick one, ship one

The single biggest mistake at this stage is picking three projects "to compare" and shipping none. Pick one, ship it end-to-end (data → train → eval → deploy), measure the lift against a base, then pick the next. The skills transfer; what doesn't transfer is the experience of getting all the way through a pipeline including the boring parts. Ship one before you scope two.

Key idea

The Track 2 pipeline applies to every common SLM use case with small, predictable changes — data shape, scoring mode, the one twist. Pick a project, write the recipe in the shape above, and use the rest of the track as the implementation. The next track shows how to run the same recipes through BrewSLM, where data import, eval packs, and deployment are platform surfaces.

The gallery shows what to build; the next three lessons of Track 2 sharpen how to know it works: LLM-as-a-judge for free-form outputs, public benchmarks (lm-eval-harness) as smoke checks against the base, and experiment tracking (MLflow / W&B) so you can still compare runs three weeks from now. After that, Track 3 takes the same pipeline through BrewSLM.

Key terms

Task shape
The structural form of a task — single-label, multi-label, span extraction, structured generation, grounded generation, tool-call — that determines data format and scoring.
Scoring mode
The kind of metric matching the task shape (classification → F1; extraction → seqeval; structured → valid-rate + per-field; grounded → faithfulness).
Recipe
The minimum description of how to fine-tune for a task: data shape + LoRA knobs + scoring mode + the twist.
Faithfulness
Whether a generated answer is grounded in the supplied context (the RAG passages) rather than hallucinated.
Refusal accuracy
Whether the model correctly declines on out-of-scope / under-documented questions instead of fabricating an answer.
seqeval
The standard span-extraction metric library (HF evaluate exposes it); produces entity-level precision/recall/F1.
