Customer ticket triage with multi-class classification and class imbalance
By the end of this tutorial you'll have a multi-class classifier that takes the body text of an incoming support ticket and emits the right team label — billing, technical, account, feature_request, bug_report, cancellation — with a confidence score. It runs at ~10ms per ticket on a single small GPU, doesn't ship customer data to a third-party API, and is honest enough about its uncertainty that mid-confidence tickets get shadow-routed for review instead of straight-routed and forgotten.
Before you start
This tutorial assumes BrewSLM is running locally at http://localhost:5173 with an admin user signed in. If you haven't done that yet, complete Tutorial 0 — Set up BrewSLM and your first project first. It takes ~15 minutes and is the prerequisite for every tutorial in this track.
You'll also want ~1,000 historical tickets paired with the team that actually resolved them (export from Zendesk / Intercom / Salesforce / Freshdesk as CSV; most ticketing systems ship this in a few clicks). The team-routing labels are the highest-signal training data you have; everything else in this tutorial is about making the most of them.
Terms you'll see in this tutorial
- Recipe
- The training-plan template you pick when creating a project. For this tutorial: classification, the same recipe as Tutorial 2's binary SQLi classifier, used here for a 6-way (multi-class) label vocabulary instead.
- Multi-class classification
- One label per row, drawn from a fixed vocabulary of more than two classes. Distinct from binary classification (yes/no) and from multi-label (a single ticket might carry two labels at once; that's not what we're doing here: one ticket, one routing team).
- Class imbalance
- The natural ratio of classes in real-world ticket archives is skewed. Most tickets are technical or billing; feature_request and cancellation are typically <5% of volume each. A naive split + train produces a model that always predicts the majority class.
- Shannon entropy (normalised)
- A single number summarising how balanced your class distribution is. 1.0 = perfectly balanced; 0.0 = a single class dominates. The platform's data-health diagnostic reports both the raw entropy (in nats) and a normalised version factored into the trainability forecast.
- CLASS_BALANCE_FILL playbook
- BrewSLM's synth generator for fixing class imbalance. Auto-detects the under-represented class in your gold set (or you pin it) and generates more examples of that class. Distinct from POSITIVES_PARAPHRASE, which paraphrases existing rows in proportion to current counts.
- macro_f1
- Average of per-class F1 scores, weighted equally regardless of class size. The headline classification eval metric: it catches the imbalance bug where the model is great at the majority class and silently bad at the rest.
- Per-class F1 floor
- Eval gate that requires every individual class's F1 to clear a threshold. Catches the failure mode where one class is starved of training signal and macro_f1 is dragged down by it.
- Confidence threshold
- A floor on the model's predicted probability below which the deployed router declines to auto-route. The two-threshold pattern: high-confidence routes auto, mid-confidence shadow-routes (predicted team, flagged for review), low-confidence goes to a triage-manager queue.
This is BrewSLM's canonical workflow for a multi-class text classifier deployed on a routing path with natural class imbalance. The use case is customer ticket triage but the recipe shape generalises: bug-report categorisation, internal helpdesk routing, document type classification on a scanner ingest, alert-to-team routing in an SRE pipeline — anything where one piece of text needs one label drawn from a finite vocabulary, and the natural distribution isn't uniform. The bones are the same; only your label vocabulary and your historical-labels source change.
The end state is the model a support-ops team would actually run in production: small enough to run on the queue worker that already pulls tickets off your ingest, accurate enough on the busy classes to remove a measurable load from human triage, and well-calibrated enough on the rare classes that cancellation and feature_request tickets stop getting buried in the technical queue.
What you'll build
A six-way text classifier. Given the body text of an incoming ticket, it emits one label plus a confidence score. Sample inputs and outputs:
INPUT: "My credit card was charged twice for this month's subscription —
can you refund the duplicate?"
OUTPUT: { "label": "billing", "confidence": 0.97 }
INPUT: "The desktop app crashes on launch after the latest update.
Logs attached."
OUTPUT: { "label": "bug_report", "confidence": 0.91 }
INPUT: "Please cancel my account effective immediately. I don't want
to be billed for next month."
OUTPUT: { "label": "cancellation", "confidence": 0.94 }
INPUT: "Would love a dark mode in the mobile app — any plans?"
OUTPUT: { "label": "feature_request", "confidence": 0.88 }
INPUT: "I can't log in. Forgot my password and the reset email
never arrived."
OUTPUT: { "label": "account", "confidence": 0.83 }
The model is a fine-tuned LoRA adapter on top of SmolLM2-135M-Instruct, typically running at 5-15ms per ticket on a single GPU and 30-60ms on CPU. That is latency-friendly enough to sit inline on the ticket-ingest path, and small enough to deploy onto the same machine that already runs your queue workers.
Key idea
Multi-class triage on a natural ticket stream lives or dies on the rare classes. Macro_f1, the headline metric, averages per-class F1 equally regardless of class size. A model that aces technical and billing but never predicts cancellation scores zero on that class, which caps macro_f1 at 5/6 ≈ 0.83 even if every other class is perfect; in practice it lands around 0.65 and fails the per-class F1 floor outright. The training drill that matters most is the CLASS_BALANCE_FILL playbook: generate examples for under-represented classes until the Shannon entropy of your gold set crosses a healthy threshold. Get this right and macro_f1 takes care of itself.
Why a small model (not rules, not a frontier API)
Three options exist for ticket triage. Use this comparison:
| Approach | Routing quality | Latency | Cost at 50k tickets/month | Privacy |
|---|---|---|---|---|
| Keyword rules ("refund" → billing, "crash" → bug) | Brittle — fails on intent overlap ("the technical team should refund this") and on tickets that don't surface the keyword at all | <1ms | Free (but a full-time PM to maintain the keyword list as labels evolve) | Self-hosted |
| Frontier LLM via API | Excellent on novel phrasing | 1-3 seconds per ticket | $150-$1,500/month plus per-token costs; multiplies fast at higher volumes | Every ticket — with customer names, account numbers, sometimes attached PII — leaves your network |
| Small fine-tuned classifier (this tutorial) | Good on novel phrasing if you curate balanced gold | 5-60ms | ~$30/month (one shared GPU, amortised across all your small models) | Self-hosted |
Keyword rules were how every support team did this in 2018. They still work for the top-of-funnel obvious cases — but the moment a customer writes "the technical team should fix the billing on my account", the rule stack picks the wrong team. Intent overlap is the rule-stack killer; small fine-tuned models read the semantics, not just the keywords.
Frontier-LLM-via-API solves the semantics problem but ships every ticket (names, account identifiers, sometimes financial details) to a third party. For most B2B SaaS support orgs that's a compliance non-starter even before the cost. Small fine-tuned models occupy the gap: semantics-aware like an LLM, fast and self-hosted like a keyword rule.
Choose your dataset
You need two things: ticket text and the team that resolved it. Your historical ticket archive is the spine; public datasets are useful for warmup and for stretching template diversity.
- Your own ticket archive (the spine)
- Export 1,000-5,000 recent tickets from Zendesk / Intercom / Salesforce / Freshdesk as CSV: the ticket body and the team that resolved it. The resolving team is your gold label proxy. It's not perfect (mid-way reassignments are noisy), but it's the best signal available without manual relabeling.
- Bitext customer-support corpus
- bitext/Bitext-customer-support-llm-chatbot-training-dataset ships 26,872 examples across 27 intents. Useful for warmup; labels won't map 1:1 to your team taxonomy but the phrasing variation is good signal.
- CLINC150
- clinc150 ships 150 intents across 10 domains. Useful for testing the end-to-end multi-class workflow before you commit real ticket data.
- Banking77
- banking77 is 13,083 customer-service queries across 77 banking-specific intents. Useful if your B2B product is in fintech and your billing class needs more variety.
Public data is for diversity, not coverage
Your archive's tickets are what the model will see in production. Public corpora round out phrasing variety so the model doesn't memorise your top customer accounts' writing style. Treat them as a 10-20% supplement — never the spine. A model trained 100% on bitext will be brilliant at the phrases a synthetic-data team chose and useless on the weird, specific phrasings your real customers actually use.
Ingest and map
In BrewSLM, create a new project: Projects → New Project → classification recipe. The recipe pre-fills the adapter (classification-label), task profile (classification), scoring mode (field_match), and the eval pack (the classification scaffold — macro_f1, accuracy, per-class F1 floor, optional safety pass rate).
Open Data Studio → Import. Your CSV should look like:
text,label
"My card was charged twice this month — please refund the duplicate.",billing
"App crashes on launch after today's update. Logs attached.",bug_report
"Please cancel my account effective end of month.",cancellation
"Can you add a dark mode to the mobile app?",feature_request
"Forgot my password and the reset email isn't arriving.",account
"The API is returning 502 on /v1/orders intermittently.",technical
Two columns: text (the ticket body) and label (one of your six team labels). The mapping picker shows a confidence-scored preview of the column inference; click Apply mapping once the labels look right.
✓ Checkpoint: the Data Studio Overview now shows your imported row count and the Quality & Safety panel surfaces a per-class breakdown (e.g. "technical: 1,840, billing: 1,210, account: 480, bug_report: 380, feature_request: 92, cancellation: 64 — Shannon entropy 1.34 nats, normalised 0.57"). Anything below 0.7 normalised is meaningful imbalance; the goal ledger will surface this as a forecast penalty on the data-ready row, and CLASS_BALANCE_FILL becomes the priority before any training.
Label normalisation
Make sure your label vocabulary is exactly the six strings you want — not billing in some rows and Billing in others, not tech in one batch and technical in the next, not cancel alongside cancellation. The classifier emits the label string verbatim; capitalisation drift and synonyms during training show up as model "uncertainty" between near-identical classes and tank macro_f1 with nothing visible to fix in the eval gates. Run a quick spreadsheet sort + count before import.
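If you'd rather script that check than sort a spreadsheet, a few lines of Python will do. This is a minimal sketch, not a BrewSLM feature: the SYNONYMS map and the inline sample rows are made up for illustration, and you would extend the map with whatever drift your own export shows.

```python
import csv
import io
from collections import Counter

# The six team labels used throughout this tutorial.
CANONICAL = {"billing", "technical", "account", "feature_request", "bug_report", "cancellation"}

# Hypothetical synonym map: extend it with the drift you find in your own export.
SYNONYMS = {"tech": "technical", "cancel": "cancellation", "bug": "bug_report"}


def normalise_label(raw: str) -> str:
    """Trim, lower-case, unify separators, then apply the synonym map."""
    key = raw.strip().lower().replace(" ", "_").replace("-", "_")
    return SYNONYMS.get(key, key)


def audit_labels(rows):
    """Return (counts per normalised label, labels outside the vocabulary)."""
    counts = Counter(normalise_label(r["label"]) for r in rows)
    unknown = {label for label in counts if label not in CANONICAL}
    return counts, unknown


# Illustrative rows only; replace with csv.DictReader over your real export.
sample_csv = """text,label
"Charged twice",Billing
"Refund please",billing
"App is down",tech
"Close my account",cancel
"Add dark mode",Feature Request
"Something else",misc
"""
rows = list(csv.DictReader(io.StringIO(sample_csv)))
counts, unknown = audit_labels(rows)
print(dict(counts))
print("needs manual review:", unknown)
```

Anything that lands in the unknown set is a row to fix by hand before import rather than something to guess at in code.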
Cleanup and Shannon-entropy diagnostic
Open Data Studio's Quality & Safety panel. Four common artefact families to clean:
- Auto-reply signatures and quoted threads. Email-sourced tickets carry auto-responder banners and quoted earlier replies — high-frequency noise the model will memorise. Strip them before training.
- Near-duplicates. A customer who fires the same complaint three times generates three near-identical rows. Train on one; drop the rest. The dedup signal catches these.
- Whitespace + encoding. Strip BOMs, collapse whitespace, normalise smart-quotes. Free quality improvement.
- PII. Customer names, emails, account numbers leak into ticket bodies. Decide your policy: redact, drop, or hash. The platform doesn't auto-redact; every row is an explicit decision. For routing, redaction is usually right — the team choice shouldn't depend on the customer's actual name.
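The whitespace, quoted-thread, PII, and duplicate steps above can be sketched as a small pre-import script. Everything here is an assumption for illustration: the regexes are deliberately simple, `[EMAIL]` is a hypothetical redaction placeholder, and exact-match dedup on a normalised key is a cruder stand-in for whatever the platform's dedup signal actually does.

```python
import re

# Map curly quotes to their ASCII equivalents.
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})
# Simple email matcher; real PII policies usually need more than this.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Quoted earlier replies in email-sourced tickets usually start with ">".
QUOTED_LINE_RE = re.compile(r"^\s*>.*$", re.MULTILINE)


def clean_ticket(text: str) -> str:
    """Strip BOM and quoted lines, redact emails, normalise quotes and whitespace."""
    text = text.replace("\ufeff", "")
    text = text.translate(SMART_QUOTES)
    text = QUOTED_LINE_RE.sub("", text)
    text = EMAIL_RE.sub("[EMAIL]", text)
    return " ".join(text.split())


def dedupe_key(text: str) -> str:
    """Case- and punctuation-insensitive key used to spot repeated complaints."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def clean_and_dedupe(rows):
    """rows is an iterable of (text, label); keeps the first row per key."""
    seen = set()
    kept = []
    for text, label in rows:
        cleaned = clean_ticket(text)
        key = (dedupe_key(cleaned), label)
        if cleaned and key not in seen:
            seen.add(key)
            kept.append((cleaned, label))
    return kept


demo = [
    ("Refund please!", "billing"),
    ("refund   please", "billing"),
    ("App crashes on launch", "bug_report"),
]
print(clean_and_dedupe(demo))
```

Keying on (text, label) rather than text alone means two identical bodies with different labels both survive, which is usually what you want: that pair is a labelling inconsistency to review, not a duplicate to drop silently.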
Then look at the diagnostic driving everything else: Shannon entropy. The platform reports both raw entropy (in nats) and a normalised form (0.0-1.0, where 1.0 is perfectly balanced). Reading the number:
- Normalised ≥ 0.85: well-balanced; no action.
- Normalised 0.70-0.84: modest imbalance; a single CLASS_BALANCE_FILL pass on the bottom one or two classes.
- Normalised 0.50-0.69: meaningful imbalance; the trainability forecast will discount your predicted_pass. Run CLASS_BALANCE_FILL until ≥ 0.7 before training.
- Normalised < 0.50: severe; don't train yet. Gather more data or run aggressive balance-fill until you cross 0.7.
Normalised entropy is the single most informative number on a multi-class triage dataset. If it's healthy, the model probably trains well. If it's not, no training config will save you.
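To sanity-check the number outside the platform, you can compute it from per-class counts. The sketch below assumes the common definition: entropy in nats, normalised by dividing by ln(K) for K classes. The platform's exact normalisation isn't documented in this tutorial, so its reported value may differ. The counts are hypothetical, and the band names mirror the list above.

```python
import math


def shannon_entropy_nats(counts):
    """Entropy of the class distribution, in nats. Empty classes contribute 0."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    probs = [c / total for c in counts.values() if c > 0]
    return sum(-p * math.log(p) for p in probs)


def normalised_entropy(counts):
    """Entropy divided by ln(K): 1.0 is perfectly balanced, 0.0 is one class only."""
    k = len(counts)
    if k < 2:
        return 0.0
    return shannon_entropy_nats(counts) / math.log(k)


def imbalance_band(norm):
    """Map a normalised entropy onto the bands described in this section."""
    if norm >= 0.85:
        return "well-balanced"
    if norm >= 0.70:
        return "modest imbalance"
    if norm >= 0.50:
        return "meaningful imbalance"
    return "severe imbalance"


# Hypothetical per-class counts for a 2,400-row gold set.
counts = {
    "technical": 1200,
    "billing": 800,
    "account": 250,
    "bug_report": 100,
    "feature_request": 30,
    "cancellation": 20,
}
norm = normalised_entropy(counts)
print(f"{shannon_entropy_nats(counts):.2f} nats, normalised {norm:.2f}, {imbalance_band(norm)}")
```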
Pick the recipe: classification or something else?
The decision tree for ticket triage:
| You want… | Use | Why |
|---|---|---|
| A single team label per ticket, with confidence | classification | Multi-class label with a fixed vocabulary; the eval pack scores macro_f1 + per-class F1; downstream router consumes the (label, confidence) pair |
| To highlight WHICH tokens drove the routing decision | span-extraction | Per-token spans labeled with the team they pointed at; useful for SOC-style analyst dashboards (see Tutorial 3) |
| A free-text explanation alongside the routing decision | qa-sft | "This should go to billing because…"; useful for analyst-tooling, not for queue dispatchers |
| Multiple labels per ticket (one ticket routes to two teams) | Not directly supported — split each multi-label ticket into N single-label rows during ingest, or model the joint vocab as a synthetic compound label (e.g. billing+technical) | The classification recipe expects one label per row from a fixed vocab |
| Hierarchical routing (top-level team, then sub-team) | Two classifiers chained — see "What's next" | One model for the parent label, a second model conditional on the parent — cleaner than a flat 50-way vocabulary |
For the canonical six-team triage use case: classification. Sticking with it for the rest of this tutorial.
Domain packs (the support gap)
BrewSLM doesn't currently ship a support-domain pack tuned for triage out of the box. There is a generic platform fallback pack that any project inherits if you don't pin one, and there's a separate workflow shape for support FAQ generation (covered in Tutorial 1) — but that's a different shape (free-text answer over a retrieval index), not a label classifier. For this tutorial you're operating on the recipe-level defaults plus your own gold set, which is fine for a first project.
Building a custom support-triage domain pack is a worthwhile follow-up project that this tutorial intentionally doesn't cover. It would bundle: stricter per-class F1 floors on revenue-protective labels (cancellation, billing), curated phrasing seeds for the long-tail intents your team specifically sees, a glossary linking eval gates to support-org SLO language, and an Academy tag pointing at this tutorial. If your customer-success org is shipping multiple triage classifiers (consumer vs B2B, free-tier vs enterprise), packaging the conventions as a domain pack pays back fast.
Build the gold set — your senior agents are the oracle
For ticket triage you have a higher-signal oracle than any other tutorial in this series: your senior support agents have already routed tens of thousands of tickets correctly. Their dispositions are your gold labels for free.
Path A — bootstrap from your ticket archive
For each historical ticket, pair the body text with the team that ultimately resolved it. Usually a one-time CSV export from Zendesk / Intercom / Salesforce / Freshdesk under analytics → tickets. Two caveats:
- Mid-ticket reassignments. A ticket that started in technical and was bounced to billing has noisy gold. Simplest filter: only include tickets whose first-assigned team equals their final-resolved team. You lose ~10-15% of volume; you gain confidence that the label is right.
- Stale labels. Your team taxonomy probably evolved. Restrict your gold pull to the last 12 months unless the taxonomy has been stable longer.
Aim for 1,500-3,000 gold rows from your archive — at least 50 examples of every class.
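The two caveats and the 50-per-class floor translate into a short filter over the export. This is a sketch under stated assumptions: the column names (`body`, `first_team`, `final_team`, `resolved_at`) are hypothetical and will differ per ticketing system, and the inline rows exist only to make the example runnable.

```python
import csv
import io
from collections import Counter
from datetime import datetime, timedelta


def filter_gold(rows, now, max_age_days=365):
    """Keep tickets that were never reassigned and were resolved recently."""
    cutoff = now - timedelta(days=max_age_days)
    kept = []
    for row in rows:
        if row["first_team"] != row["final_team"]:
            continue  # mid-ticket reassignment: the label is noisy
        if datetime.fromisoformat(row["resolved_at"]) < cutoff:
            continue  # stale: the taxonomy may have changed since
        kept.append({"text": row["body"], "label": row["final_team"]})
    return kept


def classes_below_floor(gold, labels, floor=50):
    """Return {label: count} for every class with fewer than `floor` rows."""
    counts = Counter(row["label"] for row in gold)
    return {label: counts[label] for label in labels if counts[label] < floor}


# Hypothetical export; column names are assumptions, not a real schema.
export = """body,first_team,final_team,resolved_at
Charged twice this month,billing,billing,2025-03-01
API returns 502,technical,billing,2025-02-10
Please cancel my plan,cancellation,cancellation,2023-01-15
"""
rows = list(csv.DictReader(io.StringIO(export)))
gold = filter_gold(rows, now=datetime(2025, 6, 1))
print(gold)
print(classes_below_floor(gold, ["billing", "cancellation"]))
```

Whatever `classes_below_floor` returns is the shopping list for Path B and for balance-fill later on.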
Path B — manual seed from senior agents
For the rare classes your archive doesn't cover well, ask senior agents to hand-label 30-50 fresh examples each. Their muscle memory is the strongest signal on the ambiguous cases newer agents would have routed wrong. The Gold Set workbench (Data Studio → Gold Set) supports per-row manual entry with a label dropdown.
Path C — LLM-assisted promotion
For tickets whose archive label is missing or suspect, use the standard "promote from raw" flow: bulk-import a few thousand candidate rows, run a teacher model (Ollama / OpenAI / Anthropic) via the platform's synth-backend with a prompt that takes your six-label vocabulary as input and asks the teacher to pick one per row, then review by class in the synth queue. A 30-minute teacher run plus an hour of grouped review produces 300-500 labelled rows.
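The shape of such a teacher prompt, and the guard you'd want on the reply, might look like the following. This is only a sketch: the prompt wording is an illustration rather than the platform's built-in template, and the call to the teacher itself is omitted because it goes through the synth-backend, whose interface isn't covered here.

```python
LABELS = ["billing", "technical", "account", "feature_request", "bug_report", "cancellation"]


def build_teacher_prompt(ticket_text, labels=LABELS):
    """Build a single-label prompt that spells out the closed vocabulary."""
    options = ", ".join(labels)
    return (
        "You are routing customer support tickets.\n"
        f"Pick exactly one label from: {options}.\n"
        "Reply with the label only.\n\n"
        f"Ticket: {ticket_text}\n"
        "Label:"
    )


def parse_teacher_label(reply, labels=LABELS):
    """Return the label if the reply is exactly one vocabulary entry, else None."""
    candidate = reply.strip().strip(".\"'").lower()
    return candidate if candidate in labels else None


print(build_teacher_prompt("I was billed twice in March."))
print(parse_teacher_label(" Billing.\n"))
print(parse_teacher_label("refunds"))
```

Rows where `parse_teacher_label` returns None are best left unlabelled for human review rather than mapped to the nearest class.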
✓ Checkpoint: the Data Studio Overview's Gold Set ready row should now be green ("2,400 gold rows ready · 100 recommended"). Check the per-class breakdown: every class should have ≥ 50 examples. If cancellation sits at 12 rows because cancellations are rare, manual seed + CLASS_BALANCE_FILL have to fix that gap before training.
Don't skip the "ambiguous" examples
Your gold must contain rows where the routing decision was genuinely hard — tickets that legitimately span two teams. Mark them with the team that actually got the ticket and let the model learn your org's tie-breaking convention. Without them, the model learns a clean separation production doesn't exhibit.
Stratified split — pre-partition into three files
The platform's Prepare Dataset panel runs a uniform random split off the ratios + a seed you set. For multi-class with imbalance, a uniform random split is fragile: with 64 cancellation rows in a 2,400-row dataset, an unlucky seed will put 4 of them in val and 6 in test, and your per-class F1 on cancellation becomes essentially noise. You want a stratified split — train, val, and test each carry the same per-class ratio as the whole — but the Prepare Dataset panel doesn't surface a "stratify by column" knob today.
The practical pattern: pre-partition into three files before import. In whatever scripting environment you're comfortable with, do a stratified split locally (Python's sklearn.model_selection.train_test_split with stratify=labels, or pandas group-by-label sample-fraction) producing three CSVs — tickets-train.csv, tickets-val.csv, tickets-test.csv — each carrying the same class proportions. Import each as a separate dataset in Data Studio and tag it as train / val / test in the dataset metadata.
For a 2,400-row gold set with the imbalanced distribution from earlier, an 80/10/10 stratified split produces approximately 1,920 train / 240 val / 240 test, and even the rarest class (64 rows) is guaranteed 6-7 examples in each of val and test. That's enough for the eval handler to compute a meaningful per-class F1 instead of a noisy one.
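If you don't want a scikit-learn dependency for this one step, a stratified split is short enough to write by hand. The sketch below uses only the standard library as a stand-in for train_test_split with stratify; the seed, the row shape ({"text", "label"} dicts), and the uncalled write_csv helper are illustrative assumptions.

```python
import csv
import random
from collections import defaultdict


def stratified_split(rows, ratios=(0.8, 0.1, 0.1), seed=13):
    """Split rows into train/val/test so each keeps the per-class proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)

    train, val, test = [], [], []
    for label in sorted(by_label):  # sorted for reproducibility across runs
        group = by_label[label]
        rng.shuffle(group)
        n_val = round(len(group) * ratios[1])
        n_test = round(len(group) * ratios[2])
        val.extend(group[:n_val])
        test.extend(group[n_val:n_val + n_test])
        train.extend(group[n_val + n_test:])  # remainder goes to train
    return train, val, test


def write_csv(path, rows):
    """Write rows as a two-column text,label CSV ready for import."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["text", "label"])
        writer.writeheader()
        writer.writerows(rows)


# Toy data: one big class and one rare class.
rows = [{"text": f"t{i}", "label": "technical"} for i in range(1840)]
rows += [{"text": f"c{i}", "label": "cancellation"} for i in range(64)]
train, val, test = stratified_split(rows)
for name, part in (("train", train), ("val", val), ("test", test)):
    rare = sum(r["label"] == "cancellation" for r in part)
    print(name, len(part), "rows,", rare, "cancellation")
```

Splitting per class and then concatenating is what makes the rare-class counts deterministic: the seed only decides which rows land where, not how many.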
If you can't pre-partition
If you have to use the platform's random-split flow, set the split ratio higher on val + test (say 70/15/15) to give rare classes more cushion against bad-seed roulette. Fix the seed across runs so your eval numbers are comparable between training experiments. This is a fallback — pre-partition is the right answer.
Generate CLASS_BALANCE_FILL + paraphrase + hard-negative drills
The classification recipe ships four playbooks in the Playbook Center: classification_paraphrase, classification_hard_negatives, classification_class_balance_fill, and classification_cluster_targeted. For ticket triage, run them in this order:
- CLASS_BALANCE_FILL: the headline drill
- Auto-detects the under-represented class in your gold set (or pin one via target_class) and generates more examples for it. Goal: pull your gold's normalised Shannon entropy to ≥ 0.7 (ideally ≥ 0.85). Turns a 30-row cancellation class into 120 trainable examples without hand-writing 90 more tickets. Generate ~50-80 rows per under-represented class.
- POSITIVES_PARAPHRASE: coverage extender
- Vary the wording of existing gold rows while keeping the label fixed. The model learns that "could you refund the double charge" and "I want a refund on the duplicate billing" are the same intent. Generate ~50 rows per major class.
- HARD_NEGATIVES: class-boundary disambiguation
- For triage the most useful hard negatives are tickets that mention a different team but should route elsewhere: "the technical team should fix this billing bug" (mentions technical, is billing); "I'm cancelling because the app keeps crashing" (mentions cancelling, is a bug-report). Generate ~40 rows.
- CLUSTER_TARGETED: post-eval drill
- After a first training pass, the Evaluation tab's FailureClustersPanel groups failures into clusters with a representative example each. The cluster-targeted playbook seeds new training rows from the worst cluster. Don't run this on the first pass; it needs an eval result to operate on.
Open Data Studio → Synthetic → Playbook Center. The classification recipe surfaces four playbook cards. Click CLASS_BALANCE_FILL first; pick your lowest-count class; set target count to 60; pick a backend (Ollama is the free default). Generation runs as a background Job.
Run CLASS_BALANCE_FILL before paraphrase
For an imbalanced multi-class dataset, balance-fill is the precondition for everything else. Paraphrase amplifies whatever shape your gold has — if technical outnumbers cancellation 30:1, paraphrasing the whole gold set preserves that 30:1 ratio (which is exactly what you don't want). Run balance-fill until entropy is healthy, THEN run paraphrase across the now-balanced gold, THEN run hard-negatives on the high-confusion class boundaries you've identified. Order matters more than volume.
Review the synth queue (bulk by class)
Every generated row lands in the Synthetic Review Queue with review_status="pending". The queue groups rows by source playbook; within each you can filter by class — useful after running balance-fill across multiple classes.
Per-row actions:
- Accept: the row joins training on the next dataset-prep run. For balance-fill specifically, check the generated example actually fits the target class. A teacher asked for 60 cancellation examples will sometimes produce a good billing complaint mis-labelled cancellation because the prompt anchored on the wrong word.
- Reject (soft): marked rejected with an optional reason tag. The useful tags for triage:
  - actually-class-X: reads cleanly as a different class than generated for. Common on balance-fill for rarer classes.
  - too-short: one- to three-word "tickets" with no signal. ("Refund please." teaches nothing even if the label is right.)
  - mixed-intent: the row legitimately spans two classes. Reject, but reflect on whether your label vocabulary is missing a category.
  - label-drift: right intent but the teacher used a synonym (cancel instead of cancellation). Reject; the label vocabulary must be exact.
- Purge: section-level action filtered by reason. Use once you're confident a reason-bucket is bad data. Never drop mixed-intent without thinking; those rows may reveal a real label-vocab gap.
Expect to reject 25-40% of balance-fill rows on first pass for the rarest classes — the teacher has fewest anchors to learn from. Acceptance rate climbs as you tune the prompt and the gold set grows.
✓ Checkpoint: after a review pass, the synth panel should show "accepted: 220 · rejected: 80 · pending: 0" or similar. Run the Shannon-entropy diagnostic again — your normalised entropy should now be ≥ 0.7. If it's not, run another balance-fill pass on the still-rarest class.
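The "keep filling the rarest class until entropy is healthy" loop can be planned on paper before you spend teacher time. The sketch below is a planning aid only, under loud assumptions: the starting counts are hypothetical, entropy is normalised by ln(K) (the platform's normalisation may differ), and rows_per_pass means rows that survive review, not rows generated.

```python
import math


def normalised_entropy(counts):
    """Shannon entropy of the class counts divided by ln(K)."""
    total = sum(counts.values())
    probs = [c / total for c in counts.values() if c > 0]
    return sum(-p * math.log(p) for p in probs) / math.log(len(counts))


def plan_balance_fill(counts, target=0.7, rows_per_pass=60, max_passes=20):
    """Simulate passes that each add accepted rows to the currently rarest class."""
    counts = dict(counts)  # work on a copy
    plan = []
    while normalised_entropy(counts) < target and len(plan) < max_passes:
        rarest = min(counts, key=counts.get)
        counts[rarest] += rows_per_pass
        plan.append(rarest)
    return plan, counts


# Hypothetical gold-set counts before any balance-fill.
start = {
    "technical": 1200,
    "billing": 800,
    "account": 250,
    "bug_report": 100,
    "feature_request": 30,
    "cancellation": 20,
}
plan, final_counts = plan_balance_fill(start)
print("passes, in order:", plan)
print("normalised entropy after plan:", round(normalised_entropy(final_counts), 3))
```

With a 25-40% rejection rate on rare classes, budget for generating noticeably more than rows_per_pass each time to end up with that many accepted rows.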
Training configuration
Open Training → New Experiment. Recipe defaults are sensible — for a first run, accept all of them:
- Base model
- HuggingFaceTB/SmolLM2-135M-Instruct. The recipe trains the model to emit the label as a token sequence; the eval handler does a field-match comparison against gold. Alternatives: Qwen/Qwen2.5-0.5B-Instruct for marginally better accuracy at 4x the size; distilbert-base-uncased for an encoder-only baseline.
- Adapter
- classification-label. Wraps each row's text + label into a prompt + reference-completion pair.
- Epochs
- 5-6. Multi-class with 2,000-3,000 gold rows needs slightly more epochs than binary: there is less per-class signal per epoch because rare classes are seen less. Three underfits the long tail; seven starts to overfit.
- Batch size
- Batch 8, no accumulation. Tickets are short (50-300 tokens) so memory isn't the constraint.
Expected runtime: 8-20 minutes on a single GPU, 25-50 minutes on CPU. The training panel shows live loss + validation macro_f1; if macro_f1 plateaus below 0.55 by epoch 3, kill the run — it's almost always a class-balance or labelling-consistency problem.
✓ Checkpoint: the experiment row shows loss trending down and validation macro_f1 trending up. By end of epoch 4-5 you should see val macro_f1 ≥ 0.70 and per-class F1 ≥ 0.55 on the rarest class. When training completes the bell pings; the detail page surfaces the per-class precision/recall/F1 grid.
Read the trainability forecast
Before training, the goal ledger's predicted_pass row forecasts based on row count, normalised class entropy, and base model size. For imbalanced multi-class, the targets to look for are:
- Predicted pass probability ≥ 65%. Lower usually means entropy penalty — fix balance first.
- Normalised class entropy ≥ 0.7. The forecast multiplies data-quality by entropy; 0.5 cuts the forecast roughly in half.
- Every class has ≥ 50 gold rows. Below that the rare class under-trains and the per-class F1 floor fails.
One nuance: the forecast sometimes over-estimates macro_f1 when the per-class floor is the binding constraint. It knows row count and entropy; it doesn't know that 50 rows of cancellation won't quite teach the model enough to clear F1 ≥ 0.50 on that class. Treat predicted_pass as a green/yellow/red light, not a precise prediction. Green with entropy 0.55 still warrants a balance-fill pass.
Evaluation: macro_f1, accuracy, and the per-class F1 floor
After training, the platform automatically evaluates against the project's eval pack. The classification scaffold ships four gates:
- min_macro_f1 ≥ 0.65 (required)
- The headline. Average of per-class F1, weighted equally. The metric that catches the imbalance bug where the model is great at majority classes and silently bad at minority ones; flat accuracy alone won't surface this on a 30:1 distribution.
- min_accuracy ≥ 0.70 (required)
- Coarse sanity check. Pairs informatively with macro_f1: high accuracy + low macro_f1 = imbalance bug (the model gets the common classes right and gives up on the rare ones); low accuracy + reasonable macro_f1 = bad model (no class is doing well, retrain).
- min_per_class_f1 ≥ 0.50 (required)
- Every individual class's F1 must clear 0.50. This is the class-starvation tripwire. A model that scores 0.85 macro_f1 but has cancellation F1 at 0.30 fails this gate even though the average looks healthy. For triage this is exactly the gate you want: it refuses to ship a model that quietly drops the rare classes.
- min_safety_pass_rate ≥ 0.93 (optional)
- Catches refusal / off-topic / adversarial inputs that should be flagged through a different path. Useful when the classifier sits behind a customer-facing UI; not strictly required for an internal routing service.
The goal ledger's eval_pass_rate row expands into the per-gate breakdown so you see exactly which dimension is failing — "macro_f1 0.71 ≥ 0.65 passed; min_per_class_f1 0.42 < 0.50 failed on class cancellation."
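To see how the three required gates interact, it helps to compute them by hand on a toy prediction set. The following is a sketch, not the platform's eval handler: the function names are hypothetical, the thresholds are the scaffold defaults quoted above, and an undefined F1 (a class with no true or predicted rows) is assumed to count as 0.

```python
def per_class_f1(gold, pred, labels):
    """F1 per label from parallel gold/pred lists; undefined F1 is scored 0.0."""
    scores = {}
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores[label] = (2 * tp / denom) if denom else 0.0
    return scores


def evaluate_gates(gold, pred, labels,
                   min_macro_f1=0.65, min_accuracy=0.70, min_per_class_f1=0.50):
    """Compute macro_f1, accuracy, and the per-class floor, plus pass/fail flags."""
    f1 = per_class_f1(gold, pred, labels)
    macro = sum(f1.values()) / len(labels)
    accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
    worst = min(f1, key=f1.get)
    return {
        "macro_f1": macro,
        "accuracy": accuracy,
        "per_class_f1": f1,
        "worst_class": worst,
        "passed": {
            "min_macro_f1": macro >= min_macro_f1,
            "min_accuracy": accuracy >= min_accuracy,
            "min_per_class_f1": f1[worst] >= min_per_class_f1,
        },
    }


# Toy case: the model never predicts the rare class.
labels = ["technical", "billing", "cancellation"]
gold = ["technical"] * 8 + ["billing"] * 8 + ["cancellation"] * 4
pred = ["technical"] * 8 + ["billing"] * 8 + ["billing"] * 4
report = evaluate_gates(gold, pred, labels)
print(report)
```

The toy case is the imbalance bug in miniature: accuracy passes its gate while macro_f1 and the per-class floor both fail, and worst_class points straight at the starved label.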
Reading the per-class grid
The experiment detail page shows the per-class precision / recall / F1 grid. Three patterns to look for:
- One class starves. Five classes at 0.80+ F1, one at 0.25. Almost always under-training on the rare class. Fix: more CLASS_BALANCE_FILL rows, then retrain.
- Two classes collapse. The model rarely distinguishes bug_report from technical. The label vocabulary has a boundary problem: either the classes overlap semantically (consider merging) or your gold is inconsistent across them (have a second reviewer relabel 50 random rows from both and reconcile).
- Rare classes get zero predictions. The model never emits cancellation; precision is undefined. The model learned that always-predicting-something-else is the lowest-loss policy. Run CLASS_BALANCE_FILL until that class has 100+ training rows.
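If you export the test-set predictions, two of these patterns are easy to detect mechanically. The sketch below assumes you have parallel lists of gold and predicted labels (how you export them is up to you); the helper names are hypothetical and the data is a toy example.

```python
from collections import Counter


def never_predicted(pred, labels):
    """Labels the model did not emit even once (the zero-predictions pattern)."""
    emitted = set(pred)
    return [label for label in labels if label not in emitted]


def most_confused_pair(gold, pred):
    """Unordered label pair with the most mutual errors (the collapse pattern)."""
    pair_errors = Counter()
    for g, p in zip(gold, pred):
        if g != p:
            pair_errors[frozenset((g, p))] += 1
    if not pair_errors:
        return None, 0
    pair, errors = pair_errors.most_common(1)[0]
    return tuple(sorted(pair)), errors


labels = ["billing", "technical", "bug_report", "cancellation"]
gold = ["bug_report"] * 5 + ["technical"] * 5 + ["cancellation"] * 2
pred = (
    ["bug_report", "bug_report", "technical", "technical", "technical"]
    + ["technical", "technical", "bug_report", "bug_report", "technical"]
    + ["billing", "billing"]
)
print("never predicted:", never_predicted(pred, labels))
print("most confused:", most_confused_pair(gold, pred))
```

Counting errors per unordered pair, rather than per direction, is deliberate: a boundary problem usually leaks both ways, and the combined count surfaces it sooner.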
If you want per-class precision/recall floors (e.g. cancellation recall ≥ 0.80 because missing a cancellation has direct revenue impact), the platform doesn't ship a UI editor for per-class gate authoring today. The path: copy the scaffolded eval pack JSON, add the per-class metric gates you need (referencing the per-class metric IDs the eval handler emits), save it under a custom pack ID, and select your custom pack from Project → Eval Pack in place of the default scaffold.
When the eval fails (common failure modes)
Multi-class triage failures concentrate around imbalance. Common patterns and fixes:
| Symptom | Root cause | Fix |
|---|---|---|
| Accuracy 0.82, macro_f1 0.51, per-class F1 fails on the rare class | Class imbalance — the model defaults to majority classes | Run CLASS_BALANCE_FILL on the under-represented class until normalised entropy ≥ 0.85. Retrain. |
| Two classes blur (low precision on both) | Class boundary is genuinely ambiguous in the gold | Sample 40 random rows from each class; have a second senior agent relabel blind. Disagreement > 20% = the classes overlap semantically. Either merge them in the label vocab or document a tie-breaking convention and re-label gold against it. |
| Rare class precision/recall = 0 across the test set | Too little training signal; model learned never to emit it | Aggressive CLASS_BALANCE_FILL plus 30+ manually-seeded rows from senior agents. Adding epochs only amplifies the bias toward majority classes; that's NOT the fix. |
| Macro_f1 strong on val (0.78) but tanks on test (0.55) | Random split drew an unlucky distribution | Re-split with stratification (pre-partition; see the Split section). |
| Mixed-intent tickets get bounced between predictions | Label vocab doesn't fit hybrid intents | Inspect the mixed-intent rejection bucket from the synth queue. If persistent: add a compound label or document a tie-breaking convention. |
| Post-eval decision engine recommends a different recipe | Engine sees signals your task shape is wrong | Read "Why this fired?". qa-sft → you may need free-text explanations; span-extraction → routing may depend on which tokens triggered the choice. Investigate. |
The FailureClustersPanel groups failures into clusters with a representative example each. Reading 5-10 examples from a cluster usually surfaces the failure shape in 30 seconds. The panel also offers a one-click launch of CLUSTER_TARGETED seeded from that cluster — the most efficient drill for a specific failure mode.
Ship with confidence thresholds
Once the eval pack passes — and especially once the per-class F1 floor passes on the revenue-protective classes — ship in three steps:
- Export the LoRA adapter. Open Models → Export. The platform writes adapter weights, tokenizer config, and a deploy manifest into data/projects/<id>/exports/. The adapter is ~5-15 MB; the base model loads fresh at deploy time.
- Deploy via vLLM or Ollama. The export bundle includes a launch script:
    cd data/projects/<id>/exports/run-2026-06-05
    ./deploy-vllm.sh
    # Serves the base + LoRA adapter on localhost:8000 via standard
    # chat-completions. The model emits the label as its response. You'll
    # typically wrap this in a thin /triage microservice that parses the
    # response, extracts the top-token log-prob as a confidence proxy, and
    # returns a (label, confidence) pair to your caller.
    # Latency: 5-15ms per ticket on a single GPU, 30-60ms on CPU.
  Ollama variant: ./deploy-ollama.sh.
- Wire the two-threshold router. Your ingest pipeline POSTs each new ticket to your /triage microservice, gets back (label, confidence), and routes:
  - High-confidence (≥ 0.85): auto-route to the predicted team.
  - Mid-confidence (0.60 to < 0.85): shadow-route to the predicted team but flag for human confirmation. Corrections are high-signal training data.
  - Low-confidence (< 0.60): triage-manager queue. The model's prediction is suggested but not pre-selected.
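The three-band policy above can be sketched as a small pure function that your /triage wrapper calls once it has a label and a confidence. This is a minimal sketch, not BrewSLM's implementation: the names `route`, `Routing`, `confidence_from_logprob`, and the action strings are hypothetical, and sending out-of-vocabulary labels to the manager queue is an assumption added here.

```python
import math
from dataclasses import dataclass

# Starting-point thresholds from the list above; recalibrate on your own traffic.
HIGH_CONFIDENCE = 0.85
LOW_CONFIDENCE = 0.60

# The six-label vocabulary used throughout this tutorial.
LABELS = {"billing", "technical", "account",
          "feature_request", "bug_report", "cancellation"}


@dataclass(frozen=True)
class Routing:
    label: str
    confidence: float
    action: str  # "auto_route", "shadow_route", or "manager_queue" (hypothetical names)


def confidence_from_logprob(top_token_logprob: float) -> float:
    """Turn the top-token log-probability into a 0-1 confidence proxy."""
    return math.exp(top_token_logprob)


def route(label: str, confidence: float) -> Routing:
    """Apply the two-threshold policy to one (label, confidence) pair."""
    if label not in LABELS:
        # Assumption: a response outside the vocabulary always goes to a human.
        return Routing(label, confidence, "manager_queue")
    if confidence >= HIGH_CONFIDENCE:
        action = "auto_route"
    elif confidence >= LOW_CONFIDENCE:
        action = "shadow_route"
    else:
        action = "manager_queue"
    return Routing(label, confidence, action)
```

Keeping the policy in one function with the thresholds as module constants means a recalibration is a two-line change, and the bands cover every confidence value with no gap between them.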
The thresholds above are a starting point — calibrate on YOUR traffic. Shadow-mode for two weeks before turning auto-route on: log the model's prediction and confidence alongside the team the human triager actually picked; compute precision per confidence band; set the high-confidence threshold where per-class precision exceeds 0.95.
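One way to turn a shadow-mode log into thresholds is sketched below. It assumes each log row is a `(predicted_label, confidence, human_label)` tuple; that row shape, the function names, and the candidate grid of thresholds are all hypothetical choices for illustration.

```python
from collections import defaultdict

# Default bands mirror the starting thresholds; the top band's upper bound is
# 1.01 so that a confidence of exactly 1.0 still falls inside [low, high).
DEFAULT_BANDS = ((0.85, 1.01), (0.60, 0.85), (0.0, 0.60))


def precision_by_band(rows, bands=DEFAULT_BANDS):
    """Precision of the model's prediction inside each [low, high) confidence band."""
    hits, totals = defaultdict(int), defaultdict(int)
    for predicted, confidence, actual in rows:
        for band in bands:
            low, high = band
            if low <= confidence < high:
                totals[band] += 1
                hits[band] += predicted == actual
    return {band: hits[band] / totals[band] for band in totals}


def per_class_precision_at(rows, threshold):
    """Per-class precision over rows whose confidence is at or above threshold."""
    hits, totals = defaultdict(int), defaultdict(int)
    for predicted, confidence, actual in rows:
        if confidence >= threshold:
            totals[predicted] += 1
            hits[predicted] += predicted == actual
    return {label: hits[label] / totals[label] for label in totals}


def pick_high_threshold(rows, target=0.95, candidates=None):
    """Lowest candidate threshold at which every predicted class exceeds the target."""
    if candidates is None:
        candidates = [round(0.50 + 0.01 * i, 2) for i in range(50)]  # 0.50 .. 0.99
    for threshold in sorted(candidates):
        precisions = per_class_precision_at(rows, threshold)
        if precisions and min(precisions.values()) > target:
            return threshold
    return None  # nothing qualifies: keep auto-route off and collect more data
```

A class that never appears above a given threshold is simply absent from that threshold's precision table, so check per-class row counts before trusting the result; a threshold picked from a handful of rows per class is noisy.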
Calibrate, don't trust the eval-set confidence
The confidence score from the model is well-calibrated on the test-set distribution. Your production traffic is not your test-set distribution: it shifts over time as accounts evolve and support tooling changes the way tickets get phrased. Always run in shadow mode before turning auto-route on, and re-calibrate the thresholds at least quarterly.
What's next
You have a deployed multi-class triage classifier with calibrated thresholds, a human-review fallback for mid- and low-confidence cases, and a per-class F1 floor that refuses to ship a model that drops the rare classes. Three next moves:
- Hierarchical routing
- If your org has sub-team specialisations (billing → refunds vs invoicing; technical → API vs infra), a flat 20-way classifier struggles. Chain two: the first emits the top-level team (the model you just shipped); a second, conditional on the parent, emits the sub-team. Same recipe + workflow, run twice.
- Active learning from disposition signals
- The highest-signal training data is in your shadow-routed and human-overridden queue. Every reviewer correction is a row proving the model didn't generalise. Capture corrections; promote the confident ones; retrain when the corrected pile crosses ~100 rows. Over a quarter the model gets sharper on the tickets your org actually sees.
- Expand the label vocabulary as new categories emerge
- Six labels today won't be six in two years. New product surface areas and new revenue motions create categories that didn't exist before. Retraining cost is low; the hard part is governance — every label-vocab change invalidates eval-comparable historical runs, so lean conservative.
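The hierarchical-routing move above amounts to a lookup from the parent label to a second classifier. A minimal sketch, assuming each deployed model is wrapped as a callable that maps ticket text to `(label, confidence)`; the `Classifier` type, `route_hierarchical`, and the rule for combining the two scores are illustrative assumptions rather than anything BrewSLM ships.

```python
from typing import Callable, Dict, Optional, Tuple

# Hypothetical wrapper type: ticket text -> (label, confidence).
Classifier = Callable[[str], Tuple[str, float]]


def route_hierarchical(
    ticket: str,
    top_level: Classifier,
    sub_models: Dict[str, Classifier],
) -> Tuple[str, Optional[str], float]:
    """Run the top-level model, then the sub-team model registered for its label."""
    team, team_conf = top_level(ticket)
    sub_model = sub_models.get(team)
    if sub_model is None:
        # No specialisation for this team: the top-level label is the final route.
        return team, None, team_conf
    sub_team, sub_conf = sub_model(ticket)
    # Assumption: if both scores are calibrated probabilities, their product
    # estimates P(team) * P(sub_team | team), the confidence in the full route.
    return team, sub_team, team_conf * sub_conf
```

Because the combined score is a product, it is never higher than either stage's score, so thresholds tuned for the flat six-way model will be too strict for two-stage routes and need their own calibration.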
Same recipe, different shape: Tutorial 2 — SQL injection classifier uses the same classification recipe for a binary problem with hard-negatives as the headline drill. The decision-tree: binary with balanced classes → hard-negatives is the spine; multi-class with imbalance → balance-fill is the spine.
From there, the tutorials hub links to other recipe shapes and the deeper Academy tracks.
Key terms
- Multi-class classification
- One label per row drawn from a fixed vocabulary of more than two classes. Distinct from binary (yes/no) and from multi-label (one row, multiple labels).
- Class imbalance
- The natural ratio of classes in real-world data is skewed. A model trained on imbalanced data defaults to majority-class predictions unless the gold set is balanced (via more data or via CLASS_BALANCE_FILL) and the eval gates require per-class minimums.
- Shannon entropy (normalised)
- A single number summarising how balanced your class distribution is: 1.0 is perfectly balanced, 0.0 is dominated by one class. Surfaced in Data Studio's Quality & Safety panel and factored into the trainability forecast.
- CLASS_BALANCE_FILL playbook
- BrewSLM's synth generator for fixing imbalance. Generates more examples of the under-represented class until entropy is healthy. Distinct from POSITIVES_PARAPHRASE, which paraphrases in proportion to existing counts.
- macro_f1
- Average of per-class F1 scores weighted equally regardless of class size. The headline gate in the classification eval pack; catches imbalance bugs flat accuracy misses.
- Per-class F1 floor (min_per_class_f1)
- Required gate: every individual class's F1 must clear a threshold. This is the class-starvation tripwire: a model that scores high macro_f1 by averaging away one bad class still fails this gate.
- Confidence-thresholded routing
- Deployment pattern where the model's predicted probability decides whether the router acts on the prediction (auto-route), shadow-routes for human confirmation, or punts to a triage-manager queue. The thresholds are calibrated against real production traffic, not the test set.
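The three numeric terms above are easy to compute by hand, which helps when sanity-checking what the platform reports. The sketch below uses only the standard library; the function names are hypothetical, and dividing entropy by the log of the class count is the usual normalisation, assumed here rather than confirmed as BrewSLM's exact formula.

```python
import math
from collections import Counter


def normalised_entropy(labels, num_classes=None):
    """Shannon entropy of the label distribution, scaled to the range 0-1."""
    counts = Counter(labels)
    k = num_classes if num_classes is not None else len(counts)
    if k < 2 or len(counts) < 2:
        return 0.0  # a single class carries no entropy
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(k)


def per_class_f1(y_true, y_pred):
    """F1 for every class that appears in the gold labels or the predictions."""
    scores = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def min_per_class_f1(y_true, y_pred):
    """The worst class's F1: the number the per-class floor is checked against."""
    return min(per_class_f1(y_true, y_pred).values())
```

On a toy set of 8 `technical`, 1 `billing`, and 1 `cancellation` ticket, a model that always predicts `technical` is 80% accurate, yet its macro_f1 is about 0.30 and its worst per-class F1 is 0.0, which is exactly the failure these gates exist to catch.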
Check yourself