End-to-end tutorials

Ship real projects, end to end

The other Academy tracks teach concepts. This one walks you through complete BrewSLM projects — from picking a dataset to shipping a deployed model behind a real endpoint. Each tutorial is one specific use case: support FAQ, SQL injection classifier, invoice extraction, customer ticket triage, code review, PII span tagging. Same workflow shape, different recipe per use case. New to BrewSLM? Start with Tutorial 0 — it sets you up and tours the UI in 15 minutes so the rest of the tutorials make sense.


0. Set up BrewSLM and your first project
Clone the repo, start the backend on port 8000, start the frontend on port 5173, log in as the bootstrap admin user, create an empty project, and tour the main surfaces (Data Studio, Training, Evaluation, Playground, the notification bell). The prerequisite for every other tutorial. No machine learning yet — just enough to land you in front of a working empty project ready to import data. Includes troubleshooting for the most common first-time issues (port conflicts, pip install failures, "no project access" after sign-in).

Recipe: n/a · Time: ~15 min · Difficulty: beginner
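
Port conflicts are the first thing to rule out before starting either process. The sketch below is not part of BrewSLM; it is a small standalone check, assuming both services listen on localhost, that reports whether the two ports named above (8000 for the backend, 5173 for the frontend) are already taken. The function names are illustrative.

```python
import socket


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts a TCP connection on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(0.5)
        # connect_ex returns 0 on success, i.e. something is already listening.
        return probe.connect_ex((host, port)) != 0


def check_ports(ports: dict[str, int]) -> dict[str, bool]:
    """Map each service name to whether its port is currently free."""
    return {name: port_is_free(port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, free in check_ports({"backend": 8000, "frontend": 5173}).items():
        print(f"{name}: {'free' if free else 'IN USE'}")
```

If a port reports as in use, stop the process holding it (or pick another port) before launching BrewSLM.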
1. Build a support FAQ assistant with the rag-protocol recipe
End-to-end: pick a dataset of FAQ triples (context, question, answer with citation), build a 60-row gold set, generate refusal / citation / format drills via the rag-protocol playbooks, train, score against the discipline pack (citation rate, hallucination rate, appropriate refusal rate), ship behind vLLM or Ollama with the BM25 retrieval index. Domain-agnostic — the same recipe works for ecom FAQ, legal QA, internal IT support, healthcare KBs.

Recipe: rag-protocol · Time: ~2h · Difficulty: intermediate
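
To make the three discipline-pack metrics concrete, here is a minimal sketch of one plausible way to compute them. The `EvalRow` schema and the exact definitions are assumptions for illustration; the discipline pack's real definitions may differ.

```python
from dataclasses import dataclass


@dataclass
class EvalRow:
    # Hypothetical per-example labels, not BrewSLM's actual schema.
    should_refuse: bool  # gold: the context does not contain the answer
    refused: bool        # the model declined to answer
    has_citation: bool   # the model's answer carries a citation
    supported: bool      # the answer is backed by the context (human or judge label)


def _rate(hits: int, total: int) -> float:
    return hits / total if total else 0.0


def discipline_scores(rows: list[EvalRow]) -> dict[str, float]:
    answered = [r for r in rows if not r.refused]
    refusable = [r for r in rows if r.should_refuse]
    return {
        # Share of answered questions that cite a source.
        "citation_rate": _rate(sum(r.has_citation for r in answered), len(answered)),
        # Share of answered questions whose content is not supported.
        "hallucination_rate": _rate(sum(not r.supported for r in answered), len(answered)),
        # Share of unanswerable questions the model correctly refused.
        "appropriate_refusal_rate": _rate(sum(r.refused for r in refusable), len(refusable)),
    }
```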
2. SQL injection classifier (security)
A binary classifier that flags injection-shaped queries inline before your DB layer at sub-10ms latency. Covers the classification recipe, the hard-negatives playbook (the make-or-break drill for security classifiers), per-class precision floors, an adversarial held-out set for novel-pattern generalisation, and the shadow-mode → tee-mode → inline-blocking deployment progression. Public datasets (OWASP, Kaggle) get you started; your own application logs supply the hard negatives.

Recipe: classification · Time: ~2.5h · Difficulty: intermediate
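
A per-class precision floor can be read as a release gate: each class must reach its own minimum precision or the model does not ship. The sketch below illustrates that idea in plain Python; the label names and floor values are made up for the example and are not taken from the classification recipe.

```python
from collections import Counter


def per_class_precision(y_true: list[str], y_pred: list[str]) -> dict[str, float]:
    """Precision per predicted class: true positives / all predictions of that class."""
    true_pos: Counter = Counter()
    predicted: Counter = Counter()
    for truth, guess in zip(y_true, y_pred):
        predicted[guess] += 1
        if truth == guess:
            true_pos[guess] += 1
    return {label: true_pos[label] / predicted[label] for label in predicted}


def failed_floors(precision: dict[str, float], floors: dict[str, float]) -> dict[str, float]:
    """Return the classes whose precision is below their floor.

    A class that was never predicted counts as 0.0 and therefore fails.
    """
    return {
        label: precision.get(label, 0.0)
        for label, floor in floors.items()
        if precision.get(label, 0.0) < floor
    }


# Illustrative labels and floors only.
y_true = ["benign", "benign", "injection", "injection", "benign"]
y_pred = ["benign", "injection", "injection", "injection", "benign"]
failures = failed_floors(
    per_class_precision(y_true, y_pred),
    {"injection": 0.90, "benign": 0.95},
)
```

Here `failures` contains only `injection`, because one benign query was flagged and drags its precision to 2/3.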
3. Invoice field extraction (structured)
A structured-extraction model that pulls vendor, total, line items, and dates out of free-form invoice text. Covers the span-extraction recipe with span-set scoring, gold-set construction via the span-tagging workbench, LLM-assisted promotion from existing AP records, stratified split by invoice template (avoids template-overlap leakage), and per-entity precision floors (the "total" gate sits tighter than "line items"). Public starting points: FUNSD, CORD, DocBank.

Recipe: span-extraction · Time: ~2.5h · Difficulty: intermediate
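
Template-overlap leakage happens when invoices sharing one layout land in both train and test. One way to avoid it, sketched below, is to assign whole templates to a split rather than individual rows. This is an illustration, not the platform's splitter; the `template_id` field name and the 80/10/10 fractions are assumptions.

```python
import random
from collections import defaultdict


def split_by_group(rows, group_key, fractions=(0.8, 0.1, 0.1), seed=0):
    """Split rows into train/val/test so that no group spans two splits."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    keys = sorted(groups)                 # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    n_train = int(len(keys) * fractions[0])
    n_val = int(len(keys) * fractions[1])
    assignment = {
        "train": keys[:n_train],
        "val": keys[n_train:n_train + n_val],
        "test": keys[n_train + n_val:],   # remainder goes to test
    }
    return {
        name: [row for key in group_keys for row in groups[key]]
        for name, group_keys in assignment.items()
    }


# Hypothetical rows: 10 templates, 3 invoices each.
invoices = [
    {"template_id": f"tpl-{t}", "text": f"invoice {t}-{i}"}
    for t in range(10)
    for i in range(3)
]
splits = split_by_group(invoices, "template_id")
```

The fractions apply to the number of templates, not the number of rows, so split sizes will be uneven when templates differ a lot in volume.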
4. Customer ticket triage (multi-class)
Route incoming tickets to the right team automatically. Covers multi-class classification, class-imbalance diagnosis via the Shannon entropy signal in Quality & Safety, the classification_class_balance_fill playbook as the headline drill, per-class F1 floors (the gate that catches a "starving" rare class), and a confidence-thresholded routing pattern: auto-route on high confidence, shadow-route on mid, queue on low.

Recipe: classification · Time: ~2.5h · Difficulty: intermediate
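
Two ideas from this tutorial are easy to sketch in a few lines: Shannon entropy as a class-imbalance signal, and confidence-thresholded routing. The code below is a standalone illustration; normalising entropy to the 0..1 range and the 0.9 / 0.6 thresholds are assumptions, not values taken from Quality & Safety.

```python
import math
from collections import Counter


def normalized_entropy(labels: list[str]) -> float:
    """Shannon entropy of the label distribution, scaled to 0..1.

    1.0 means perfectly balanced classes; values near 0 mean one class dominates.
    """
    counts = Counter(labels)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(len(counts))


def route(confidence: float, high: float = 0.9, mid: float = 0.6) -> str:
    """Three-band routing: auto-route, shadow-route, or send to the human queue."""
    if confidence >= high:
        return "auto-route"
    if confidence >= mid:
        return "shadow-route"
    return "queue"
```

A 9:1 label split scores roughly 0.47 on this scale, which is the kind of reading that would prompt a class-balance drill before training.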
5. Internal knowledge-base QA (qa-sft path)
The qa-sft counterpart to Tutorial 1's rag-protocol. When to pick memorisation over retrieval: small (<100 facts), stable corpora, offline-operation requirements, no citation discipline needed. Covers the decision matrix vs rag-protocol, the two qa-sft playbooks (paraphrase + cluster-targeted — note hard-negatives and class-balance-fill are NOT shipped for this recipe), and what to do when the post-eval decision engine recommends a reroute-to-RAG (the one-click sibling-project clone).

Recipe: qa-sft · Time: ~2h · Difficulty: intermediate
6. Code review nitpicker
Train a model to comment on code diffs the way your senior engineers do. Covers the code-review recipe, LLM-judge as the primary eval gate (since "right answer" is rarely exact-match for reviews), the by-author / by-repo split to prevent stylistic leakage, the hard-negatives playbook for "LGTM that should have flagged a bug" examples, and the draft-suggestion-mode-before-blocking deployment progression. Public starting points: CodeReviewer, CodeSearchNet, your own PR archive.

Recipe: code-review · Time: ~2.5h · Difficulty: intermediate
7. PII span tagging (safety)
Highlight emails / phone numbers / addresses / SSNs / credit cards in free-form text. Sibling tutorial to #3 (same span-extraction recipe, different annotation pattern: closed-set extraction vs open-set tagging). Covers per-entity recall floors (SSN ≥ 0.98, names ≥ 0.85), Faker-synthesized training data (never train on real PII), and the explicit-review-before-redaction pattern (audit log required; no auto-redact without keeping the original).

Recipe: span-extraction · Time: ~2.5h · Difficulty: intermediate
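
The recall floors above can be checked with exact-match span scoring. The sketch below is an illustration rather than the recipe's evaluator: spans are assumed to be `(start, end, label)` tuples, the label strings `SSN` and `NAME` are hypothetical, and only the two floor values come from the text above.

```python
from collections import defaultdict

Span = tuple[int, int, str]  # (start offset, end offset, entity label)

# Floor values from the tutorial summary; label names are assumed.
RECALL_FLOORS = {"SSN": 0.98, "NAME": 0.85}


def per_entity_recall(gold: list[set[Span]], pred: list[set[Span]]) -> dict[str, float]:
    """Exact-match recall per entity label across a list of documents."""
    found: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for gold_spans, pred_spans in zip(gold, pred):
        for span in gold_spans:
            total[span[2]] += 1
            if span in pred_spans:
                found[span[2]] += 1
    return {label: found[label] / total[label] for label in total}


def below_floor(recall: dict[str, float], floors: dict[str, float] = RECALL_FLOORS) -> list[str]:
    """Labels whose recall misses the floor; unseen labels count as 0.0."""
    return sorted(label for label, floor in floors.items() if recall.get(label, 0.0) < floor)
```

Recall rather than precision is the gate here because a missed SSN is a leak, while a false highlight only costs a reviewer a moment.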
8. Legal contracts via a custom domain pack
The headline tutorial for the domain-pack layer. The platform doesn't ship a legal pack today; you build one. Covers what a domain pack is (a typed configuration overlay on top of a recipe), the Domain Pack Manager UI, the cleaning, eval-threshold, and glossary fields you can override, and how packs are assigned per project. It also documents one current gap plainly: refusal-phrase swapping per pack is not yet wired through the playbook prompts, so you customize the playbook prompt manually at runtime.

Recipe: rag-protocol + custom legal pack · Time: ~3h · Difficulty: intermediate-advanced
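
The phrase "typed configuration overlay" can be pictured as a merge in which the pack's values win over the recipe's defaults, field by field. The sketch below is purely illustrative: the `PackLayer` class, its field names, and the example values are assumptions, not the schema the Domain Pack Manager uses.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PackLayer:
    """Hypothetical shape shared by recipe defaults and a domain pack."""
    cleaning: dict = field(default_factory=dict)
    eval_thresholds: dict = field(default_factory=dict)
    glossary: dict = field(default_factory=dict)


def apply_pack(recipe: PackLayer, pack: PackLayer) -> PackLayer:
    """Overlay the pack on the recipe: pack keys override, everything else is inherited."""
    return PackLayer(
        cleaning={**recipe.cleaning, **pack.cleaning},
        eval_thresholds={**recipe.eval_thresholds, **pack.eval_thresholds},
        glossary={**recipe.glossary, **pack.glossary},
    )


# Illustrative values only.
recipe_defaults = PackLayer(
    cleaning={"normalizer": "default-normalizer"},
    eval_thresholds={"citation_rate": 0.80},
)
legal_pack = PackLayer(
    eval_thresholds={"citation_rate": 0.95},
    glossary={"indemnify": "compensate for loss"},
)
effective = apply_pack(recipe_defaults, legal_pack)
```

The recipe defaults stay untouched, which is what lets the same recipe serve projects with and without a pack.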
9. Building domain packs — the deep reference (finance / AP example)
The canonical deep-dive on pack construction. Tutorial 8 used legal as a quick walkthrough; this tutorial is ~70% pack methodology and ~30% finance worked example. Covers pack anatomy field-by-field (sourced from domain_pack_service.py), the cleaning / weighted-score-evaluator / glossary overlays, currency normalisation per region (USD / EUR / GBP / INR / JPY / CHF conventions), per-entity precision priorities tied to SOX key controls, versioning + manual sharing, applying to invoice + PO + statement projects, and the anti-patterns that wreck packs in the long run. Honestly documents which overlays are JSON-only today and where the platform currently leaves work to your team: default-normalizer is the pass-through default; the DomainPackManager picker now badges safe-cleanup-normalizer with a ★ as the recommended swap-in, alongside 4 other built-in normalizers (whitespace-collapse, html-entity-decode, lowercase-text, currency-canonical), so no plugin is required to activate cleanup behavior; per-entity gating still lives in the eval pack, not the domain pack.

Recipe: span-extraction + custom finance pack · Time: ~3.5h · Difficulty: advanced
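
To give a feel for what a cleanup normalizer does, here is a standalone sketch that treats each one as a text-to-text function. Only the normalizer names come from the summary above; the behaviour shown is a guess from those names, the sequential chaining is an assumption made for illustration, and currency-canonical and safe-cleanup-normalizer are left out because their rules are not described here.

```python
import html
import re

# Guessed behaviour for three of the built-in names; not the platform's code.
NORMALIZERS = {
    "default-normalizer": lambda text: text,  # pass-through
    "whitespace-collapse": lambda text: re.sub(r"\s+", " ", text).strip(),
    "html-entity-decode": html.unescape,
    "lowercase-text": str.lower,
}


def run_chain(text: str, names: list[str]) -> str:
    """Apply the named normalizers left to right."""
    for name in names:
        text = NORMALIZERS[name](text)
    return text


raw = "  Acme &amp;  Co.\n Invoice "
cleaned = run_chain(raw, ["html-entity-decode", "whitespace-collapse", "lowercase-text"])
# cleaned == "acme & co. invoice"
```

With the pass-through default, `run_chain(raw, ["default-normalizer"])` returns the text unchanged, which is why swapping in a cleanup normalizer is the step that actually activates cleaning.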

What's covered in every tutorial

Each tutorial walks the same 15 stages so once you've done one, the rest are a different shape of the same workflow:

What you'll build (concrete deliverable)
Why this approach (cost / latency / scope framing)
Dataset selection (public sources + your own)
Ingestion + adapter mapping
Cleanup + PII review
Recipe pick (decision tree)
Domain pack (when applicable)
Gold set — manual + LLM-assisted
Train / val / test split
Synthetic data drills (recipe-specific playbooks)
Review queue + soft-reject flow
Training configuration
Trainability forecast read
Evaluation against the recipe's eval pack
Shipping (export + deploy + playground smoke-test)