Tutorial 5 · End-to-end · Internal KB

Internal knowledge-base QA via the qa-sft recipe — the simpler path

By the end of this tutorial you'll have a small language model that answers questions from your internal knowledge base — "how do I get VPN access?", "what's the office WiFi password?" — directly from its weights, with no retrieval index alongside. Inference runs around 200ms on a small GPU. The pipeline is intentionally lighter than the rag-protocol path: no citation discipline, no refusal phrase, no BM25 index to maintain. When facts are stable and the corpus is small, this is the right tool. When they're not, this tutorial covers the honest signal that says "graduate to rag-protocol" and the one-click reroute that does it for you.

Level: intermediate Time: ~1.5 hours total (less ceremony than the rag-protocol tutorial) Prerequisites: Tutorial 0 (Setup BrewSLM). Optional companion: Tutorial 1 (Support FAQ with rag-protocol) — this tutorial is the explicit counterpart, picking memorisation where T1 picked retrieval.

Before you start

This tutorial assumes BrewSLM is running locally at http://localhost:5173 with an admin user signed in. If you haven't done that yet, complete Tutorial 0 — Set up BrewSLM and your first project first. It takes ~15 minutes and is the prerequisite for every tutorial in this track.

You'll also want, before you start: 40-80 canonical Q&A pairs from your internal knowledge base. Easiest sources: resolved internal-IT tickets, employee-handbook FAQ pages, Slack-canonical-answer threads (the ones with a thumbs-up emoji on the reply that gets reposted twice a month). If you don't have your own data yet, the public SQuAD dataset or Natural Questions works for validating the workflow before you point at your own corpus.

Terms you'll see in this tutorial (click to expand)
Recipe
The training-plan template you pick when creating a project. For this tutorial: qa-sft. Defines the base model + adapter + eval pack defaults for direct-answer QA — no retrieval index, no citation marker, just (question, answer) supervised fine-tuning.
Adapter
The mapping layer that converts your CSV/JSONL rows into training-ready fields. For qa-sft the adapter is qa-pair — it reads a question column and an expected (answer) column. Rationale is optional.
Task profile
The shape category the platform uses to pick eval handlers and synth playbooks. qa-sft's task profile is instruction_sft — same family as a generic SFT task, with QA-specific defaults layered on top.
Scoring mode
How the eval handler compares prediction to gold. qa-sft uses field_match — the prediction is scored against the single expected field via exact_match + F1, with an LLM-judge gate on top to catch semantic equivalence the string metrics miss.
Memorisation
The model encodes each (question, answer) pair into its weights during training. At inference, the same (or paraphrased) question retrieves the answer from those weights. No index, no retrieval step. The trade: when a fact changes, you retrain.
Canonical answer
The single "official" version of an answer to a question. qa-sft trains on one canonical answer per question — paraphrasing the question is encouraged, paraphrasing the answer is noise.
LLM-judge pass rate
A semantic-equivalence gate. A teacher model reads (gold answer, predicted answer) and scores whether they say the same thing. Catches the case where "Visit Settings → Security" and "Go to Settings, then Security" should both pass even though exact_match says they don't.
Reroute-to-RAG
The platform's one-click escape hatch when the post-eval decision engine detects you're using qa-sft for a workload that wants retrieval. Clones your project as a sibling with runtime_config.rag_first=True, carries your gold set forward, builds the BM25 index. You compare both side-by-side.

This is BrewSLM's canonical workflow for direct-answer QA over a small, stable knowledge base — the kind of corpus an internal-IT team at a 50-engineer company actually maintains. The classic shapes: "how do I get VPN access?", "where do I file expense reports?", "what's the office WiFi password?". These don't change weekly. The answers fit in the model's weights cheaply. You don't need a retrieval index to look them up — you need the model to remember them.

The end state is the model you'd actually deploy at a B2B engineering team: small enough to host on the same box as your Slack-bot worker, accurate enough that the helpdesk channel goes quiet, simple enough that one engineer can own the whole pipeline.

What you'll build

A direct-answer QA assistant. Concretely:

The model is a fine-tuned LoRA adapter on top of SmolLM2-135M-Instruct. Inference is ~200ms on a single GPU, 600-900ms on CPU (still fine for a Slack-bot). No citations in the output (the part rag-protocol does and qa-sft doesn't); when a fact changes, you update the gold set and retrain.

Key idea

qa-sft is the simpler tool. Less ceremony than rag-protocol — no citation discipline, no refusal training, no index to maintain. It memorises facts. When facts change, you retrain. That's the trade. If your corpus is small (under ~100 facts) and stable (changes less often than once a month), this is the right pick. If it isn't, the post-eval decision engine in BrewSLM will tell you to reroute and we'll walk through how to take that recommendation gracefully.

Why a small model (not regex, not a frontier API, not always RAG)

Three options people reach for when building an internal helpdesk bot. Use this comparison:

ApproachHandles paraphrased questions?LatencyCostPrivacy / offline
Regex / keyword Slack-botNo — fails on "how do I VPN" vs "VPN access"<5msFreeSelf-hosted
Frontier LLM via APIYes1.5-3s + queue$0.003-0.02 per query; $300-$3000/month at internal-team volumesEvery employee question leaves your network
Small qa-sft model (this tutorial)Yes (paraphrase playbook drills exactly this)200-900ms~$30/month on a small GPU, $0 on CPUSelf-hosted; runs offline

Regex dies the first time someone types "how the heck do I VPN" — the register your engineers actually use diverges hard from the canonical FAQ wording. Frontier-LLM works but you're paying per question and sending every employee's "what's our parental leave policy" out to a third-party API. The small fine-tuned model is the middle path: handles paraphrase, stays on your hardware, costs nothing per query at runtime.

The remaining decision is qa-sft vs rag-protocol — both are BrewSLM recipes for QA, and they look almost identical from the outside. The recipe-choice section below is the real load-bearing piece of this tutorial: the answer changes based on properties of your corpus, not preferences. Read it carefully before you start building.

Choose your dataset

You need (question, answer) pairs. The answer is the canonical thing the model should say when asked the question. Three common starting points:

Your own knowledge base
Most internal-IT teams already have the data, scattered across three places. Confluence / Notion pages where an H2 is the question and the paragraph below is the answer. Slack threads where someone asked, IT replied, and the reply got a thumbs-up. Resolved tickets in Jira / Linear with the "answer" field populated. Extract Q→A pairs from all three; you'll find 40-80 high-signal ones in any half-mature engineering org.
Public starting points (for warmup)
SQuAD ships 100k QA pairs over Wikipedia paragraphs — the canonical QA benchmark. Natural Questions is Google's dataset of real user questions answered from Wikipedia. Either gets you to a working pipeline in 30 minutes; swap in your real corpus once you've verified the platform end-to-end.
Resolved helpdesk archives
A year of resolved internal-IT tickets already contains canonical Q&A pairs — the question is the ticket title, the answer is the first reply that closed the ticket. Filter for tickets that resolved without follow-up and you've got pre-validated gold candidates.

How many rows do you need?

40-80 canonical Q&A pairs is a good seed. qa-sft is a memorisation task — every gold row is a fact the model commits to weights. The paraphrase synth playbook then multiplies your seed 3-5x by generating variant phrasings of the same questions, so 60 seed rows becomes 200-300 trainable rows.

Ingest and map

In BrewSLM, create a new project: Projects → New Project → qa-sft recipe. The recipe pre-fills the adapter (qa-pair), the task profile (instruction_sft), the scoring mode (field_match), and the eval pack scaffold (exact_match + F1 + LLM-judge gates).

Open Data Studio → Import. Your CSV or JSONL should look like:

question,expected,rationale
"How do I get VPN access?","File a ticket in #it-help with the project you need access to, then run the Tailscale installer that IT sends back.","Standard onboarding flow; the project-scoping prevents broad Tailscale acls."
"What's the office WiFi password?","The network is brewslm-corp, password is in 1Password under Office WiFi. Mobile devices use brewslm-guest with a daily PSK posted in #office.",""
"How do I expense a $30 dinner?","Use Ramp. Photograph the receipt, tag the project code, submit. Under $200 manager approval is automatic.",""

Two required columns (question, expected) and one optional (rationale). Rationale is for your future-self — some eval handlers use it; the qa-sft recipe doesn't require it for training. The mapping picker scans your columns and proposes a binding; click Apply mapping when the preview rows look right.

✓ Checkpoint: the Data Studio Overview now shows your imported row count (e.g. "62 trainable rows"), and the Sources panel lists your imported CSV/JSONL with a green status badge. If the mapping is wrong, the preview will show empty or mis-shaped rows — click Edit mapping on the panel and pick the correct column for each field.

One canonical answer per question

If your data contains the same question with multiple slightly-different answers ("Use Tailscale" / "We use Tailscale, install the desktop app" / "Tailscale, ping #it-help if it doesn't work"), pick one canonical phrasing and drop the rest. Training on three different answers to the same question teaches the model that the question has a distribution of valid responses — which is noise, not signal, when what you actually want is "say this one thing".

Cleanup

Open Data Studio's Quality & Safety panel. Three deterministic scans worth running:

You don't have to clean everything before training; you do have to clean the rows you promote to gold.

Pick the recipe: qa-sft or rag-protocol?

This is the load-bearing decision in the whole tutorial. BrewSLM ships two recipes for question-answering and they look surprisingly similar from the outside. The difference is which property of your data they exploit:

Use this decision matrix:

Question about your corpusPick qa-sftPick rag-protocol
How often do your facts change?Rarely (less than once a month)Frequently (weekly or more)
How big is your corpus?Small — under ~100 factsLarge — over ~500 facts (model can't memorise all of them)
Do you need a citation in the output?No — direct answer is fineYes — auditability / source-of-truth requirements
Do you need clean refusal when the answer isn't known?No — best-effort is acceptableYes — guessing is unacceptable
Will you run offline / air-gapped?Yes — qa-sft works without a retrieval index loadedIf you can load the BM25 index alongside, fine
Do you need a compliance audit trail (which doc produced which answer)?NoYes — citations are the audit trail
How much pipeline complexity are you willing to maintain?Less is better — fewer moving partsOK with index maintenance + protocol-shaped gold

The internal-IT helpdesk inside a 50-engineer company is the canonical qa-sft case. Facts are stable across quarters. The corpus is small — 40-80 canonical Q&As cover 95% of helpdesk volume. Citations aren't required because the answer is itself the source-of-truth from the team that maintains the policy.

An ecommerce FAQ that updates monthly, a legal QA system that touches new statutes, or a customer-support bot covering 600 product SKUs — those are rag-protocol territory. Tutorial 1 walks through that path.

Sticking with qa-sft for this tutorial

If two or more rows in the matrix point at rag-protocol, stop here and switch to Tutorial 1 instead. If they all point at qa-sft (the canonical internal-IT helpdesk case), keep going. The remaining tutorial assumes you've made the qa-sft call deliberately.

Domain packs (optional)

BrewSLM ships a few domain packs (support, legal, ecommerce) that bundle cleaning rules, eval thresholds, and synth-recipe constraints scoped to one vertical. The support pack is the closest match to internal-IT helpdesk — it pre-loads PII patterns common to ticket data.

For most internal-KB use cases domain packs are optional. The platform defaults work fine; the support pack is a refinement you can apply later from Project → Domain → Pack if you find yourself doing the same PII cleanup on every project. The recipe defaults plus the cleanup step above cover everything important.

Build the gold set

The gold set is the trusted reference your model is trained and evaluated against. For qa-sft this is the most important data work you'll do — every gold row is a fact the model commits to memory. Aim for 40-80 rows; quality matters more than quantity.

Path A — manual seeding from your canonical answers

Open the Gold Set workbench (Data Studio → Gold Set). For each row you add:

  1. Write the question the way a real teammate would ask it. If you have Slack logs, copy the actual wording — "how do I get on VPN" beats the textbook "How do I request VPN access?".
  2. Write the expected answer. Make it short, factual, and complete in a single response. If the canonical answer is currently a 12-sentence Confluence page, distill it to the 2-3 sentence version that would fit in a Slack reply.
  3. Optionally fill rationale — a one-line note on why this answer is right. Helps your future self when you're auditing the gold set six months later.

Spend an hour here. The model will inherit the tone, format, and granularity of these rows — casual and short stays casual and short; formal and exhaustive stays formal and exhaustive.

Path B — LLM-assisted promotion from raw exports

If you bulk-imported a few hundred Confluence pages or resolved tickets, use the same "promote from raw" flow other tutorials use:

  1. Bulk-import the raw pages / tickets as candidate rows.
  2. Run a teacher model (Ollama is the free default; OpenAI / Anthropic / DeepSeek work if you have keys) against them with a prompt asking it to extract (question, answer) pairs.
  3. Every extracted pair lands in the synth review queue with review_status="pending". Accept the good ones; they're promoted to gold.

This compresses 3 hours of manual gold-writing into 30 minutes of review. The platform will not auto-accept LLM-generated gold — each promotion is an explicit click.

✓ Checkpoint: the Data Studio Overview's Gold Set ready row should now be green ("60 gold rows ready") or amber ("32 gold rows · 50 recommended"). Amber is fine for now — the paraphrase playbook in the next section will multiply your seed. If the row is grey ("No Gold Set yet"), nothing got promoted — open the Gold Set workbench directly to confirm rows landed there.

Splitting train, validation, test

BrewSLM auto-splits when you click Run prepare now on the Data Studio Prepare Dataset panel. The default ratios are 80/10/10 with a deterministic seed. For qa-sft, uniform random off the canonical config is fine.

This is genuinely different from the other tutorials. Span-extraction cares about template-leakage; code-review cares about author-leakage. qa-sft doesn't have an equivalent leakage trap — questions are independent, answers paraphrase-stable. Random split is honest here.

Override the default ratios only when your gold set is small (under 50 rows): use 70/15/15 so val/test have at least 7-10 rows each. Below that the eval metrics get too noisy to read.

Generate synthetic drills (just paraphrase)

The qa-sft recipe ships two synth playbooks, and the headline one for a fresh project is the paraphrase playbook. This is intentional — qa-sft is the smaller, simpler recipe and it deliberately doesn't ship the modes that classification or span-extraction do. If you've read Tutorial 2 (SQL injection) or Tutorial 3 (invoice extraction), you'll have seen hard-negative and class-balance-fill playbooks; those are not available for qa-sft, and that's by design. There are no "negative classes" in direct-answer QA the way there are in classification — every question has a correct answer; there's nothing to be a "negative" against.

qa_sft_paraphrase — the workhorse (POSITIVES_PARAPHRASE mode)
For each gold row, generate N paraphrases of the QUESTION while keeping the ANSWER verbatim. Goal: the model learns that "how do I get VPN?", "I need VPN access, where do I go?", "VPN setup steps?" all map to the same canonical answer. Generate ~3-5 paraphrases per gold row; a 60-row seed expands to 240-360 trainable rows.
qa_sft_cluster_targeted — for iteration 2 (CLUSTER_TARGETED mode)
After your first training pass + eval, the platform's failure-cluster surface will identify question-types the model is weak on. The cluster-targeted playbook generates new examples specifically aimed at those clusters. Don't run this on iteration 1 — there's no failure cluster yet. It's the right tool for round 2.

Open Data Studio → Synthetic → Playbook Center. The qa-sft recipe surfaces the paraphrase card; click it, set target count to 3-5× your gold size, pick a backend. Generation runs as a background Job — the notification bell tracks progress.

Paraphrase the question, not the answer

The whole point of the paraphrase playbook is to teach the model question-invariance — "the same fact answers many phrasings". If the teacher model paraphrases the ANSWER too, you'll be training the model that the same question can have N slightly-different acceptable answers, which is the noise mode this tutorial keeps warning against. The review queue (next section) is where you catch this.

Review the synth queue

Every paraphrase row lands in the Synthetic Review Queue with review_status="pending". The queue groups rows by source gold row, so you can scan "here are 5 paraphrases of the VPN question" in one screen. Per-row actions:

The key check for qa-sft paraphrase rows is "does the answer match the seed verbatim?". If yes (most rows), accept. If the teacher paraphrased the answer too — even slightly — soft-reject with the answer-paraphrased reason. You want question variation, not answer variation.

Expect to accept ~70-85% on first pass. If you're rejecting more than half, the teacher prompt is drifting and the playbook card's prompt editor lets you tweak the system message.

Training configuration

Open Training → New Experiment. The qa-sft recipe defaults are sensible — for a first run, accept all of them:

Base model
HuggingFaceTB/SmolLM2-135M-Instruct. Small, instruction-tuned, runs on consumer hardware. Alternatives the recipe surfaces: Qwen/Qwen2.5-0.5B-Instruct (slightly better quality at ~4x the size), Qwen/Qwen2.5-3B-Instruct (noticeably better, needs more VRAM).
Adapter
LoRA, rank 16, alpha 32, target modules q_proj,k_proj,v_proj,o_proj. Standard. No separate classification head, no custom output projection — qa-sft uses the base model's standard LM head; the adapter just trains its attention projections.
Learning rate
2e-4. Higher than full fine-tuning because LoRA has fewer trainable parameters.
Epochs
2. Slightly fewer than the classification or span-extraction tutorials because the task is simpler — the model is memorising direct answers, not learning a discriminative boundary. For 250-row training sets, 2 epochs typically hits the loss plateau; 4+ overfits and the model starts emitting verbatim training rows even on questions it shouldn't.
Batch size + gradient accumulation
Batch 4, accumulate 4 → effective batch 16. Adjust down on GPUs under 8 GB.

Expected runtime: 3-10 minutes on a single GPU (RTX 3060 or better), 10-25 minutes on CPU. The training panel shows live loss + a sparkline; if loss isn't dropping after the first 50 steps, kill the run and check your data — the dataset is probably misshapen.

✓ Checkpoint: in the Training tab, your experiment row shows a live sparkline that drops from ~2-3 in the first few steps down to ~0.3-0.6 by the end. The bell shows a "training" notification with a percentage. When complete, the experiment row turns green and the experiment detail page shows the final loss + a "Run evaluation" button.

Read the trainability forecast

Before kicking off the training run, the platform pre-computes a trainability forecast: given your current data + gold set + base model, what's the predicted pass probability? The goal ledger on the Data Studio overview shows it as the predicted_pass row.

qa-sft's pass probability tends to be higher than rag-protocol's for the same-size data set. That's because there's no citation discipline to learn, no canonical refusal phrase to imprint, no protocol-shaped gates — just (question → answer). Healthy targets for a fresh internal-KB project:

If the forecast is below 50%, add data before you train. The goal ledger's blockers panel tells you which component is weakest.

Evaluation: exact_match + F1 + LLM-judge

After training, the platform automatically evaluates against the project's eval pack. For qa-sft projects the scaffolded pack carries four gates:

Exact match ≥ 0.45 (required)
Share of predictions where the model emitted the gold answer verbatim. 0.45 is intentionally not strict — exact_match is a coarse signal for natural-language answers because "Visit Settings → Security" and "Go to Settings, then Security" both say the same thing but differ on every token. exact_match catches gross paraphrase drift; the LLM-judge gate below catches the semantic case.
F1 ≥ 0.60 (required)
Token-overlap F1 between prediction and gold. A middle ground — partial credit for predictions that contain the right content words but reorder them. Standard SQuAD-style F1.
LLM-judge pass rate ≥ 0.75 (required)
A teacher model reads (gold, prediction) and scores whether they say the same thing semantically. This is the gate that actually matters for natural-language QA. It catches the "Visit Settings → Security" / "Go to Settings, then Security" case as a pass even though exact_match fails. It also catches the failure mode where exact_match passes (model memorised the gold verbatim) but the answer is wrong in context.
Safety pass rate ≥ 0.93 (not required)
Refusal / off-topic / adversarial input handling. Marked non-required for qa-sft because the recipe doesn't train refusal behaviour — if you need this gate active, it's a signal you're in rag-protocol territory.

The goal ledger's eval_pass_rate row expands into the per-gate breakdown so you see exactly which gate failed — "exact_match 0.41 / ≥ 0.45 failed, f1 0.63 / ≥ 0.60 passed, llm_judge 0.78 / ≥ 0.75 passed". The exact_match-vs-LLM-judge split is the signal you'll learn to read.

When the eval fails — and the reroute escape hatch

Common qa-sft failure patterns and the fix for each:

SymptomRoot causeFix
LLM-judge passes (≥ 0.80) but exact_match low (≤ 0.30)Model paraphrases answers — semantically right but emits its own wording rather than your canonical versionUsually acceptable for a helpdesk bot. If you need verbatim, add the canonical wording as a system-prompt prefix at deploy time and retrain with the prefix in the gold question.
exact_match high but LLM-judge lowModel memorised the gold rows verbatim but the answer is wrong in context — overfitting on the training distributionReduce epochs from 2 to 1; add more paraphrase rows; check for stale gold answers that are no longer correct.
Both metrics mediocre AND post-eval engine recommends reroute-to-RAGYour workload is retrieval-shaped — too many distinct facts to memorise, or facts that need citationWalk the reroute trace below and accept the recommendation.
Model invents facts on questions outside the KBqa-sft doesn't train refusal behaviour — when asked something it doesn't know, the model generates a plausible-looking but fabricated answerThis is a known qa-sft limitation. The answer is to reroute to rag-protocol, which DOES train refusal. Don't try to hand-train refusal into qa-sft — that's working against the recipe.

The reroute-to-RAG flow (when memorisation stops being enough)

BrewSLM's post-eval decision engine looks at your eval result and the shape of your gold set. When it sees retrieval-shaped signals — low pass rate combined with high answer-diversity in your gold (many distinct factual answers) — it surfaces a recommendation on the Decisions panel: "You're memorising what should be retrieved. Reroute to rag-protocol?"

Expand the "Why this fired?" disclosure. You'll see the actual numbers ("answer-diversity ratio 0.82 ≥ 0.70 threshold", "exact_match 0.31 < 0.45 gate"). If the trace matches your read of the data, click "Switch to RAG (keeps your gold set)". The platform runs clone_project_for_rag, which does three concrete things:

  1. Creates a sibling project with your project name + " (RAG)" suffix. The new project's parent_project_id points back at the qa-sft project.
  2. Forces runtime_config.rag_first=True and auto_rag.enabled=True on the clone. At inference time the playground uses the BASE model (no LoRA) plus BM25 retrieval — no new training step.
  3. Carries your gold set + prepared splits forward and builds the BM25 index immediately, so the sibling's playground works the moment the clone completes.

The original qa-sft project stays put — nothing is destroyed. You now have two projects side-by-side. Open the playground on each, ask the same 10 questions, compare. If you accept the reroute, the rest of the workflow is Tutorial 1's — gold set, cleanup, and deployment pattern all carry forward; what changes is the recipe and the eval pack.

The honest move

The reroute recommendation is one of BrewSLM's better moments. The platform isn't trying to keep you on qa-sft because that's the project you started with — it'll volunteer that a different recipe might do better, with the trace to back the call. Take the recommendation seriously. If the trace looks right, accept it. The qa-sft project doesn't go away; you compare both and ship whichever wins.

Ship the model

Once the eval pack passes, ship in three steps:

  1. Export the LoRA adapter. Open Models → Export. The platform writes the adapter weights, the tokenizer config, and a deploy manifest into data/projects/<id>/exports/. The adapter is ~5-15 MB; the base model is loaded fresh at deploy time.
  2. Deploy via vLLM (or Ollama). The recipe's target_profile is vllm_server by default. The export bundle includes a vLLM launch script:
    cd data/projects/<id>/exports/run-2026-06-05
    ./deploy-vllm.sh
    # Serves the base model + LoRA adapter on localhost:8000
    # Exposes the OpenAI-compatible /v1/chat/completions endpoint
    Ollama variant: ./deploy-ollama.sh. Either gives you a standard chat-completions endpoint on a local port.
  3. Wrap in your own /ask microservice. BrewSLM doesn't expose a hosted /ask endpoint — you wrap the chat-completions endpoint in a thin HTTP service that handles the request shaping (your system prompt, your auth, your rate limiting). For a Slack-bot use case the wrapper is ~30 lines of Python: receive a Slack event, call /v1/chat/completions with the user's question, post the response back to the channel. Same deploy pattern other tutorials use; nothing qa-sft-specific.

Smoke-test in the platform's playground first

Before wiring the model into Slack, open the BrewSLM playground for the project. Ask 10 real internal-IT questions you'd see in #it-help. Check the answers; spot-check for the failure modes from the previous section (paraphrased answers, invented facts on out-of-KB questions). The playground is your last-mile sanity check before the deployed model starts producing answers that look authoritative because they're in Slack.

What's next

You have a deployed direct-answer internal-KB assistant. Three obvious next moves:

Refresh every N weeks when the KB changes
qa-sft memorises facts. When facts change, you retrain. Set a recurring calendar item — every 4-8 weeks for an internal-IT helpdesk. Update the gold set with new canonical answers, re-run the paraphrase playbook on the new rows, retrain. < 30 minutes once you've done it once.
Grow the corpus thoughtfully (don't just dump everything in)
The temptation when the model works is to throw every Confluence page at it. Resist. Each added row is a fact the model has to keep memorised; pile on too many and the memorisation pressure costs accuracy on the original rows. Add new gold rows when you have a real "we got this question N times and didn't have a good answer" signal.
Watch the reroute signal — the honest "you've outgrown qa-sft" tripwire
When your corpus passes ~200 facts OR your KB updates more than once a month, the post-eval decision engine usually fires the reroute-to-RAG recommendation. Accept it when it fires. qa-sft was the right tool when the corpus was small and stable; rag-protocol is the right tool when it isn't. Walk through Tutorial 1 from that point — your gold set is already carried forward.

The next tutorial in this series picks a different recipe shape: invoice field extraction with the span-extraction recipe — structured output with character offsets, JSON-schema gold sets, and the span-set eval handler. Different shape, same end-to-end workflow.

Key terms

qa-sft recipe
BrewSLM recipe that trains a small model to directly answer questions from memorised facts. No retrieval index, no citation marker, no canonical refusal phrase. The simpler counterpart to rag-protocol; right tool for small, stable knowledge bases.
Canonical answer
The single official version of an answer to a question. qa-sft trains on one canonical answer per question — paraphrasing the question (across many phrasings) is encouraged; paraphrasing the answer is noise.
field_match scoring
The eval-handler mode qa-sft uses. Compares prediction to a single expected field via exact_match + F1, with an LLM-judge gate layered on top for semantic equivalence.
LLM-judge pass rate
The gate that actually matters for natural-language QA — a teacher model reads (gold, prediction) and scores whether they say the same thing. Catches paraphrase that exact_match misses; required at ≥ 0.75 in the qa-sft scaffolded eval pack.
Reroute-to-RAG
One-click platform flow that clones the qa-sft project as a RAG-first sibling. The clone has runtime_config.rag_first=True, carries the gold set forward, and builds the BM25 index immediately. The original qa-sft project stays put. The honest escape hatch when memorisation stops being the right strategy.
Memorisation pressure
The trade-off qa-sft makes. Every gold row is a fact the model commits to weights. Up to ~100 facts a small model handles cleanly; past ~200-500 the model starts dropping signal on earlier rows to make room for new ones. That ceiling is the natural reroute-to-RAG trigger.

Check yourself

Answers are saved to this browser.

← All tutorials