Build a PII span tagger with the span-extraction recipe
By the end of this tutorial you'll have a small language model that takes a piece of free-form text — a support transcript, a chat log, an internal email, a training-data row — and emits a list of every PII span it contains, each tagged with its type. A downstream redactor uses that list to scrub, hash, or pseudonymise the spans before the text leaves your network. The tagger runs on a single small GPU (or CPU), costs nothing per call at inference, and unlike sending PII to a frontier API to "redact it for you", it never leaks the data it's trying to protect.
Before you start
This tutorial assumes BrewSLM is running locally at http://localhost:5173 with an admin user signed in. If you haven't done that yet, complete Tutorial 0 — Set up BrewSLM and your first project first. It takes ~15 minutes and is the prerequisite for every tutorial in this track.
This tutorial is the natural companion to Tutorial 3 — Invoice field extraction. Both use the span-extraction recipe. If you've done T3 the shape will feel familiar; the new work is per-entity recall tuning and the explicit-review redaction pattern. If you haven't done T3, you can still follow this one cold — the recipe is reintroduced from scratch.
You'll also want, before you start: ~200 rows of text with annotated PII spans. Crucially these should be synthetic, not real — Faker-generated names, addresses, emails, etc. embedded inside realistic carrier sentences. The "don't train on real PII" callout below is not a stylistic note; it's the rule.
Terms you'll see in this tutorial
- Recipe
- The training-plan template you pick when creating a project. For this tutorial: span-extraction. Defines the base model + adapter + eval pack defaults for structured-span tagging.
- Span
- A contiguous slice of the input text identified by (label, span_start, span_end, value). "alice@example.com" in a 240-character message is one span; if there are three more emails further down, there are four spans total.
- Open-set tagging
- The number of spans per input is unbounded: zero, one, or thirty are all valid. Distinct from T3's closed-set extraction (exactly one "total" per invoice). Open-set tagging puts the weight on recall; closed-set puts it on precision.
- Entity type
- The label attached to a span: EMAIL, PHONE, PERSON, ADDRESS, SSN, CREDIT_CARD, DOB. The choice of vocabulary is your call; pick what your downstream redactor needs to differentiate.
- span_set scoring
- The scoring_mode the span-extraction recipe uses. Treats each row's prediction and gold as sets of spans, computes intersection (true positives), reports span-F1, span-precision, span-recall. The same scorer drives invoice extraction and PII tagging.
- Per-entity recall floor
- An eval gate that requires recall on each entity type to meet a minimum threshold individually. For PII: SSN and credit-card recall floors are tightened to 0.98 because missing one is a compliance incident; PERSON and DOB floors sit at 0.85 because the FP cost on those classes is higher.
- Hard negative (PII flavour)
- A string that looks like a PII span but isn't. "May" in "May 2024" is not a PERSON. "555" in "555 5th Avenue" is not a PHONE. "Pat" in "Pat the dog" is not a PERSON. Hard negatives teach the model to use context, not pattern matching.
- Explicit-review redaction
- The deployment pattern this tutorial recommends: tagger emits spans, downstream pipeline produces a redacted copy AND keeps the original alongside an audit log, redaction is approved by a human (or by policy) before either copy is destroyed. The platform's safety rule is the same: no auto-redaction without an audit trail.
- Faker
- Faker is a Python library that generates realistic-looking but entirely synthetic names, addresses, phone numbers, emails, SSNs, credit-card numbers, dates of birth. It is the canonical source-of-truth for "PII-shaped strings I can legally train on."
This is BrewSLM's canonical workflow for an open-set, multi-occurrence span tagger deployed as a pre-API safety control. The use case is PII detection but the recipe shape generalises: secret-key detection in code, internal-identifier scrubbing in logs, controlled-substance mentions in clinical notes, named-entity tagging for any policy that needs to know where in the text the sensitive content sits. The bones are the same; the entity vocabulary changes.
The end state is a tagger you'd actually deploy at a B2B SaaS company: small enough to sit inline before every outbound API call, accurate enough that the compliance team trusts its output, and honest enough about its limits that nothing gets auto-redacted without leaving a paper trail.
Do not train on real PII. Ever.
This is the recurring compliance trap. The training pipeline persists your data to disk, ships it through synthetic playbooks (which call out to teacher models), and writes evaluation reports that quote rows verbatim. Every one of those steps is a leak surface for real PII. Use Faker-generated synthetic names, addresses, phone numbers, emails, SSNs, and credit cards embedded inside real-shape carrier text. The tagger will learn the shapes just fine; nothing real ever lands in data/projects/. Live testing against your own corpus is fine — but only with a deployed adapter, never as a training input.
What you'll build
An open-set PII tagger. Input is free-form text; output is a JSON array of span objects:
{
"text": "Hi Sarah, please confirm — alice@example.com asked us to wire $1,200 to her on 03/14. Her phone is (415) 555-0142 and SSN ends in 4419. Best, Marco",
"entities": [
{ "label": "PERSON", "span_start": 3, "span_end": 8, "value": "Sarah" },
{ "label": "EMAIL", "span_start": 27, "span_end": 44, "value": "alice@example.com" },
{ "label": "DOB", "span_start": 79, "span_end": 84, "value": "03/14" },
{ "label": "PHONE", "span_start": 99, "span_end": 113, "value": "(415) 555-0142" },
{ "label": "PERSON", "span_start": 142, "span_end": 147, "value": "Marco" }
]
}
The model is a LoRA adapter fine-tuned on top of SmolLM2-135M-Instruct, typically running at 20-80 ms per piece of text on a single GPU and ~100-300 ms on CPU. The output is a strict JSON array. Your downstream redactor consumes it as-is, masks the spans (replace with [EMAIL], hash, or pseudonymise via Faker), and writes both the redacted copy and the original alongside an audit log entry. No PII leaves the network until somebody approves the redaction.
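The masking step a downstream redactor performs can be sketched as follows. This is a minimal illustration, not BrewSLM's redactor: the function names `redact` and `redact_with_audit` and the `approved` field are hypothetical, and the sketch assumes spans do not overlap.

```python
def redact(text, entities):
    """Return a copy of text with each span replaced by a [LABEL] placeholder."""
    redacted = text
    # Work right-to-left so replacing one span never shifts the offsets
    # of the spans that still have to be processed.
    for ent in sorted(entities, key=lambda e: e["span_start"], reverse=True):
        start, end = ent["span_start"], ent["span_end"]
        if text[start:end] != ent["value"]:
            raise ValueError(f"offset mismatch for {ent['label']} at {start}:{end}")
        redacted = redacted[:start] + f"[{ent['label']}]" + redacted[end:]
    return redacted


def redact_with_audit(text, entities):
    """Keep the original next to the redacted copy so a reviewer can approve it."""
    return {
        "original": text,
        "redacted": redact(text, entities),
        "entities": entities,
        "approved": False,  # hypothetical flag: set by a human or by policy
    }


text = "Mail alice@example.com or call (415) 555-0142."
entities = [
    {"label": "EMAIL", "span_start": 5, "span_end": 22, "value": "alice@example.com"},
    {"label": "PHONE", "span_start": 31, "span_end": 45, "value": "(415) 555-0142"},
]
record = redact_with_audit(text, entities)
print(record["redacted"])  # Mail [EMAIL] or call [PHONE].
```

Raising on an offset mismatch, instead of masking whatever sits at those offsets, keeps a drifting tagger from silently redacting the wrong characters.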
Key idea
PII tagging cares about recall on the rare classes. Missing one of three emails is a small problem; missing one SSN is a compliance incident. The eval pack you ship with overrides the default uniform thresholds with per-entity recall floors that are tightest on the rarest, costliest entity types. SSN and credit-card recall floors sit at 0.98; PERSON and DOB at 0.85. That asymmetry is the deployment-readiness signal that matters.
Why a small model (not regex, not a frontier LLM)
Three options exist for PII tagging. Use this comparison:
| Approach | Detection quality | Latency | Cost | Privacy |
|---|---|---|---|---|
| Regex stack (Presidio, custom rules) | Catches well-formed emails / phones / SSNs; misses obfuscation ("alice at example dot com", "fifteen oh three") and over-flags name-shaped words ("May", "Pat", "Will") | <5ms | Free | Self-hosted |
| Frontier LLM via API | Excellent on obfuscation and context | 1.5-3 seconds + queue | $0.003-0.02 per call | You are sending the PII you want to redact to the API. This is the privacy paradox — the cure leaks the disease. |
| Small fine-tuned span tagger (this tutorial) | Good on obfuscation + low FP on common names if you curate hard negatives | 20-80ms | ~$30/month at 10k/hour (one GPU) | Self-hosted. The PII never leaves your network. |
Regex catches well-formed PII but misses the obfuscation any agent who's used a chat product knows about — "my email is alice at example dot com", "phone five five five oh one four two". Worse, regex over-flags: a strict name-list flags "May" in "May 2024", "Pat" in "Pat the dog", "Will" in "Will do." Frontier LLMs handle context well but require you to ship every byte of the data you're trying to protect to a third-party API — the privacy paradox. Small fine-tuned span taggers occupy the gap: they learn the linguistic context that distinguishes "Pat"-the-name from "Pat"-the-verb, and they run on your hardware.
Choose your dataset
You need text with annotated PII spans. Critically: synthetic PII, real carrier text. Three sources to mix:
- Faker-generated synthetic PII (the spine)
- Faker generates names, emails, phones, addresses, SSNs, credit cards, dates of birth that look real but aren't. Write a short Python script that takes real-shape carrier sentences ("Hi {name}, please confirm — {email} asked us to wire $1,200 to {pronoun} on {dob}…") and slots Faker outputs into the placeholders. You know exactly where every span starts and ends because you wrote the template. 100+ Faker-templated rows is the easiest, safest gold-set spine you can build.
- Public NER corpora (warm-up for general types)
- OntoNotes 5.0 ships PERSON, LOC, ORG, DATE tags over news, broadcast, and web text — useful for learning the linguistic context around names and dates without any real-PII exposure. CoNLL-2003 is the canonical NER baseline if you want a 4-class warm-up (PER / LOC / ORG / MISC). WikiNER is a Wikipedia-derived public NER corpus in many languages — useful for multilingual coverage. Map their tag vocabularies into your project's: OntoNotes PERSON → your PERSON, DATE → your DOB or drop.
- Presidio's synthetic test corpus
- Microsoft's Presidio ships a synthetic PII test corpus with realistic carrier text and annotations for 20+ entity types. It's released for exactly the "I can't ship real PII into a training pipeline" reason. Useful as a second seed alongside your Faker spine.
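The Faker-templated spine described above comes down to filling placeholders while tracking where each value lands. The sketch below shows one way to do that with the standard library only; `render_row` and its arguments are hypothetical names, and it assumes templates contain no literal braces other than placeholders.

```python
import re

PLACEHOLDER = re.compile(r"\{(\w+)\}")


def render_row(template, values, labels):
    """Fill {placeholders} and record exact offsets for every labelled value.

    values: placeholder name -> string to insert
    labels: placeholder name -> entity label; placeholders missing from this
            mapping (a pronoun, say) are filled but not tagged
    """
    parts, entities = [], []
    cursor = 0  # current length of the output text
    last = 0    # position in the template after the previous placeholder
    for match in PLACEHOLDER.finditer(template):
        literal = template[last:match.start()]
        parts.append(literal)
        cursor += len(literal)
        name = match.group(1)
        value = values[name]
        if name in labels:
            entities.append({
                "label": labels[name],
                "span_start": cursor,
                "span_end": cursor + len(value),
                "value": value,
            })
        parts.append(value)
        cursor += len(value)
        last = match.end()
    parts.append(template[last:])
    return {"text": "".join(parts), "entities": entities}


row = render_row(
    "Hi {name}, please email {email}.",
    {"name": "Sarah", "email": "alice@example.com"},
    {"name": "PERSON", "email": "EMAIL"},
)
for ent in row["entities"]:
    # Every recorded span slices back to its value by construction.
    assert row["text"][ent["span_start"]:ent["span_end"]] == ent["value"]
```

In practice the `values` dict would be filled from Faker (for example `fake.first_name()` and `fake.email()` on a `Faker()` instance) instead of hard-coded strings; the offsets stay correct either way because they are computed from the inserted value's length.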
Do not use real customer data, even partially redacted
"I'll just remove the last four digits of the SSN before training" leaks the prefix bytes — and the prefix is geographically identifying. "I'll just keep names but anonymise emails" leaks the names. Half-measures do not solve the problem; they only hide it from your future self when the audit comes. Use synthetic PII. Period. Live testing against your own corpus is fine, but only with a trained adapter loaded for inference — never as a training input.
Ingest and map
In BrewSLM, create a new project: Projects → New Project → span-extraction recipe. The recipe pre-fills the adapter (default-canonical), task profile (structured_extraction), scoring mode (span_set), and the eval pack scaffold with span-F1, span-precision, span-recall gates.
Open Data Studio → Import. Your JSONL should look like this — one row per piece of text, entities encoded as a JSON array:
{"text": "Hi Sarah, please confirm — alice@example.com asked us to wire $1,200 to her on 03/14.", "entities": [{"label": "PERSON", "span_start": 3, "span_end": 8, "value": "Sarah"}, {"label": "EMAIL", "span_start": 27, "span_end": 44, "value": "alice@example.com"}, {"label": "DOB", "span_start": 79, "span_end": 84, "value": "03/14"}]}
{"text": "Customer 4532-1289-4419-8821 was charged twice on 2024-03-14. Refund issued to marco.rossi@example.org.", "entities": [{"label": "CREDIT_CARD", "span_start": 9, "span_end": 28, "value": "4532-1289-4419-8821"}, {"label": "DOB", "span_start": 50, "span_end": 60, "value": "2024-03-14"}, {"label": "EMAIL", "span_start": 79, "span_end": 102, "value": "marco.rossi@example.org"}]}
{"text": "Please confirm the export menu doesn't include any sensitive fields — May is fine to handle this.", "entities": []}
Note the third row: "entities": [] is a perfectly valid label. Empty-span rows teach the model that not every piece of text has PII, and that "May" in "May is fine to handle this" is not a PERSON. These are the open-set negative rows; they're as important as the positive rows.
The Data Studio mapping panel shows you a confidence-scored preview of the spans pulled out of three to five rows. Click Apply mapping when the preview looks right.
✓ Checkpoint: the Data Studio Overview now shows your imported row count plus a per-entity-type breakdown ("PERSON: 240 spans across 180 rows, EMAIL: 95 spans across 80 rows, SSN: 18 spans across 18 rows"). The breakdown surfaces immediately whether your gold set is starving any entity type — the rare classes are exactly the ones you'll need to backfill via synth.
Same recipe as the invoice tutorial — different annotation density
If you did Tutorial 3 (Invoice field extraction), this ingestion shape is identical. T3 is closed-set, structured extraction — every invoice has exactly one total, one vendor, one invoice_date; the model's job is to find which span is which. T7 (this tutorial) is open-set, multi-occurrence tagging — a single message can contain zero emails, three emails, or thirty emails, all valid. Same JSON shape, same span_set scorer, but the failure modes look completely different. T3 cares most about precision on the single right answer; T7 cares most about recall on the rare classes.
Cleanup and normalisation
Open Data Studio's Quality & Safety panel. For a span-tagging project the cleanup checks are different from a classification or QA project:
- Unicode normalisation. Smart quotes vs straight quotes, curly apostrophes inside names ("D’Arcy" vs "D'Arcy"), full-width characters from CJK-mixed text. NFC alone only unifies composed and decomposed accents; it does not fold full-width characters or curly quotes. Normalise with NFKC to fold full-width forms, map curly quotes to straight ones explicitly, and do both before you compute span offsets, because normalisation can change string length. If "alice@example.com" is encoded as ASCII in row 1 and with a full-width at sign ("alice＠example.com") in row 50, the model sees two different shapes and your email-class data is split between them.
- Span offset integrity. Every (span_start, span_end) pair must slice into the canonical text field and equal the value. The platform validates this on import; any row where the slice doesn't match its value gets flagged. Fix by re-tokenising or by editing the offsets; never silently drop the row.
- Partial redactions in the source. Text that's already been partially redacted ("Customer [REDACTED] called about order #4419") is poisonous. Either the redaction marker is a span you want tagged (in which case label it consistently), or you exclude those rows. A model trained on text where some PII is real and some is [REDACTED] learns to predict the literal string [REDACTED] as the answer, which is useless on fresh inputs.
- Encoding hygiene. Strip BOMs, normalise whitespace (tabs vs spaces inside addresses), collapse trailing punctuation onto the preceding token where reasonable. Clean text = stable span offsets.
You don't have to clean everything before training, but you do have to clean the rows you're about to promote to gold. The eval pack scores against those rows; the synthetic playbooks seed from them.
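The normalisation and offset-integrity checks above can be run locally before import. The sketch below is an illustration, not the platform's validator: `normalise_text` and `find_offset_errors` are hypothetical helpers, and it assumes compatibility folding (NFKC) is wanted, since NFC on its own leaves full-width characters untouched.

```python
import unicodedata

# Curly quotes have no Unicode decomposition, so they need an explicit map.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # single curly quotes / apostrophe
    "\u201c": '"', "\u201d": '"',   # double curly quotes
})


def normalise_text(text):
    """Strip a leading BOM, fold compatibility forms, straighten quotes.

    Run this BEFORE annotating: NFKC can change the string length,
    which would invalidate offsets computed on the raw text.
    """
    text = text.lstrip("\ufeff")
    text = unicodedata.normalize("NFKC", text)
    return text.translate(QUOTE_MAP)


def find_offset_errors(row):
    """Return a list of problems; an empty list means every span slices cleanly."""
    problems = []
    text = row["text"]
    for i, ent in enumerate(row["entities"]):
        start, end = ent["span_start"], ent["span_end"]
        if not (0 <= start < end <= len(text)):
            problems.append(f"entity {i}: offsets {start}:{end} out of range")
        elif text[start:end] != ent["value"]:
            problems.append(
                f"entity {i}: slice {text[start:end]!r} != value {ent['value']!r}"
            )
    return problems
```

Rows that report problems are candidates for fixing the offsets, in line with the rule above about not dropping them silently.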
Pick the recipe: span-extraction vs classification vs qa-sft
BrewSLM ships three recipes that could plausibly do PII detection. Use this decision tree:
| You want… | Use | Why |
|---|---|---|
| Per-PII-type spans with offsets, so a redactor can mask them | span-extraction | Open-set, multi-occurrence span output; downstream consumer slices the text by offsets |
| A yes/no "this message contains PII" | classification | If you only need a routing flag (PII or not), not the locations, classification is simpler |
| A natural-language explanation of what was found and why | qa-sft | Free-text generation; useful for SOC tooling but not for inline redaction |
| Structured extraction of a fixed-shape document (one total, one vendor) | span-extraction (closed-set) | Same recipe; closed-set just means there's one of each type per row instead of zero-or-many |
For pre-API redaction: span-extraction. If you did the invoice tutorial, this recipe will feel familiar — same shape, different annotation density. Sticking with span-extraction for the rest of this tutorial.
Domain packs (the safety gap)
BrewSLM doesn't ship a PII or safety domain pack out of the box today — the platform's curated packs (legal, support, ecommerce, healthcare) are around content domains, not safety controls. For PII tagging you're operating on platform defaults.
The existing safety eval handler still complements this work. When you wire the tagger into a chat or support surface, the safety handler scores per-turn refusal behaviour and policy compliance independently of the span output. That gives you two distinct guards: the span tagger catches PII inside the text; the safety handler catches policy violations in the surrounding conversation flow.
Building a custom privacy domain pack is a worthwhile follow-up that this tutorial intentionally doesn't cover. It would bundle: tighter per-entity recall floors for SSN/credit-card, an explicit-review redaction policy as a default eval gate, a glossary linking each PII type to the relevant compliance regime (GDPR Art. 4, CCPA §1798.140, HIPAA §164.514), and an Academy tag pointing at this tutorial. If you're shipping span taggers across many projects (PII, secret-key scrubbing, controlled-substance tagging), packaging the conventions as a pack pays back fast.
Build the gold set
The gold set is where the work happens for any span project, but for PII tagging two things make it different from T3 (invoice):
- The PII distribution per row is highly variable — one row has zero PII, the next has fifteen. T3 has exactly one of each type per row.
- Some entity types are much rarer than others. EMAIL and PERSON show up in nearly every row; SSN and credit-card show up in maybe one in fifteen. You need to oversample the rare classes during gold curation.
Path A — manual annotation in the Gold Set workbench
Open Data Studio → Gold Set. The span-tagging mode is the same UI as T3: click-drag to select a span in the text, pick a label from the drop-down, hit save. For each row you add:
- Paste a piece of synthetic-PII-bearing text into the text field. Use carrier sentences that look like real chat / email / support data; Faker-fill the PII placeholders.
- Click-drag every PII span. Label each one with the entity type from your vocabulary.
- Save. The workbench re-renders the text with the annotated spans highlighted; verify the slice matches the value before moving on.
Spend the first 60 minutes hand-annotating ~80 rows. Make sure every entity type shows up at least 10 times in this batch — if SSN only shows up once, the model never learns the linguistic context around SSNs.
Path B — LLM-assisted promotion using Presidio as the teacher
For larger imports, run Presidio out-of-band as a span pre-labeller and import the results as candidate rows — Presidio is a separate library, not a BrewSLM teacher backend, but its regex engine catches the well-formed cases reliably and that's a great starting point:
- Bulk-import a few hundred unlabeled text rows (synthetic carrier text with Faker-filled PII).
- Run Presidio on each row in a separate Python script. Take its (entity_type, start, end) tuples and write them out as candidate rows in the platform's canonical entity-JSON shape.
- Re-import those candidate rows as a fresh dataset. They land with normal pending status; review one entity-type cluster at a time. The Quality & Safety panel's per-entity-type breakdown groups them so you can rip through all the EMAIL candidates in one pass, then all the PHONE candidates, etc.
This compresses 4 hours of manual annotation into ~45 minutes of review. The trade-off is the one the platform's safety rule names: Presidio will mis-label obfuscated PII, will miss novel phone formats, will mistakenly flag dates as DOBs when they're invoice dates. You have to actually look at every row before accepting it.
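The conversion step in Path B might look like the sketch below. It takes plain (entity_type, start, end) tuples so it runs without Presidio installed; in a real script those tuples would come from the `entity_type`, `start`, and `end` attributes of the results returned by Presidio's `AnalyzerEngine().analyze(text=..., language="en")`. The mapping table and function names are assumptions for illustration, and DATE_TIME is deliberately left unmapped because of the invoice-date problem mentioned above.

```python
import json

# Presidio entity names -> this project's vocabulary (an assumed mapping).
PRESIDIO_TO_PROJECT = {
    "EMAIL_ADDRESS": "EMAIL",
    "PHONE_NUMBER": "PHONE",
    "PERSON": "PERSON",
    "US_SSN": "SSN",
    "CREDIT_CARD": "CREDIT_CARD",
}


def to_candidate_row(text, detections, label_map=PRESIDIO_TO_PROJECT):
    """Turn (entity_type, start, end) tuples into the entity-JSON row shape."""
    entities = []
    for entity_type, start, end in sorted(detections, key=lambda d: d[1]):
        label = label_map.get(entity_type)
        if label is None:
            continue  # unmapped types are left for the human reviewer
        entities.append({
            "label": label,
            "span_start": start,
            "span_end": end,
            "value": text[start:end],  # derived from the text, never typed by hand
        })
    return {"text": text, "entities": entities}


def to_jsonl_line(row):
    """Serialise one row for a JSONL import file."""
    return json.dumps(row, ensure_ascii=False)


text = "Reach me at alice@example.com"
row = to_candidate_row(text, [("EMAIL_ADDRESS", 12, 29), ("DATE_TIME", 0, 5)])
print(to_jsonl_line(row))
```

Deriving `value` from the slice means a candidate row can never fail the offset-integrity check; whether the span itself is correct is still a review decision.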
✓ Checkpoint: the Data Studio Overview's Gold Set ready row should now show your gold count with a per-entity breakdown ("200 gold rows · EMAIL ×140, PERSON ×210, PHONE ×95, ADDRESS ×52, SSN ×24, CREDIT_CARD ×18, DOB ×120"). The amber state on the rare classes is the signal — SSN and credit-card at 18-24 spans is too thin; the synth step will backfill those.
Don't skip the empty-entities rows
Your gold set must contain rows where the answer is "entities": []. Add ~20% of your gold as empty-entity rows — text that obviously contains no PII, plus text that contains hard-negative tokens that look like PII but aren't (May/Pat/Will-as-words, 555 in non-phone context, dates-that-aren't-DOBs). Without empty-entity rows the model never learns when NOT to tag, and your false-positive rate climbs into the unusable range.
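To hit the ~20% empty-entity share, it helps to work out how many empty rows to add, because each added row also grows the total. If a gold set has N rows of which e are empty, adding n empty rows reaches a target share t when (e + n) / (N + n) >= t, which gives n >= (t*N - e) / (1 - t). A small sketch (the function name is hypothetical):

```python
import math


def empty_rows_needed(total_rows, empty_rows, target_share=0.20):
    """Smallest number of empty-entity rows to add to reach target_share."""
    if not 0 <= target_share < 1:
        raise ValueError("target_share must be in [0, 1)")
    needed = (target_share * total_rows - empty_rows) / (1 - target_share)
    return max(0, math.ceil(needed))


# 200 gold rows, 10 of them empty: add 38 to reach 48 / 238, just over 20%.
print(empty_rows_needed(200, 10))
```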
Stratified split by document type
BrewSLM's Prepare Dataset panel produces a random split off the canonical config (train_ratio / val_ratio / test_ratio / seed). For PII tagging, random splitting can starve a rare entity type entirely from val/test (a 0.95 SSN-recall claim is meaningless if test has no SSN spans). Two things to do before import to get a sensible split:
- Pre-partition by document type. Chat messages, internal emails, support transcripts, and ticket comments all have different PII densities. Sort your rows by document_type and break them into train.jsonl / val.jsonl / test.jsonl so each split contains a proportional mix of types, then import the three files as three datasets.
- Verify rare-class coverage per split. Count SSN / CREDIT_CARD spans in each split before training. If your val.jsonl has zero SSNs, hand-move a few rows over from train.jsonl. The Quality & Safety panel surfaces the per-entity counts once you've imported — eyeball them before you click Run prepare.
The default ratios are 80% train / 10% validation / 10% test. For 300 gold rows that's 240 train / 30 val / 30 test, which is enough to detect the obvious failure modes provided rare classes appear in every split. For deployment-readiness you'll also want a separate realistic eval set — see the Evaluate section for how this differs from the random-split test.
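The rare-class coverage check can be scripted before import. This is a minimal sketch with hypothetical function names, assuming each split has already been loaded as a list of rows in the entity-JSON shape.

```python
from collections import Counter


def span_counts(rows):
    """Count spans per entity label across a list of rows."""
    return Counter(ent["label"] for row in rows for ent in row["entities"])


def missing_coverage(splits, required_labels, min_spans=1):
    """Map split name -> labels that have fewer than min_spans spans."""
    gaps = {}
    for name, rows in splits.items():
        counts = span_counts(rows)
        short = [label for label in required_labels if counts[label] < min_spans]
        if short:
            gaps[name] = short
    return gaps


ssn_row = {"text": "SSN 078-05-1120", "entities": [
    {"label": "SSN", "span_start": 4, "span_end": 15, "value": "078-05-1120"}]}
empty_row = {"text": "May is fine to handle this.", "entities": []}

splits = {"train": [ssn_row, empty_row], "val": [empty_row], "test": [ssn_row]}
print(missing_coverage(splits, ["SSN"]))  # {'val': ['SSN']}
```

A non-empty result is the cue to hand-move rows between files until every split covers every rare class.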
Generate synthetic drills
The span-extraction recipe ships three playbooks. For PII tagging all three matter, but the order is non-obvious:
- span_extraction_hard_negatives — the precision-defender drill
- Generates rows where the carrier text contains look-alike strings that should not be tagged. The model is forced to use linguistic context to discriminate. Examples this playbook produces:
- "May 2024 revenue was up 12%" — "May" is not a PERSON
- "555 5th Avenue, suite 12" — "555" is not a PHONE
- "Pat the dog, then check the file" — "Pat" is not a PERSON
- "Will do — see you tomorrow" — "Will" is not a PERSON
- "Customer ID 4532-1289-4419-8821 was charged" — the digits look like a credit card BUT the carrier sentence labels it as a customer ID, not a card; the right answer depends on your policy. Decide once and be consistent.
- span_extraction_paraphrase — the carrier-text variation drill
- Holds the PII spans constant and varies the surrounding carrier text. "Hi {name}, please email {email}" becomes "Could you reach out to {name} at {email}?" becomes "Forward this to {name} ({email})." Same PII, different registers, different punctuation. Goal: the tagger learns to find PII regardless of how the sentence around it is phrased. Generate ~50 rows seeded from your hand-curated gold.
- span_extraction_cluster_targeted — the gap-filler
- If the failure-clusters panel after a first eval round shows the model dropping a specific shape (e.g. "missed obfuscated emails" or "missed phones in 555-555-5555 format"), this playbook generates more examples targeted at that cluster. Generate ~30 rows per cluster you want to backfill. Optional — run only after you've seen the eval failure modes.
Open Data Studio → Synthetic → Playbook Center. The span-extraction recipe surfaces three playbook cards; click span_extraction_hard_negatives first, set target count to 60, pick a backend (Ollama is the free default; OpenAI / Anthropic also work if you have keys). Generation runs as a background Job; the notification bell tracks progress.
Hard negatives first, paraphrase second, cluster-targeted last
Run span_extraction_hard_negatives first — that's where your precision lives. Review the output, then run span_extraction_paraphrase as a recall extender. Hold span_extraction_cluster_targeted in reserve for the second training round, once the eval has surfaced which entity type or carrier-shape is failing. Doing them in the wrong order means you'll be reviewing easy paraphrases while the hard negatives that actually move precision haven't been generated yet.
Faker stays in the loop for the synth step too. The platform's hard-negatives playbook ships a generic prompt — for PII work specifically, customise the prompt (the playbook's prompt text lives in the project's synth config) to tell the teacher to slot Faker-generated name/email/phone strings into the look-alike carrier text rather than inventing strings. That keeps the synth data PII-free at the source even when the teacher is a third-party API.
Review the synth queue
Every generated row lands in the Synthetic Review Queue with review_status="pending". For span-tagging the per-row action is more nuanced than for classification — each row has zero-or-many spans, and any one of them can be wrong. The platform groups rejected rows by reason; use these tags:
- false-positive: the synth produced a span that shouldn't be tagged at all (e.g. tagged "May" as PERSON). Soft-reject; useful as a hard-negative seed for the next playbook round.
- false-negative: the synth row contains a PII span that wasn't tagged (e.g. an email in the carrier text that the teacher missed). Soft-reject; flag for hand-fixup.
- wrong-label: the span was found, but tagged as the wrong entity type (e.g. an email got tagged as PERSON because the carrier sentence said "Email Sarah at sarah@example.com" and the teacher conflated). Soft-reject; tag for review.
- offset-drift: the span's start/end don't match its value (e.g. span_start is off by one because the teacher counted a leading space). Soft-reject; the platform's import validator usually catches this but synth output sometimes slips through.
Per-row actions:
- Accept: the row joins training on the next dataset prep run. For PII specifically, only accept if you've eyeballed every span in the row. A row with three correct spans and one wrong-label span is a poisoned row; either fix the wrong label inline or soft-reject the whole row.
- Reject (soft): the row stays on disk with the reason tag, available for audit.
- Purge: a reason-grouped bulk delete. Once a category is genuinely bad (say all 12 offset-drift rows are wrong), select the reason group and bulk-purge. The platform's "rejected rows are selectable + bulk-droppable" pattern matters here: purging is never all-or-nothing.
Expect to reject 30-50% of generated hard-negative rows on the first pass. The acceptance rate climbs as you tune the playbook prompt to spell out your labelling conventions for the teacher.
Training configuration
Open Training → New Experiment. The span-extraction recipe defaults are sensible:
- Base model
- HuggingFaceTB/SmolLM2-135M-Instruct. Small, instruction-tuned, runs on consumer hardware and emits the JSON-array output cleanly. Alternative: Qwen/Qwen2.5-0.5B-Instruct for slightly better quality on longer carrier text (chats over ~200 tokens).
- Adapter
- LoRA, rank 16, alpha 32, target modules q_proj, k_proj, v_proj, o_proj. Standard for SmolLM2 on structured-output tasks.
- Learning rate
- 2e-4. Same as T1/T2; LoRA tolerates this rate.
- Epochs
- 4. Span tagging usually needs one more epoch than classification because the model is learning a structured JSON output, not a single label.
- Batch size + gradient accumulation
- Batch 4, accumulate 4 → effective batch 16. PII carrier text tends to be 100-300 tokens; the per-step memory footprint fits well under 8 GB.
Expected runtime: 8-20 minutes on a single GPU (RTX 3060 or better), 20-45 minutes on CPU. The training panel shows live loss + a sparkline + the kill switch; if loss isn't dropping after the first 100 steps, kill the run and check your data — usually it's a span-offset integrity issue that the import validator didn't catch.
✓ Checkpoint: in the Training tab, your experiment row shows a live sparkline that drops from ~2-3 in the first few steps down to ~0.4-0.6 by the end. The bell shows a "training" notification with a percentage. When complete, the experiment row turns green and the experiment detail page shows the final loss + span-F1 on the validation set + a "Run evaluation" button.
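As a sanity check on the batch and accumulation settings above, the arithmetic below estimates how many steps a run contains. It is a rough sketch: `TrainPlan` is a hypothetical helper, not a BrewSLM object, and this tutorial does not state whether the training panel's step counter counts micro-batches or optimizer updates, so both are computed.

```python
import math
from dataclasses import dataclass


@dataclass
class TrainPlan:
    train_rows: int
    batch_size: int = 4   # per-step micro-batch
    grad_accum: int = 4   # micro-batches per optimizer update
    epochs: int = 4

    @property
    def effective_batch(self):
        return self.batch_size * self.grad_accum

    @property
    def micro_steps(self):
        """Forward/backward passes over the whole run."""
        return math.ceil(self.train_rows / self.batch_size) * self.epochs

    @property
    def optimizer_steps(self):
        """Weight updates over the whole run."""
        return math.ceil(self.train_rows / self.effective_batch) * self.epochs


plan = TrainPlan(train_rows=240)
print(plan.effective_batch, plan.micro_steps, plan.optimizer_steps)  # 16 240 60
```

With only the 240-row gold training split, the whole run is 60 optimizer updates, so a "no improvement after 100 steps" check is only meaningful once accepted synthetic rows have grown the training set or if steps are counted as micro-batches.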
Read the trainability forecast
Before kicking off a real training run, the platform pre-computes a trainability forecast: given your current data + gold set + base model, what's the predicted F1 / pass rate? The goal ledger on the Data Studio overview shows it as the predicted_pass row.
For PII tagging — a high-recall use case — you want:
- Predicted pass probability ≥ 70%. Lower means your rare-class data is too thin; back-fill SSN / credit-card / DOB before training.
- Data ready = met with every entity type ≥ 30 spans in the gold set. Below that, the model can't learn the linguistic context for the starving class.
- Goal ledger's gold_set row at ≥ 100%. The span-extraction scaffold defaults to 100 gold rows; for PII with 6-7 entity types you'll want 200-300 to give the rare classes enough signal.
If the forecast is below 50%, the eval is almost certainly going to fail the per-entity recall floors. Spend the extra hour back-filling rare-class rows instead of training. The single biggest predictor of a successful PII tagger is per-entity coverage in the gold set, not the training hyperparameters.
Evaluation: per-entity recall floors are the point
After training, the platform automatically evaluates against the project's eval pack. The span-extraction scaffold ships with these default gates:
- min_span_set_f1 ≥ 0.65
- Headline F1 averaged across all entity types and rows. The starting threshold for a usable tagger; below this the model is broken at the basic shape.
- min_span_set_precision ≥ 0.70
- Across all entity types, fraction of predicted spans that match a gold span (with exact offset + label).
- min_span_set_recall ≥ 0.60
- Across all entity types, fraction of gold spans that the model recovered. The starting threshold; for PII you'll override this per-entity (see below).
- safety pass rate (optional)
- Hooks the existing safety eval handler; useful when the tagger is deployed alongside a chat surface where the safety handler also runs.
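The span-set gates above rest on a simple set comparison. The sketch below is an approximation of that idea, not the platform's scorer: it assumes a span matches only on exact (label, span_start, span_end) and micro-averages the counts over all rows, which may differ from how the eval handler aggregates.

```python
def span_key(entity):
    """Identity of a span for matching: exact label and exact offsets."""
    return (entity["label"], entity["span_start"], entity["span_end"])


def span_set_scores(gold_rows, pred_rows):
    """Micro-averaged precision, recall and F1 over paired gold/pred rows."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_rows, pred_rows):
        gold_set = {span_key(e) for e in gold["entities"]}
        pred_set = {span_key(e) for e in pred["entities"]}
        tp += len(gold_set & pred_set)   # predicted and correct
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = [{"entities": [
    {"label": "PERSON", "span_start": 3, "span_end": 8},
    {"label": "EMAIL", "span_start": 27, "span_end": 44},
]}]
pred = [{"entities": [
    {"label": "PERSON", "span_start": 3, "span_end": 8},
    {"label": "EMAIL", "span_start": 27, "span_end": 43},  # off by one: no match
]}]
print(span_set_scores(gold, pred))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```

Under exact matching, an off-by-one boundary counts as both a false positive and a false negative, which is why offset drift in training data is so costly.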
For PII tagging, override the uniform recall floor with per-entity floors via a custom eval pack — copy the scaffolded pack JSON, add gates that reference per-entity recall/precision values from the span-extraction eval handler (e.g. recall.SSN, precision.PERSON), and select your custom pack from the project's eval pack picker. A starting policy:
| Entity type | Min recall | Min precision | Why |
|---|---|---|---|
| SSN | 0.98 | 0.90 | Missing one SSN is a compliance incident; false-positives are cheap (one false flag, one human reviews, easy) |
| CREDIT_CARD | 0.98 | 0.90 | PCI scope. Same reasoning. |
| EMAIL | 0.92 | 0.92 | High frequency, well-formed; both sides should be tight |
| PHONE | 0.90 | 0.85 | Format variation hurts recall; OK to err on the side of over-flagging "555" strings |
| ADDRESS | 0.85 | 0.80 | Span boundaries are inherently fuzzy ("123 Main St" vs "123 Main St, Apt 4B"); allow more slack |
| PERSON | 0.85 | 0.85 | High false-positive cost (May / Pat / Will as words); precision floor matters more than recall here |
| DOB | 0.85 | 0.80 | Date format ambiguity (is "03/14" a DOB or an invoice date?) limits achievable precision |
The goal ledger's eval_pass_rate row expands into the per-gate breakdown so you can see exactly which entity type is failing which floor — "EMAIL recall 0.94 / ≥ 0.92 passed, SSN recall 0.91 / ≥ 0.98 FAILED, PERSON precision 0.81 / ≥ 0.85 FAILED."
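The per-entity policy can also be expressed as a small check you run on exported metrics. This sketch is illustrative only: the `FLOORS` dict restates the starting policy from the table, while `failed_gates` and the shape of the `metrics` input are assumptions, not the eval pack's JSON schema.

```python
# label: (min_recall, min_precision), taken from the starting policy table
FLOORS = {
    "SSN": (0.98, 0.90),
    "CREDIT_CARD": (0.98, 0.90),
    "EMAIL": (0.92, 0.92),
    "PHONE": (0.90, 0.85),
    "ADDRESS": (0.85, 0.80),
    "PERSON": (0.85, 0.85),
    "DOB": (0.85, 0.80),
}


def failed_gates(metrics, floors=FLOORS):
    """Return (label, metric, observed, floor) for every gate that fails.

    metrics: label -> {"recall": float, "precision": float}
    A label with no metrics at all fails, because an unmeasured
    entity type cannot be claimed as passing.
    """
    failures = []
    for label, (min_recall, min_precision) in floors.items():
        observed = metrics.get(label)
        if observed is None:
            failures.append((label, "missing", None, None))
            continue
        if observed["recall"] < min_recall:
            failures.append((label, "recall", observed["recall"], min_recall))
        if observed["precision"] < min_precision:
            failures.append((label, "precision", observed["precision"], min_precision))
    return failures


metrics = {
    "SSN": {"recall": 0.91, "precision": 0.95},
    "CREDIT_CARD": {"recall": 0.99, "precision": 0.93},
    "EMAIL": {"recall": 0.94, "precision": 0.95},
    "PHONE": {"recall": 0.92, "precision": 0.90},
    "ADDRESS": {"recall": 0.88, "precision": 0.84},
    "PERSON": {"recall": 0.90, "precision": 0.81},
    "DOB": {"recall": 0.86, "precision": 0.82},
}
for failure in failed_gates(metrics):
    print(failure)
# ('SSN', 'recall', 0.91, 0.98)
# ('PERSON', 'precision', 0.81, 0.85)
```

Treating a missing entity type as a failure mirrors the split advice earlier: a recall claim for a class with no test spans is not evidence.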
Why recall matters more than precision for PII
If your tagger has 0.95 recall and 0.95 precision on SSNs, you miss 5% of the real SSNs, and 5% of the spans you flag as SSNs are not SSNs at all. Those are not equally bad. The false positives just send a non-SSN to the redactor's review queue, where one human glances at it and approves. The false negatives let real SSNs through to the third-party API: that's a compliance incident, a breach notification, a hit on your trust report. For PII tagging the asymmetry is real and large; bias your gates toward recall, especially on the rarest, costliest entity types.
When the eval fails
Common PII-specific failure patterns and the fix for each:
| Symptom | Root cause | Fix |
|---|---|---|
| EMAIL recall 0.94 on well-formed emails, 0.20 on obfuscated ones ("alice at example dot com") | Gold set is dominated by well-formed Faker emails; the model never learned obfuscation | Generate 30+ obfuscated-email rows via span_extraction_cluster_targeted. Phrasings: "at … dot …", "[at]", "(at)", spaces around the @, full-words ("at gmail dot com"). |
| PERSON precision 0.78 — model flags May, Pat, Will, Iris in non-name contexts | Hard negatives too few — model treats every Title-Case word as a PERSON | Run span_extraction_hard_negatives with 80+ rows targeting common-name-as-word cases. Each generated row puts a name-shaped token in a non-name context. |
| PHONE recall 0.92, DOB recall 0.55 | Date format ambiguity — the model is confident on phones but DOBs are everywhere and look like other dates | Add more DOB-disambiguating context to gold. "born on 03/14/1989" → DOB; "invoice dated 03/14/2024" → not DOB. The carrier sentence shape is the signal. |
| Model conflates ORG and PERSON ("Acme Corp" tagged as PERSON) | Title-case proper-noun bias; ORG was not in your label vocabulary so the model overloads PERSON | Either add ORG to your vocabulary explicitly OR add hard-negative gold rows that pair organisation names with the empty-entity label. |
| F1 strong on test set, recall drops on the realistic eval set | Train/test data is Faker-templated and looks too uniform; production text has more noise | Capture realistic-shape unlabeled rows from your sanitised corpus, hand-label 30, evaluate against that as a separate held-out set BEFORE shipping |
| Every entity type fails the recall floor by 5-10 points | Gold set too small overall; per-entity row counts under 30 | Back-fill with more Faker-templated gold across all types before running another training round. The forecast row in the goal ledger flagged this — the eval is just confirming. |
When the platform's post-eval decision engine surfaces a failure cluster, expand the "Why this fired?" disclosure on each signal. You'll see the actual span examples ("PERSON-precision: 18 false positives, sample tokens: May, Pat, Will, Iris, Hope") rather than just the recommendation verb. Use that cluster to seed the next round of span_extraction_cluster_targeted.
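When you hand-write extra gold rows for a failure cluster (for example the DOB disambiguation fix in the table above), the main risk is getting the offsets wrong. The sketch below builds paired positive and hard-negative rows with the standard library only. The row layout (`text` plus a `spans` list), the carrier sentences, and the end-exclusive offset convention are assumptions for illustration; match them to whatever your gold set actually uses.

```python
import random

# Carrier sentences; "{date}" marks where the date string is substituted.
DOB_CARRIERS = [
    "The patient was born on {date} in Leeds.",
    "DOB: {date}, per the intake form.",
]
NON_DOB_CARRIERS = [
    "Invoice dated {date} is still unpaid.",
    "The meeting moved to {date}.",
]


def make_row(carrier: str, date: str, is_dob: bool) -> dict:
    """Build one gold row with [span_start, span_end) character offsets (assumed)."""
    start = carrier.index("{date}")
    text = carrier.replace("{date}", date, 1)
    spans = []
    if is_dob:
        spans.append({"label": "DOB", "span_start": start,
                      "span_end": start + len(date), "value": date})
    return {"text": text, "spans": spans}  # an empty list marks a hard negative


def random_date(rng: random.Random) -> str:
    """MM/DD/YYYY string; days stop at 28 so every month is valid."""
    return f"{rng.randint(1, 12):02d}/{rng.randint(1, 28):02d}/{rng.randint(1950, 2025)}"


def make_rows(n_pairs: int, seed: int = 0) -> list[dict]:
    """Return n_pairs DOB rows interleaved with n_pairs non-DOB rows."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_pairs):
        rows.append(make_row(rng.choice(DOB_CARRIERS), random_date(rng), True))
        rows.append(make_row(rng.choice(NON_DOB_CARRIERS), random_date(rng), False))
    return rows
```

Both row types draw dates from the same range on purpose, so the only thing separating a DOB from an invoice date is the carrier sentence.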
Ship as a pre-API redactor — explicit review required
Once the eval pack passes (with the per-entity recall floors satisfied), ship the tagger inside an explicit-review redaction pipeline. Three steps:
- Export the LoRA adapter. Open Models → Export. The platform writes the adapter weights, the tokenizer config, and a deploy manifest into data/projects/<id>/exports/. The adapter is ~5-15 MB.
- Deploy via vLLM (or Ollama). The recipe's target_profile is vllm_server:

```bash
cd data/projects/<id>/exports/run-2026-06-04
./deploy-vllm.sh
# Loads the tagger on localhost:8000.
# POST /tag with { "text": "..." } returns { "entities": [ { "label": "...", "span_start": ..., "span_end": ..., "value": "..." } ] }
# Latency: 20-80ms on a single GPU, 100-300ms on CPU.
```

  Ollama variant: ./deploy-ollama.sh.
- Wire it as a pre-API redactor with audit logging. The tagger's output drives a downstream redactor that:
  - Produces a redacted copy of the text: spans replaced with type-named placeholders ([EMAIL_1], [PERSON_1], [SSN_1]) or hashed / Faker-pseudonymised.
  - Keeps the original alongside the redacted copy, with an audit-log entry that records: timestamp, source surface, entity-type counts, the redacted token map, and the operator/policy that authorised the redaction.
  - Never auto-destroys the original without a human or a policy gate having reviewed the audit entry. The retention window for originals is your compliance team's call (typically 30-90 days); after that, the originals are purged but the audit entries persist.
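The redactor described in the third step can be sketched in a few lines. This is a minimal illustration, not the platform's redactor: the function names and the audit-entry layout are assumptions, and it assumes the tagger returns non-overlapping spans with end-exclusive offsets. The entity keys (`label`, `span_start`, `span_end`, `value`) follow the `/tag` response shape shown above.

```python
from collections import Counter
from datetime import datetime, timezone


def redact(text: str, entities: list[dict]) -> tuple[str, dict[str, str]]:
    """Replace each span with a type-named placeholder such as [EMAIL_1].

    Assumes non-overlapping spans with [span_start, span_end) offsets.
    """
    token_map: dict[str, str] = {}
    counters: Counter = Counter()
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["span_start"]):
        counters[ent["label"]] += 1
        placeholder = f"[{ent['label']}_{counters[ent['label']]}]"
        token_map[placeholder] = text[ent["span_start"]:ent["span_end"]]
        pieces.append(text[cursor:ent["span_start"]])  # untouched text before the span
        pieces.append(placeholder)
        cursor = ent["span_end"]
    pieces.append(text[cursor:])  # tail after the last span
    return "".join(pieces), token_map


def audit_entry(source: str, authorised_by: str,
                entities: list[dict], token_map: dict[str, str]) -> dict:
    """Collect the audit fields named in the list above (layout is illustrative)."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source_surface": source,
        "entity_type_counts": dict(Counter(e["label"] for e in entities)),
        "redacted_token_map": token_map,
        "authorised_by": authorised_by,
        # Flipped only after a human or policy gate has reviewed this entry.
        "original_destroyed": False,
    }
```

Note that the token map holds the original values, so an audit store built this way needs the same access controls as the originals it describes.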
Auto-redaction without an audit log is the anti-pattern
The temptation, especially when this ships, is to wire the tagger directly into the outbound pipeline and have it redact-and-destroy in one step. Don't. Every false-positive your tagger emits permanently corrupts a piece of text that may have been important — a customer's name in a support ticket gets replaced with [PERSON_1] and the original is gone. Worse, every false-negative is silent; you never know the tagger missed a span until somebody downstream finds it. The audit log is what makes the system auditable: when the compliance team asks "what did you redact last quarter?", you can answer with examples. When they ask "are you sure you didn't miss anything?", you can answer with the false-negative rate from your held-out realistic eval set. This pattern is the platform's safety rule for this surface — explicit review before destruction.
Smoke-test in the playground. Open Playground in the platform. Paste 10 realistic-shape (but synthetic) text samples; check that each PII span gets the right label and the right offsets, and that the empty-entity rows return empty. The per-turn provenance footer (which adapter served the reply, the latency, the per-entity confidence breakdown) is your sanity check before you wire the tagger into a production redaction pipeline.
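The "right offsets" part of the smoke test is easy to automate once you have responses in hand. The helper below is a hypothetical check, assuming end-exclusive offsets and the entity keys from the `/tag` response shape; it flags any entity whose offsets do not slice out its reported value.

```python
def offset_errors(text: str, entities: list[dict]) -> list[str]:
    """Describe every entity whose offsets do not slice out its value."""
    errors = []
    for ent in entities:
        start, end = ent["span_start"], ent["span_end"]
        if not (0 <= start < end <= len(text)):
            errors.append(f"{ent['label']}: offsets {start}-{end} out of range")
        elif text[start:end] != ent["value"]:
            errors.append(
                f"{ent['label']}: text[{start}:{end}] is {text[start:end]!r}, "
                f"expected {ent['value']!r}"
            )
    return errors
```

An empty list means every span lines up. A consistent off-by-one across all entities usually means the offset convention differs from the one assumed here, not that the model is wrong.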
What's next
You have a deployed PII tagger sitting in front of your outbound API surface, with per-entity recall floors that match the cost of each entity type, and an explicit-review redaction pipeline that never destroys an original without an audit trail. Three obvious next moves:
- Extend the entity vocabulary
- Add passport numbers, IBAN, bank routing numbers, custom enterprise IDs (employee IDs, account numbers, internal customer IDs). Faker generates many of these directly; for the enterprise-specific ones, write a small synthetic generator that mirrors your real ID format and use it as a Faker substitute.
- Tune per-compliance-regime
- GDPR, CCPA, and HIPAA have different definitions of PII / PHI. GDPR can treat IP addresses as personal data; CCPA's "personal information" also covers data linked to a household; HIPAA's Safe Harbor method lists 18 identifier types. Build one trained adapter per regime if you serve customers under different jurisdictions, OR build one adapter with the union of all entity types and apply per-regime filtering at the redactor layer. The latter is cheaper to maintain.
- Retrain quarterly as new PII shapes emerge
- New chat-style obfuscations, new phone formats (international expansion), new internal identifier shapes. Capture failure-cluster examples from production (audit log review surfaces these for free), promote them into the gold set, retrain when gold grows by ~100 rows. The post-eval decision engine will flag when the new data has shifted the optimal architecture; usually it hasn't — span-extraction is the right shape for the long run.
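The union-adapter option under "Tune per-compliance-regime" above comes down to a lookup at the redactor layer. The sketch below shows the shape of that filter; the regime names and label sets are placeholders, not legal guidance, and the real mapping is something your compliance team should own.

```python
# Placeholder regime names and label sets, for illustration only.
REGIME_LABELS: dict[str, set[str]] = {
    "regime_a": {"EMAIL", "PHONE", "PERSON", "IP_ADDRESS"},
    "regime_b": {"SSN", "DOB", "PERSON"},
}


def filter_for_regime(entities: list[dict], regime: str) -> list[dict]:
    """Keep only the entities whose label is in scope for the given regime."""
    allowed = REGIME_LABELS[regime]  # raises KeyError for an unknown regime
    return [ent for ent in entities if ent["label"] in allowed]
```

Failing with a `KeyError` on an unknown regime is intentional: silently redacting nothing would be the worse failure.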
That's the tutorials series mainline — recipe-grounded, end-to-end, deployment-ready. For more tutorials covering other recipes (code review, summarisation, generic SFT), head back to the tutorials hub.
Key terms
- span-extraction recipe
- BrewSLM recipe that trains a small model to find and label spans inside text. Output is a JSON array of (label, span_start, span_end, value) objects. Same recipe powers invoice extraction (closed-set, one of each type) and PII tagging (open-set, zero-or-many of each type).
- Open-set vs closed-set tagging
- Open-set: the number of spans per row is unbounded (PII). Closed-set: a fixed shape of one-of-each-type (invoice extraction). Same scorer; different failure-mode profile.
- Per-entity recall floor
- An eval gate that requires recall on each entity type to meet a minimum threshold individually. For PII: SSN and credit-card sit at 0.98 because missing one is a compliance incident; PERSON and DOB at 0.85 because the FP cost on those classes is higher.
- Hard negative (PII flavour)
- A string that looks like a PII span but isn't. "May" in "May 2024", "555" in "555 5th Avenue", "Pat" in "Pat the dog". Hard negatives teach the model linguistic context, not just pattern matching.
- Explicit-review redaction
- The deployment pattern this tutorial recommends: tagger emits spans, downstream pipeline produces both a redacted copy and an audit log, no original is destroyed without a human or policy gate having reviewed the audit entry. Auto-redaction without an audit log is the anti-pattern.
- Faker
- Python library that generates realistic-looking but synthetic PII. The canonical source-of-truth for "I need PII-shaped strings I can legally train on."
Check yourself