Engineering · Dataset pipeline
Why the schema introspector beats hand-written converters
We shipped a 175-line BIO-to-spans converter for the Kaggle PII competition. Then we threw it away and built a column sniffer. The sniffer is now the only path. Here's why.
The original converter was correct, careful, and a dead end
The first PII demo we shipped depended on a script called
kaggle_pii_to_brewslm.py. It read the Kaggle competition's
train.json (BIO-tagged tokens + labels per essay), reconstructed character
offsets by aligning against full_text when present and falling back to a
tokens-plus-trailing-whitespace heuristic when not, merged B-X /
I-X runs into single spans, and mapped the Kaggle tag vocabulary
(NAME_STUDENT, EMAIL, USERNAME,
ID_NUM, etc.) onto BrewSLM's canonical entity types. 175 lines, pure
stdlib, well-tested.
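For reference, the load-bearing part of that converter fits in a screenful. Here's a minimal sketch of the B-X / I-X merge under the fallback offset heuristic (one space after each token); the function name and signature are illustrative, not the script's actual internals.

```python
def bio_to_spans(tokens, tags, trailing_ws=None):
    """Merge B-X/I-X runs into (start, end, label) character spans.

    Falls back to the one-space-per-token heuristic when no
    trailing-whitespace information is available.
    """
    if trailing_ws is None:
        trailing_ws = [True] * len(tokens)
    spans, offset, current = [], 0, None
    for tok, tag, ws in zip(tokens, tags, trailing_ws):
        start, end = offset, offset + len(tok)
        if tag.startswith("B-"):
            if current:
                spans.append(tuple(current))
            current = [start, end, tag[2:]]
        elif tag.startswith("I-") and current and tag[2:] == current[2]:
            current[1] = end  # extend the open run
        else:
            # "O", or a stray I- tag with no matching open run
            if current:
                spans.append(tuple(current))
            current = None
        offset = end + (1 if ws else 0)
    if current:
        spans.append(tuple(current))
    return spans

# bio_to_spans(["Contact", "Jane", "Doe"], ["O", "B-NAME", "I-NAME"])
# -> [(8, 16, 'NAME')]
```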
Then we tried to ship the demo with the HuggingFace ai4privacy/pii-masking-200k
dataset, which has the same shape but different column names + a slightly different
tag vocabulary. The first version of the docs read: "you'll need a converter to span
offsets before BrewSLM picks it up." That sentence is the moment we knew this
approach didn't scale.
The category, not the instance
The Kaggle converter is one instance of a category: "BIO-tagged tokens + labels with some entity-type vocabulary." The category is the right unit of abstraction. The instance is what the user has in front of them.
What if the import pipeline could:
- Sniff the columns of any dataset (HF, Kaggle, JSONL, CSV) the same way.
- Detect the category — BIO-tagged tokens here, classification labels there, preference triples elsewhere.
- Propose a mapping that's correct for that category.
- Let the user confirm.
That's the introspector. The Kaggle converter became one configured instance of the
general bio_to_spans mapper, with a project-level entity-type-map
override for the demo's domain. The mapper itself is generic; the converter shrank to
a config dict.
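In spirit, the whole Kaggle-specific surface now looks something like this. A hypothetical sketch: the key names, and especially the canonical entity types on the right-hand side of the map, are assumptions, not the repo's actual values.

```python
# Hypothetical registry entry: everything Kaggle-specific is data, not code.
KAGGLE_PII_CONFIG = {
    "mapper": "bio_to_spans",       # the generic mapper does the work
    "columns": {
        "tokens": "tokens",
        "tags": "labels",
        "text": "full_text",        # used for exact offset alignment when present
    },
    # project-level entity-type-map override; target names are assumed
    "entity_type_map": {
        "NAME_STUDENT": "PERSON",
        "EMAIL": "EMAIL",
        "USERNAME": "USERNAME",
        "ID_NUM": "ID",
    },
}
```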
The sniffer is short
The column classifier looks at ~20 sample rows. For each column, it tries strict
shape predicates first (is the value a list of BIO tags? a list of
{role, content} dicts? a list of {type, start, end} entity
spans?), then falls back to scalar heuristics — boolean, numeric, categorical (low
cardinality on short single-token strings), text-like (long, multi-word). Each rule
is independently testable; the whole module is under 400 lines.
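A sketch of the per-column pass under the same ordering (strict shape predicates first, scalar heuristics second); the names, the cardinality threshold, and the exact predicate set are assumptions.

```python
import re

BIO_TAG = re.compile(r"^(O|[BI]-[A-Z_]+)$")

def is_bio_tag_list(v):
    """Strict shape predicate: non-empty list of BIO tags."""
    return (isinstance(v, list) and v
            and all(isinstance(t, str) and BIO_TAG.match(t) for t in v))

def is_token_list(v):
    """Non-empty list of plain strings that are not BIO tags."""
    return (isinstance(v, list) and v
            and all(isinstance(t, str) for t in v)
            and not is_bio_tag_list(v))

def is_chat_list(v):
    """List of {role, content} dicts."""
    return (isinstance(v, list) and v
            and all(isinstance(d, dict) and {"role", "content"} <= d.keys()
                    for d in v))

def classify_column(samples):
    """Classify one column from ~20 sampled values."""
    if all(is_bio_tag_list(v) for v in samples):
        return "bio_tags"
    if all(is_token_list(v) for v in samples):
        return "token_list"
    if all(is_chat_list(v) for v in samples):
        return "chat"
    if all(isinstance(v, bool) for v in samples):
        return "boolean"
    if all(isinstance(v, (int, float)) and not isinstance(v, bool)
           for v in samples):
        return "numeric"
    if all(isinstance(v, str) for v in samples):
        # low cardinality on short single-token strings -> categorical
        if len(set(samples)) <= 10 and all(" " not in v for v in samples):
            return "categorical"
        return "text"  # long, multi-word
    return "unknown"
```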
The shape detector then combines column types into ranked hypotheses. A tokens-list
+ a BIO-tag list of the same length is the load-bearing NER fingerprint. A text-like
column + a categorical column with a small label set is classification. A
{prompt, chosen, rejected} triple of text columns is DPO. Etc. Each
hypothesis carries a confidence and a rationale.
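The combination step then reads as a ranked rule table. A sketch: the confidences mirror the numbers quoted in this post, but the mapper ids (other than bio_to_spans) and the rule set are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    mapper: str
    confidence: float
    rationale: str

def rank_hypotheses(coltypes):
    """coltypes: column name -> classifier result, e.g. {"tokens": "token_list"}."""
    by_type = {}
    for name, ctype in coltypes.items():
        by_type.setdefault(ctype, []).append(name)
    hyps = []
    # NER fingerprint (the real detector also checks per-row length equality)
    if by_type.get("token_list") and by_type.get("bio_tags"):
        hyps.append(Hypothesis("bio_to_spans", 0.95,
                               "tokens list + BIO tag list"))
    if by_type.get("text") and by_type.get("categorical"):
        hyps.append(Hypothesis("text_classification", 0.85,
                               "text-like column + small label set"))
    if {"prompt", "chosen", "rejected"} <= set(by_type.get("text", [])):
        hyps.append(Hypothesis("preference_triples", 0.90,
                               "prompt/chosen/rejected text columns"))
    return sorted(hyps, key=lambda h: -h.confidence)
```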
Never silently auto-picks
The biggest mistake the introspector could make would be to silently apply a wrong mapping. So it doesn't apply anything: it proposes. Two gates:
- Confidence threshold (0.80). Below this, the CLI refuses --auto unless you also pass --force. The UI's import wizard shows a red banner that requires an explicit "I've reviewed the proposal" click before unlocking Preview.
- Mapper id whitelist. When the optional LLM-assist mode is on (off by default), the teacher model can suggest a mapper, but any mapper id it invents is rejected at the registry boundary. We don't want a hallucinated mapper to escape into the pipeline.
Is the old converter worth keeping? Briefly, yes
The 175-line converter still lives in the repo for one reason: it's the canonical example of how a complex domain-specific transform looks before generalization, and it serves as a fixture for the BIO-to-spans mapper's tests. It's no longer in the critical path. The demo doc now leads with the generic flow:
```
$ python -m app.cli.dataset_import introspect \
    --locator kaggle:competition:pii-detection-removal-from-educational-data
top proposal: bio_to_spans (confidence 0.95) — safe to --auto

$ python -m app.cli.dataset_import run \
    --locator kaggle:competition:pii-detection-removal-from-educational-data \
    --project 1 --auto --limit 5000
```
What this changes
For the user: a new dataset is a few seconds of introspection + a single
--auto flag, not a converter and a doc page. For us: maintenance load
drops from N converters (one per dataset shape) to M mappers (one per task shape),
and M is bounded by the number of training paradigms, which is small. The wins
compound as the source catalog grows — the HuggingFace and Kaggle connectors, added
in Phase D and Phase E respectively, were each a single file that wired into the
same introspector + mapper stack.
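That "single file" claim is mostly a statement about the interface: a connector only has to yield raw rows, and everything downstream (introspection, mapping, gating) is shared. A hypothetical sketch of that boundary; the actual Phase D / E signatures may differ.

```python
from typing import Iterator, Optional, Protocol

class Connector(Protocol):
    """What a new source must provide -- nothing about task shapes."""
    scheme: str  # e.g. "hf" or "kaggle", matched against the --locator prefix

    def rows(self, locator: str, limit: Optional[int] = None) -> Iterator[dict]:
        """Yield raw rows as dicts; the introspector does the rest."""
        ...
```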