Engineering · Dataset pipeline

Why the schema introspector beats hand-written converters

We shipped a 175-line BIO-to-spans converter for the Kaggle PII competition. Then we threw it away and built a column sniffer. The sniffer is now the only path. Here's why.

The original converter was correct, careful, and a dead end

The first PII demo we shipped depended on a script called kaggle_pii_to_brewslm.py. It read the Kaggle competition's train.json (BIO-tagged tokens + labels per essay), reconstructed character offsets by aligning against full_text when present and falling back to a tokens-plus-trailing-whitespace heuristic when not, merged B-X / I-X runs into single spans, and mapped the Kaggle tag vocabulary (NAME_STUDENT, EMAIL, USERNAME, ID_NUM, etc.) onto BrewSLM's canonical entity types. 175 lines, pure stdlib, well-tested.
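The B-X / I-X merge at the heart of that script can be sketched in a few lines. This is an illustrative reconstruction, not the actual `kaggle_pii_to_brewslm.py` code: function and variable names are ours, and the real script also handles offset reconstruction from `full_text` and the whitespace fallback.

```python
def bio_to_spans(tags, offsets):
    """Merge B-X / I-X runs into (entity_type, char_start, char_end) spans.

    tags    - BIO tags per token, e.g. "B-EMAIL", "I-EMAIL", "O"
    offsets - parallel list of (char_start, char_end) per token
    """
    spans = []
    current = None  # the span currently being extended
    for tag, (start, end) in zip(tags, offsets):
        if tag.startswith("B-"):
            if current:
                spans.append(current)
            current = (tag[2:], start, end)  # open a new span
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current = (current[0], current[1], end)  # extend the open span
        else:  # "O", or a stray I-X with no matching open span
            if current:
                spans.append(current)
            current = None
    if current:
        spans.append(current)
    return spans
```

The one subtlety worth noting: a stray `I-X` with no preceding `B-X` (which real BIO data does contain) closes the current span rather than crashing or silently extending the wrong one.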

Then we tried to ship the demo with the HuggingFace ai4privacy/pii-masking-200k dataset, which has the same shape but different column names + a slightly different tag vocabulary. The first version of the docs read: "you'll need a converter to span offsets before BrewSLM picks it up." That sentence is the moment we knew this approach didn't scale.

The category, not the instance

The Kaggle converter is one instance of a category: "BIO-tagged tokens + labels with some entity-type vocabulary." The category is the right unit of abstraction. The instance is what the user has in front of them.

What if the import pipeline could:

  1. Sniff the columns of any dataset (HF, Kaggle, JSONL, CSV) the same way.
  2. Detect the category — BIO-tagged tokens here, classification labels there, preference triples elsewhere.
  3. Propose a mapping that's correct for that category.
  4. Let the user confirm.

That's the introspector. The Kaggle converter became one configured instance of the general bio_to_spans mapper, with a project-level entity-type-map override for the demo's domain. The mapper itself is generic; the converter shrank to a config dict.
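For a sense of scale, a configured instance might look like the dict below. This is an illustrative sketch only: the keys and the canonical entity-type names are hypothetical, not the actual BrewSLM schema.

```python
# Hypothetical shape of the shrunken Kaggle converter: pure configuration
# for the generic bio_to_spans mapper, no transform logic of its own.
KAGGLE_PII_CONFIG = {
    "mapper": "bio_to_spans",
    # Which dataset columns play which role for this mapper.
    "columns": {"tokens": "tokens", "tags": "labels", "text": "full_text"},
    # Project-level entity-type-map override for the demo's domain:
    # Kaggle tag vocabulary -> canonical entity types (names illustrative).
    "entity_type_map": {
        "NAME_STUDENT": "person_name",
        "EMAIL": "email",
        "USERNAME": "username",
        "ID_NUM": "id_number",
    },
}
```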

The sniffer is short

The column-classifier looks at ~20 sample rows. For each column, it tries strict shape predicates first (is the value a list of BIO tags? a list of {role, content} dicts? a list of {type, start, end} entity spans?), then falls back to scalar heuristics — boolean, numeric, categorical (low cardinality on short single-token strings), text-like (long, multi-word). Each rule is independently testable; the whole module is under 400 lines.
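The predicate-then-heuristic cascade for a single column can be sketched like this. Names, type labels, and thresholds are illustrative, not the shipped module; the point is the ordering — strict shape predicates win before any fuzzy scalar heuristic runs.

```python
def classify_column(samples):
    """Classify one column from ~20 sample values (illustrative sketch)."""
    values = [v for v in samples if v is not None]
    if not values:
        return "unknown"
    # Strict shape predicates first: every sample must match.
    if all(_is_bio_tag_list(v) for v in values):
        return "bio_tags"
    if all(_is_chat_messages(v) for v in values):
        return "chat_messages"
    if all(isinstance(v, list) and v and all(isinstance(t, str) for t in v)
           for v in values):
        return "token_list"
    # Scalar heuristics, most specific first (bool before numeric:
    # isinstance(True, int) is True in Python).
    if all(isinstance(v, bool) for v in values):
        return "boolean"
    if all(isinstance(v, (int, float)) for v in values):
        return "numeric"
    if all(isinstance(v, str) for v in values):
        short = all(len(v.split()) <= 3 for v in values)
        low_cardinality = len(set(values)) <= max(2, len(values) // 2)
        if short and low_cardinality:
            return "categorical"
        return "text"
    return "unknown"

def _is_bio_tag_list(v):
    return (isinstance(v, list) and len(v) > 0 and
            all(isinstance(t, str) and (t == "O" or t[:2] in ("B-", "I-"))
                for t in v))

def _is_chat_messages(v):
    return (isinstance(v, list) and len(v) > 0 and
            all(isinstance(m, dict) and {"role", "content"} <= m.keys()
                for m in v))
```

Because each predicate is a standalone function over sample values, each rule gets its own unit test — which is exactly the "independently testable" property the paragraph above is claiming.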

The shape detector then combines column types into ranked hypotheses. A tokens-list + a BIO-tag list of the same length is the load-bearing NER fingerprint. A text-like column + a categorical column with a small label set is classification. A {prompt, chosen, rejected} triple of text columns is DPO. Etc. Each hypothesis carries a confidence and a rationale.
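A sketch of how column types might combine into ranked hypotheses — again with illustrative names and confidence values, assuming the column-type labels from the classifier step:

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    shape: str         # e.g. "ner_bio", "classification", "dpo"
    confidence: float
    rationale: str
    mapping: dict      # dataset column -> role

def detect_shapes(col_types, sample_rows):
    """Combine per-column types into ranked hypotheses (illustrative)."""
    hyps = []
    tokens = [c for c, t in col_types.items() if t == "token_list"]
    tags = [c for c, t in col_types.items() if t == "bio_tags"]
    # NER fingerprint: token list + BIO-tag list of equal length per row.
    for tc in tokens:
        for gc in tags:
            if all(len(r[tc]) == len(r[gc]) for r in sample_rows):
                hyps.append(Hypothesis(
                    "ner_bio", 0.95,
                    f"'{tc}' and '{gc}' are equal-length token/BIO lists",
                    {tc: "tokens", gc: "tags"}))
    texts = [c for c, t in col_types.items() if t == "text"]
    cats = [c for c, t in col_types.items() if t == "categorical"]
    # Classification fingerprint: text column + small label set.
    for tc in texts:
        for cc in cats:
            hyps.append(Hypothesis(
                "classification", 0.80,
                f"text column '{tc}' with categorical labels '{cc}'",
                {tc: "text", cc: "label"}))
    # DPO fingerprint: prompt/chosen/rejected triple of text columns.
    if {"prompt", "chosen", "rejected"} <= set(texts):
        hyps.append(Hypothesis(
            "dpo", 0.90, "prompt/chosen/rejected text triple",
            {c: c for c in ("prompt", "chosen", "rejected")}))
    return sorted(hyps, key=lambda h: h.confidence, reverse=True)
```

Keeping the rationale string on every hypothesis is what makes the confirmation step legible: the user sees *why* a mapping was proposed, not just a score.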

Never silently auto-picks

The biggest mistake the introspector could make would be to silently apply a wrong mapping. So it applies nothing on its own: it proposes. Two gates stand between a proposal and an import:

  1. Confidence. --auto is only honored when the top proposal clears a confidence threshold; below that, nothing is applied automatically.
  2. Confirmation. Without --auto, the proposed mapping is shown to the user, and the import doesn't run until they accept it.
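The gate logic is small enough to sketch in full. Function name, threshold, and the `(mapper, confidence)` pair shape are all illustrative assumptions, not the shipped code:

```python
AUTO_THRESHOLD = 0.9  # hypothetical cutoff behind the CLI's "safe to --auto" note

def resolve_proposal(proposals, auto=False):
    """Never apply a mapping silently: return the accepted proposal,
    or None when interactive confirmation is required.

    proposals: (mapper_name, confidence) pairs, best first.
    """
    if not proposals:
        return None  # nothing matched: the user must map columns by hand
    top = proposals[0]
    # Gate 1: --auto must be explicitly requested.
    # Gate 2: even then, the top proposal must clear the confidence bar.
    if auto and top[1] >= AUTO_THRESHOLD:
        return top
    return None
```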

Was the old converter worth keeping? Briefly, yes

The 175-line converter still lives in the repo for one reason: it's the canonical example of how a complex domain-specific transform looks before generalization, and it serves as a fixture for the BIO-to-spans mapper's tests. It's no longer in the critical path. The demo doc now leads with the generic flow:

$ python -m app.cli.dataset_import introspect \
    --locator kaggle:competition:pii-detection-removal-from-educational-data

top proposal: bio_to_spans (confidence 0.95) — safe to --auto

$ python -m app.cli.dataset_import run \
    --locator kaggle:competition:pii-detection-removal-from-educational-data \
    --project 1 --auto --limit 5000

What this changes

For the user: a new dataset is a few seconds of introspection + a single --auto flag, not a converter and a doc page. For us: maintenance load drops from N converters (one per dataset shape) to M mappers (one per task shape), and M is bounded by the number of training paradigms, which is small. The wins compound as the source catalog grows — the HuggingFace and Kaggle connectors added in Phase D / E were each a single file that wired into the same introspector + mapper stack.