What does a source locator like hf:imdb:train specify?

Where the rows live and which split

The introspector's confidence is below 0.80. What happens?

It requires an explicit --force to proceed

In a Map dry-run, every input row becomes…

Either a TransformedRow or a RejectedRow with a reason code

Why are unparseable rows surfaced as sentinel rows?

So nothing is silently dropped — you always know what didn't parse

Track 3 · With BrewSLM · Lesson 2

Ingest & map your data with per-row accountability

After this lesson you can ingest a dataset from a locator, accept or override the introspector's proposed mapping, read the per-row rejection breakdown in a dry-run, and commit the import.

Level: intermediate · Read time: ~10 min · Prerequisites: From script to platform: the BrewSLM lifecycle

In Track 2 you wrote your dataset as a Python list of dicts. That's perfect for learning and hopeless at scale: no provenance, no rejection accounting, no reuse. BrewSLM's stages 01–04 turn "get data in" into an auditable, per-row-accountable import.

01 Ingest — source locators

You don't write a loader; you name a source. A locator says where the rows live:

hf:imdb:train            # a Hugging Face dataset + split
jsonl:/path/to/data.jsonl
kaggle:competition:slug
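
To make the locator format concrete, here is a minimal sketch of how such a string could be split into a scheme and a scheme-specific remainder. The names Locator, parse_locator, and split_hf_target are hypothetical and chosen for illustration; BrewSLM's own parser may work differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Locator:
    scheme: str  # e.g. "hf", "jsonl", "kaggle"
    target: str  # everything after the first colon, interpreted per scheme


def parse_locator(text: str) -> Locator:
    """Split a locator at its first colon into scheme and target."""
    scheme, sep, target = text.partition(":")
    if not sep or not scheme or not target:
        raise ValueError(f"not a locator: {text!r}")
    return Locator(scheme=scheme, target=target)


def split_hf_target(target: str) -> Tuple[str, str]:
    """Read an hf target such as 'imdb:train' as (dataset id, split)."""
    dataset_id, sep, split = target.rpartition(":")
    if not sep or not dataset_id or not split:
        raise ValueError(f"expected 'id:split', got {target!r}")
    return dataset_id, split
```

Splitting only at the first colon keeps file paths intact, so the jsonl target stays /path/to/data.jsonl and each scheme decides how to read its own remainder.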

Source connectors implement load() + describe(), and — importantly — unparseable rows surface as sentinel rows, not silent drops. You always know what didn't parse.
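
The sentinel-row idea can be sketched with a toy connector over in-memory JSONL lines. Only the method names load() and describe() come from the lesson; their signatures, the SentinelRow fields, and the JsonlLinesConnector class are assumptions made for this example.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, Union


@dataclass
class SentinelRow:
    """Stands in for a line that could not be parsed (hypothetical shape)."""
    line_number: int
    raw: str
    error: str


class JsonlLinesConnector:
    """Toy connector; the real connector interface is not shown in the lesson."""

    def __init__(self, lines: Iterable[str]):
        self._lines = list(lines)

    def describe(self) -> dict:
        # Assumed to return lightweight metadata about the source.
        return {"kind": "jsonl", "line_count": len(self._lines)}

    def load(self) -> Iterator[Union[dict, SentinelRow]]:
        for number, raw in enumerate(self._lines, start=1):
            try:
                row = json.loads(raw)
            except json.JSONDecodeError as exc:
                # Surface the failure instead of skipping the line.
                yield SentinelRow(number, raw, exc.msg)
                continue
            if isinstance(row, dict):
                yield row
            else:
                yield SentinelRow(number, raw, "not a JSON object")
```

Because load() yields exactly one item per input line, the number of rows out always equals the number of lines in, which is what makes later per-row accounting possible.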

02 Introspect — guess the shape, never silently

The introspector samples ~20 rows and proposes how to map them to a task: a ranked list of ShapeHypothesis objects and a top ProposedMapping with a confidence and a rationale. This is the platform doing what you did in your head when you decided "prompt goes here, label goes there." Its contract: it never auto-picks silently — below 0.80 confidence it requires an explicit --force.
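
The confidence contract can be sketched as a small gate. The 0.80 threshold and the ProposedMapping name come from the lesson; the field names, LowConfidenceError, and accept_mapping are hypothetical, and the real introspector may structure this differently.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.80  # threshold stated in the lesson


@dataclass
class ProposedMapping:
    # Illustrative fields only, not BrewSLM's actual schema.
    task: str
    columns: dict
    confidence: float
    rationale: str


class LowConfidenceError(Exception):
    """Raised when a low-confidence proposal is used without force."""


def accept_mapping(proposal: ProposedMapping, force: bool = False) -> ProposedMapping:
    """Return the proposal, refusing low-confidence ones unless forced."""
    if proposal.confidence < CONFIDENCE_THRESHOLD and not force:
        raise LowConfidenceError(
            f"confidence {proposal.confidence:.2f} is below "
            f"{CONFIDENCE_THRESHOLD:.2f}; pass force=True to proceed"
        )
    return proposal
```

The point of raising rather than falling back to a default is that a weak guess can only proceed through a deliberate choice, mirroring the --force flag.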

03 Map (dry-run) — preview before you commit

Before writing anything, a dry-run shows a sample of accepted rows and a full rejection breakdown grouped by reason. Every input row becomes either a TransformedRow or a RejectedRow with a stable reason code — the per-row accountability your list of dicts never had.

$ python -m app.cli.dataset_import run \
    --locator hf:imdb:train --project 1 --auto --limit 5000 --dry-run

accepted: 4,812
rejected:   188
  empty_text ............ 121
  label_out_of_vocab ....  54
  duplicate .............  13
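
A breakdown like the one above can be produced by mapping every row to exactly one of two outcomes and counting the reasons. This is a minimal sketch: the reason codes are taken from the sample output, while the field names, map_row, dry_run, and the text/label column assumption are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Set, Union


@dataclass
class TransformedRow:
    text: str
    label: str


@dataclass
class RejectedRow:
    raw: dict
    reason: str  # stable reason code


def map_row(row: dict, vocab: Set[str], seen: Set[str]) -> Union[TransformedRow, RejectedRow]:
    """Map one input row; assumes 'text' and 'label' columns."""
    text = str(row.get("text") or "").strip()
    if not text:
        return RejectedRow(row, "empty_text")
    label = row.get("label")
    if not isinstance(label, str) or label not in vocab:
        return RejectedRow(row, "label_out_of_vocab")
    if text in seen:
        return RejectedRow(row, "duplicate")
    seen.add(text)
    return TransformedRow(text, label)


def dry_run(rows: Iterable[dict], vocab: Set[str]):
    """Return (accepted, rejected, breakdown) without writing anything."""
    seen: Set[str] = set()
    results = [map_row(row, vocab, seen) for row in rows]
    accepted = [r for r in results if isinstance(r, TransformedRow)]
    rejected = [r for r in results if isinstance(r, RejectedRow)]
    breakdown = Counter(r.reason for r in rejected)
    return accepted, rejected, breakdown
```

Since map_row always returns one of the two types, accepted plus rejected equals the input count, just as 4,812 + 188 = 5,000 in the run above.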

A pattern you'll see everywhere

Rejected rows are never all-or-nothing. They're grouped by reason and individually selectable, so you can bulk-drop the genuine junk and rescue the rows that were rejected for a fixable reason. This is the platform version of the data hygiene you learned in Track 1.
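
The drop-or-rescue pattern can be sketched as a triage over rejected rows grouped by reason. Everything here is illustrative: group_by_reason, triage, and the idea of a per-reason fixer function are assumptions, not BrewSLM's UI or API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Set


@dataclass
class RejectedRow:
    raw: dict
    reason: str


def group_by_reason(rejected: List[RejectedRow]) -> Dict[str, List[RejectedRow]]:
    groups: Dict[str, List[RejectedRow]] = defaultdict(list)
    for row in rejected:
        groups[row.reason].append(row)
    return dict(groups)


def triage(
    rejected: List[RejectedRow],
    drop_reasons: Set[str],
    fixers: Dict[str, Callable[[dict], dict]],
):
    """Split rejected rows into (rescued raw rows, dropped, remaining)."""
    rescued: List[dict] = []
    dropped: List[RejectedRow] = []
    remaining: List[RejectedRow] = []
    for reason, rows in group_by_reason(rejected).items():
        if reason in drop_reasons:
            dropped.extend(rows)  # genuine junk, dropped in bulk
        elif reason in fixers:
            rescued.extend(fixers[reason](row.raw) for row in rows)
        else:
            remaining.extend(rows)  # left for individual review
    return rescued, dropped, remaining
```

Rescued rows are returned as raw dicts so they can go through mapping again, and the three lists together still account for every rejected row.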

04 Map (commit) — write it down, emit the event

After you confirm, the mapped rows are appended to the project's dataset, and the stage emits its RunEvent — the first durable trace in your audit spine:

RunEvent: ingestion / dataset_import_run   (severity: info)
  payload: { source, mapper, accepted: 4812, rejected: 188,
             written_path, config_id }
  on error → dataset_import_failed
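
The event above can be modelled as a small record built at commit time. In this sketch the stage, event names, info severity, and success payload keys follow the listing; the RunEvent field names, the "error" severity, the failure payload, and the commit_import helper are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RunEvent:
    stage: str
    name: str
    severity: str
    payload: dict = field(default_factory=dict)


def commit_import(
    write_rows: Callable[[List[dict]], str],  # hypothetical writer, returns the written path
    accepted_rows: List[dict],
    rejected_count: int,
    source: str,
    mapper: str,
    config_id: str,
) -> RunEvent:
    """Write accepted rows and return the event describing the outcome."""
    try:
        written_path = write_rows(accepted_rows)
    except Exception as exc:  # broad on purpose: any failure becomes an event
        return RunEvent(
            "ingestion", "dataset_import_failed", "error",
            {"source": source, "error": str(exc)},
        )
    return RunEvent(
        "ingestion", "dataset_import_run", "info",
        {
            "source": source,
            "mapper": mapper,
            "accepted": len(accepted_rows),
            "rejected": rejected_count,
            "written_path": written_path,
            "config_id": config_id,
        },
    )
```

Returning an event on both paths means a failed import leaves a trace too, rather than only the successful ones.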

One nuance worth knowing: bulk-dropping in the UI affects only the sample rendered to you; the true accepted/rejected counts are preserved in the audit row, so the numbers never lie. With data ingested and mapped, the next lesson covers the platform's answer to "in practice, hundreds more rows" — generating synthetic data and reviewing it.

Key idea

Importing data isn't one step — it's ingest → introspect → dry-run → commit, and every row is accounted for. You trade a hand-written list for provenance, a confidence-scored mapping, and a rejection breakdown you can act on.

Key terms

source locator: A string naming where rows live (hf:id:split, jsonl:/path, kaggle:...).
Introspect: Sampling rows to propose a task mapping; never auto-picks below 0.80 confidence.
ProposedMapping: The top mapping proposal with a confidence and rationale.
Map (dry-run): A preview of accepted rows plus a full rejection breakdown, before writing.
RejectedRow: An input row that failed mapping, tagged with a stable reason code.
dataset_import_run RunEvent: The audit event emitted on commit, carrying accepted/rejected counts and the written path; a failed import emits dataset_import_failed instead.

Check yourself
