Ingest & map your data with per-row accountability
After this lesson you can ingest a dataset from a locator, accept or override the introspector's proposed mapping, read the per-row rejection breakdown in a dry-run, and commit the import.
In Track 2 you wrote your dataset as a Python list of dicts. That's perfect for learning and hopeless at scale: no provenance, no rejection accounting, no reuse. BrewSLM's stages 01–04 turn "get data in" into an auditable, per-row-accountable import.
01 Ingest — source locators
You don't write a loader; you name a source. A locator says where the rows live:
hf:imdb:train # a Hugging Face dataset + split
jsonl:/path/to/data.jsonl
kaggle:competition:slug
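To make the locator format concrete, here is a minimal sketch of splitting a locator into a scheme and a target. The `Locator` class and `parse_locator` function are hypothetical names for illustration; the lesson does not show how BrewSLM parses locators internally.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Locator:
    """Hypothetical parsed form of a locator string."""
    scheme: str  # e.g. "hf", "jsonl", "kaggle"
    target: str  # everything after the first colon, left uninterpreted


def parse_locator(text: str) -> Locator:
    # Split on the first colon only, so file paths and "id:split" pairs stay intact.
    scheme, sep, target = text.partition(":")
    if not sep or not scheme or not target:
        raise ValueError(f"not a locator: {text!r}")
    return Locator(scheme=scheme, target=target)


print(parse_locator("hf:imdb:train"))
print(parse_locator("jsonl:/path/to/data.jsonl"))
```

Each connector would then interpret its own target, for example reading `imdb:train` as a dataset id plus a split.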
Source connectors implement load() and describe(). Importantly, unparseable rows surface as sentinel rows rather than being dropped silently, so you always know what didn't parse.
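The sentinel-row idea can be sketched with a toy connector over in-memory JSONL lines. Only the method names load() and describe() come from the lesson; the class name, the `__parse_error__` marker key, and the shape of the sentinel row are assumptions made for this example.

```python
import json
from typing import Any, Iterator

# Hypothetical marker key; the platform's real sentinel format is not shown in this lesson.
PARSE_ERROR_KEY = "__parse_error__"


class JsonlLinesConnector:
    """Illustrative connector over in-memory JSONL lines."""

    def __init__(self, lines: list[str]) -> None:
        self.lines = lines

    def load(self) -> Iterator[dict[str, Any]]:
        for number, line in enumerate(self.lines, start=1):
            try:
                row = json.loads(line)
                if not isinstance(row, dict):
                    raise ValueError("row is not a JSON object")
                yield row
            except ValueError as exc:  # json.JSONDecodeError subclasses ValueError
                # Surface the failure as a row instead of dropping it.
                yield {PARSE_ERROR_KEY: str(exc), "line": number, "raw": line}

    def describe(self) -> dict[str, Any]:
        return {"kind": "jsonl", "line_count": len(self.lines)}


connector = JsonlLinesConnector(['{"text": "great", "label": "pos"}', '{"text": "bad"'])
rows = list(connector.load())
unparsed = [r for r in rows if PARSE_ERROR_KEY in r]
print(len(rows), "rows,", len(unparsed), "unparseable")
```

Because the broken line still comes out of load() as a row, later stages can count it and report it instead of never seeing it.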
02 Introspect — guess the shape, never silently
The introspector samples ~20 rows and proposes how to map them to a task: a ranked list of ShapeHypothesis objects and a top ProposedMapping with a confidence and a rationale. This is the platform doing what you did in your head when you decided "prompt goes here, label goes there." Its contract: it never auto-picks silently. When the top mapping's confidence is below 0.80, it requires an explicit --force.
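The confidence gate described above might look roughly like this. The 0.80 threshold and the confidence and rationale fields come from the lesson; the `fields` attribute, `LowConfidenceError`, and `accept_mapping` are hypothetical names, and the `force` argument stands in for the --force flag.

```python
from dataclasses import dataclass

CONFIDENCE_THRESHOLD = 0.80  # threshold stated in the lesson


@dataclass
class ProposedMapping:
    # "fields" is illustrative; only confidence and rationale are named in the lesson.
    fields: dict[str, str]
    confidence: float
    rationale: str


class LowConfidenceError(Exception):
    """Hypothetical error raised when the gate blocks an automatic pick."""


def accept_mapping(proposal: ProposedMapping, force: bool = False) -> ProposedMapping:
    # Refuse to pick a low-confidence mapping unless the caller opts in explicitly.
    if proposal.confidence < CONFIDENCE_THRESHOLD and not force:
        raise LowConfidenceError(
            f"confidence {proposal.confidence:.2f} is below {CONFIDENCE_THRESHOLD:.2f}; "
            "review the mapping or pass force=True"
        )
    return proposal


proposal = ProposedMapping(
    fields={"text": "prompt", "label": "label"},
    confidence=0.64,
    rationale="one free-text column plus one low-cardinality column",
)
try:
    accept_mapping(proposal)
except LowConfidenceError as exc:
    print("blocked:", exc)
print("forced:", accept_mapping(proposal, force=True).fields)
```

The point of the design is that a weak guess fails loudly: the caller either reviews the mapping or takes responsibility for it.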
03 Map (dry-run) — preview before you commit
Before writing anything, a dry-run shows a sample of accepted rows and a full rejection breakdown grouped by reason. Every input row becomes either a TransformedRow or a RejectedRow with a stable reason code — the per-row accountability your list of dicts never had.
$ python -m app.cli.dataset_import run \
    --locator hf:imdb:train --project 1 --auto --limit 5000 --dry-run
accepted: 4,812
rejected: 188
  empty_text ............ 121
  label_out_of_vocab .... 54
  duplicate ............. 13
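As a rough sketch of how every input row ends up as either a TransformedRow or a RejectedRow, the example below maps a handful of rows and tallies rejections by reason. The class names and reason codes come from the lesson; the field names, the `LABELS` vocabulary, and the specific checks are assumptions chosen to mirror the breakdown above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Union

LABELS = {"pos", "neg"}  # assumed label vocabulary for this sketch


@dataclass
class TransformedRow:
    text: str
    label: str


@dataclass
class RejectedRow:
    raw: dict
    reason: str  # stable reason code, e.g. "empty_text"


def map_rows(rows: list[dict]) -> list[Union[TransformedRow, RejectedRow]]:
    seen: set[tuple[str, str]] = set()
    out: list[Union[TransformedRow, RejectedRow]] = []
    for raw in rows:
        text = str(raw.get("text") or "").strip()
        label = raw.get("label")
        if not text:
            out.append(RejectedRow(raw, "empty_text"))
        elif not isinstance(label, str) or label not in LABELS:
            out.append(RejectedRow(raw, "label_out_of_vocab"))
        elif (text, label) in seen:
            out.append(RejectedRow(raw, "duplicate"))
        else:
            seen.add((text, label))
            out.append(TransformedRow(text, label))
    return out


rows = [
    {"text": "loved it", "label": "pos"},
    {"text": "", "label": "neg"},
    {"text": "meh", "label": "neutral"},
    {"text": "loved it", "label": "pos"},
]
results = map_rows(rows)
accepted = [r for r in results if isinstance(r, TransformedRow)]
breakdown = Counter(r.reason for r in results if isinstance(r, RejectedRow))
print("accepted:", len(accepted))
print("rejected:", sum(breakdown.values()))
for reason, count in breakdown.most_common():
    print(" ", reason, count)
```

The output list always has the same length as the input, which is what makes the accepted and rejected counts add up to the row limit.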
A pattern you'll see everywhere
Rejected rows are never all-or-nothing. They're grouped by reason and individually selectable, so you can bulk-drop the genuine junk and rescue the rows that were rejected for a fixable reason. This is the platform version of the data hygiene you learned in Track 1.
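Outside the UI, the same idea can be sketched in a few lines: group rejected rows by reason, drop whole groups that are junk, and repair individual rows whose rejection has a fixable cause. The sample rows and the `LABEL_ALIASES` fix-up table are invented for illustration and are not part of the platform.

```python
from collections import defaultdict

# Rejected rows as (reason, raw_row) pairs; reason codes follow the dry-run breakdown.
rejected = [
    ("empty_text", {"text": "", "label": "pos"}),
    ("label_out_of_vocab", {"text": "a fine film", "label": "Positive"}),
    ("label_out_of_vocab", {"text": "unclear", "label": "???"}),
    ("duplicate", {"text": "loved it", "label": "pos"}),
]

# Assumed fix-up table for this sketch.
LABEL_ALIASES = {"positive": "pos", "negative": "neg"}

by_reason = defaultdict(list)
for reason, raw in rejected:
    by_reason[reason].append(raw)

# Bulk-drop whole groups that are genuine junk.
dropped = by_reason.pop("empty_text", []) + by_reason.pop("duplicate", [])

# Rescue individual rows whose rejection has a fixable cause.
rescued, still_rejected = [], []
for raw in by_reason.pop("label_out_of_vocab", []):
    fixed = LABEL_ALIASES.get(str(raw["label"]).lower())
    if fixed:
        rescued.append({**raw, "label": fixed})
    else:
        still_rejected.append(raw)

print(len(dropped), "dropped,", len(rescued), "rescued,", len(still_rejected), "still rejected")
```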
04 Map (commit) — write it down, emit the event
After you confirm, the mapped rows are appended to the project's dataset, and the stage emits its RunEvent — the first durable trace in your audit spine:
RunEvent: ingestion / dataset_import_run (severity: info)
  payload: { source, mapper, accepted: 4812, rejected: 188,
             written_path, config_id }
  on error → dataset_import_failed
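A minimal sketch of the commit step, assuming the dataset is a JSONL file and the event is a plain dict. The payload keys and the two event names come from the listing above; `commit_import`, the `stage` and `event` keys, the "error" severity on failure, and the sample values `auto` and `cfg-1` are assumptions for this example.

```python
import json
import tempfile
from pathlib import Path


def commit_import(rows, rejected_count, dataset_path, source, mapper, config_id):
    """Append accepted rows as JSONL and return an event dict for the audit trail."""
    try:
        # Serialize everything first so a bad row fails before anything is written.
        lines = [json.dumps(row) for row in rows]
        with Path(dataset_path).open("a", encoding="utf-8") as handle:
            for line in lines:
                handle.write(line + "\n")
    except (OSError, TypeError, ValueError) as exc:
        return {
            "stage": "ingestion",
            "event": "dataset_import_failed",
            "severity": "error",  # assumed; the lesson gives no severity for failures
            "payload": {"source": source, "error": str(exc)},
        }
    return {
        "stage": "ingestion",
        "event": "dataset_import_run",
        "severity": "info",
        "payload": {
            "source": source,
            "mapper": mapper,
            "accepted": len(rows),
            "rejected": rejected_count,
            "written_path": str(dataset_path),
            "config_id": config_id,
        },
    }


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "dataset.jsonl"
    event = commit_import(
        [{"text": "loved it", "label": "pos"}], 1, path, "hf:imdb:train", "auto", "cfg-1"
    )
    written = path.read_text(encoding="utf-8").splitlines()

print(event["event"], event["payload"]["accepted"], len(written))
```

Returning a failure event rather than raising keeps the audit trail complete even when the write does not happen.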
One nuance worth knowing: bulk-dropping in the UI affects only the sample rendered to you. The true accepted/rejected counts are preserved in the audit row, so the numbers never lie.
With data ingested and mapped, the next lesson covers the platform's answer to "in practice, hundreds more rows": generating synthetic data and reviewing it.
Key idea
Importing data isn't one step — it's ingest → introspect → dry-run → commit, and every row is accounted for. You trade a hand-written list for provenance, a confidence-scored mapping, and a rejection breakdown you can act on.
Key terms
- source locator: A string naming where rows live (hf:id:split, jsonl:/path, kaggle:...).
- Introspect: Sampling rows to propose a task mapping; never auto-picks below 0.80 confidence.
- ProposedMapping: The top mapping proposal with a confidence and rationale.
- Map (dry-run): A preview of accepted rows plus a full rejection breakdown, before writing.
- RejectedRow: An input row that failed mapping, tagged with a stable reason code.
- dataset_import RunEvent: The audit event emitted on commit, carrying accepted/rejected counts and the written path.