Clean and prepare: the manifest is the source of truth
After this lesson you can run the optional clean stage, understand what Prepare produces, and explain why the manifest — not raw files — is the single source of truth for training and evaluation.
In Track 2 you tokenized examples, built the loss mask, and split train/val by hand. BrewSLM folds all of that into two stages whose output is one authoritative artifact: the manifest.
05 Clean (optional)
If your rows are raw documents rather than tidy pairs, the clean stage processes them and emits cleaned text, chunk JSONL, and per-document metadata: PII findings, a quality score, and a text hash. It runs as a background task that you poll by task_id, so a large cleaning pass never holds a request open past the proxy timeout. It also has explicit failure modes: cleaning_pii_block and cleaning_outlier_threshold_exceeded. For already-clean classification pairs like our sentiment data, you can skip it.
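The polling pattern described above can be sketched as follows. This is a minimal illustration, not BrewSLM's client: `fetch_status`, the `"status"` and `"error"` keys, and the `"done"` / `"failed"` values are all assumptions made for the example. Only the two failure codes come from the lesson.

```python
import time
from typing import Callable, Dict

# The two failure codes named in the lesson. Everything else is a hypothetical shape.
CLEAN_FAILURES = {"cleaning_pii_block", "cleaning_outlier_threshold_exceeded"}


class CleanFailed(RuntimeError):
    """Raised when the clean task reports a failure."""


def wait_for_clean(
    fetch_status: Callable[[str], Dict],
    task_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Poll a background task until it finishes, fails, or times out.

    `fetch_status` stands in for whatever call returns the task state.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_status(task_id)
        error = state.get("error")
        if error in CLEAN_FAILURES or state.get("status") == "failed":
            raise CleanFailed(str(error))
        if state.get("status") == "done":
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"clean task {task_id} still running after {timeout_s}s")
        sleep(interval_s)  # each poll is a short request, so no call outlives the proxy timeout


# Demo with a fake status source: two "running" replies, then "done".
_states = iter([{"status": "running"}, {"status": "running"}, {"status": "done", "chunks": 12}])
result = wait_for_clean(lambda _task_id: next(_states), "task-123", sleep=lambda _s: None)
print(result["status"])
```

Passing the status call in as a function keeps the loop independent of any particular HTTP client and makes the failure paths easy to exercise without a running server.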
06 Prepare — build the manifest
Prepare takes your synthetic + cleaned rows and the project's dataset-adapter preset and produces:
prepared/
  manifest.json   # the source of truth: counts, schema, task_profile,
                  # output_schema.scoring_mode, paths, hashes
  train.jsonl     # the training split
  eval.jsonl      # the held-out evaluation split
That train.jsonl / eval.jsonl split is the platform doing your train_test_split from Track 2 — and the held-out eval.jsonl is what the eval pack will score against later.
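To make the layout concrete, here is a small sketch of loading a split by asking the manifest where it lives rather than hard-coding a filename. The lesson lists what the manifest records (counts, paths, and so on) but not its exact JSON shape, so the `"paths"` and `"counts"` key layout below is an assumption, and the demo rows are made up.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def load_split(prepared_dir: Path, split: str) -> List[Dict]:
    """Load one split, resolving its file through manifest.json."""
    manifest = json.loads((prepared_dir / "manifest.json").read_text(encoding="utf-8"))
    path = prepared_dir / manifest["paths"][split]  # assumed key layout
    with path.open(encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    expected = manifest["counts"][split]  # assumed key layout
    if len(rows) != expected:
        raise ValueError(f"{split}: manifest lists {expected} rows, file has {len(rows)}")
    return rows


# Demo on a throwaway directory with made-up sentiment rows.
with tempfile.TemporaryDirectory() as tmp:
    prepared = Path(tmp) / "prepared"
    prepared.mkdir()
    splits = {
        "train": [
            {"text": "great coffee", "label": "positive"},
            {"text": "burnt and bitter", "label": "negative"},
        ],
        "eval": [{"text": "would order again", "label": "positive"}],
    }
    for name, demo_rows in splits.items():
        (prepared / f"{name}.jsonl").write_text(
            "".join(json.dumps(r) + "\n" for r in demo_rows), encoding="utf-8"
        )
    (prepared / "manifest.json").write_text(
        json.dumps({
            "paths": {"train": "train.jsonl", "eval": "eval.jsonl"},
            "counts": {"train": 2, "eval": 1},
        }),
        encoding="utf-8",
    )

    train_rows = load_split(prepared, "train")
    eval_rows = load_split(prepared, "eval")

print(len(train_rows), len(eval_rows))
```

The count check is the useful part: a split file that no longer matches what the manifest recorded fails loudly instead of quietly training on different data.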
From Track 2: where did tokenization and the loss mask go?
They didn't disappear — they moved into the task handler at train time. The manifest records the task_profile and scoring_mode; the handler for that task knows how to apply the chat template, build input_ids/labels, and mask the prompt — exactly the mechanics you wrote in lesson 2.4. You declared the what; the handler does the how.
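As a reminder of what "mask the prompt" means mechanically, here is a toy sketch of a handler lookup keyed on task_profile. It is not BrewSLM's handler code: the `HANDLERS` registry, the `"classification"` profile value, and the pre-tokenized integer IDs are assumptions for illustration, and the chat-template step is left out. The -100 value is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import Callable, Dict, List

IGNORE_INDEX = -100  # PyTorch's CrossEntropyLoss skips targets with this value by default

Example = Dict[str, List[int]]


def build_masked_example(prompt_ids: List[int], completion_ids: List[int]) -> Example:
    """Concatenate prompt and completion, and hide prompt positions from the loss."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}


# Hypothetical registry: the manifest declares the task, the registry supplies the mechanics.
HANDLERS: Dict[str, Callable[[List[int], List[int]], Example]] = {
    "classification": build_masked_example,
}


def handler_for(manifest: Dict) -> Callable[[List[int], List[int]], Example]:
    """Pick the handler from the task_profile recorded in the manifest."""
    return HANDLERS[manifest["task_profile"]]


manifest = {"task_profile": "classification"}
example = handler_for(manifest)([11, 12, 13], [21, 22])
print(example["labels"])  # [-100, -100, -100, 21, 22]
```

Only the two completion tokens contribute to the loss. The three prompt positions stay in input_ids as context but are ignored as targets.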
The manifest is the contract
The single most important rule downstream: nothing reads from disk paths directly — everything reads the manifest. Train, Evaluate, and Export all consume manifest.json. That's why a BrewSLM run is reproducible in a way your script wasn't: the manifest pins exactly which rows, which schema, which scoring mode, and which hashes went into the run. Change the data and you get a new manifest, not a silently different result from the same code.
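One way to picture the "pins which hashes" guarantee is a check that recomputes each file's digest and compares it with the value stored in the manifest. The lesson says the manifest records hashes but not which algorithm or layout, so SHA-256 and a `"hashes"` mapping of filename to hex digest are assumptions here.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Dict, List


def sha256_file(path: Path) -> str:
    """Hash a file in 1 MiB blocks so large splits do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def changed_files(prepared_dir: Path, manifest: Dict) -> List[str]:
    """Return the files whose current digest differs from the one in the manifest."""
    return [
        name
        for name, recorded in manifest["hashes"].items()  # assumed key layout
        if sha256_file(prepared_dir / name) != recorded
    ]


# Demo: record a hash, then edit one label and check again.
with tempfile.TemporaryDirectory() as tmp:
    prepared = Path(tmp)
    train = prepared / "train.jsonl"
    train.write_text('{"text": "great coffee", "label": "positive"}\n', encoding="utf-8")
    manifest = {"hashes": {"train.jsonl": sha256_file(train)}}

    before = changed_files(prepared, manifest)
    train.write_text('{"text": "great coffee", "label": "negative"}\n', encoding="utf-8")
    after = changed_files(prepared, manifest)

print(before, after)  # [] ['train.jsonl']
```

A single flipped label is enough to change the digest, which is why edited data has to go through Prepare again and come out with a new manifest.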
Key idea
Prepare turns reviewed rows into a manifest plus train/eval splits. The manifest carries the task profile and scoring mode and is the source of truth for every later stage — so the tokenization and masking you did by hand become a declaration the handler executes, reproducibly.
Key terms
- Clean stage: Optional processing of raw documents into cleaned text + chunks + PII/quality metadata.
- manifest: prepared/manifest.json, the authoritative record of rows, schema, task_profile, scoring_mode, paths, and hashes.
- train/eval split: train.jsonl and eval.jsonl produced by Prepare; the platform's train_test_split.
- task_profile: The declared task type recorded in the manifest, used to pick the handler.
- scoring_mode: How outputs are scored, recorded in the manifest's output_schema.