What is the manifest the source of truth for?

Every downstream stage: Train, Evaluate, and Export all read it

Your by-hand tokenization and loss mask moved where in BrewSLM?

Into the task handler at train time, driven by the manifest's task_profile

What does Prepare produce?

manifest.json plus train.jsonl and eval.jsonl

Why is a BrewSLM run more reproducible than your script?

The manifest pins exact rows, schema, scoring mode, and hashes

Track 3 · With BrewSLM · Lesson 4

Clean and prepare: the manifest is the source of truth

After this lesson you can run the optional clean stage, understand what Prepare produces, and explain why the manifest — not raw files — is the single source of truth for training and evaluation.

Level: intermediate · Read time: ~10 min · Prerequisites: Synthetic data and the review queue

In Track 2 you tokenized examples, built the loss mask, and split train/val by hand. BrewSLM folds all of that into two stages whose output is one authoritative artifact: the manifest.

05 Clean (optional)

If your rows are raw documents rather than tidy pairs, the clean stage processes them. It emits cleaned text, chunk JSONL, and per-document metadata: PII findings, a quality score, and a text hash. It runs as a background task that you poll by task_id, so a long cleaning pass never holds a request open past the proxy timeout. It also has explicit failure modes: cleaning_pii_block and cleaning_outlier_threshold_exceeded. For already-clean classification pairs like our sentiment data, you can skip it.
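The polling pattern can be sketched in a few lines. This is a minimal illustration, not BrewSLM's client: the get_status callable, the "state" and "error" keys, and the "running"/"done"/"failed" values are all assumptions made for the example. Only the two failure codes come from the lesson.

```python
import time

# The two failure codes are named in the lesson; everything else is hypothetical.
CLEAN_FAILURES = {"cleaning_pii_block", "cleaning_outlier_threshold_exceeded"}


class CleanFailed(RuntimeError):
    """Raised when the clean task ends in one of the explicit failure modes."""


def poll_clean_task(get_status, task_id, interval_s=2.0, max_polls=300, sleep=time.sleep):
    """Poll a background task until it finishes.

    get_status(task_id) is assumed to return a dict such as
    {"state": "running"}, {"state": "done", ...} or
    {"state": "failed", "error": "<code>"}.
    """
    for _ in range(max_polls):
        status = get_status(task_id)
        state = status.get("state")
        if state == "done":
            return status
        if state == "failed":
            code = status.get("error")
            if code in CLEAN_FAILURES:
                raise CleanFailed(code)
            raise RuntimeError(f"clean task {task_id} failed: {code}")
        # Each poll is a short request, so no single call outlives the proxy timeout.
        sleep(interval_s)
    raise TimeoutError(f"clean task {task_id} still running after {max_polls} polls")
```

Passing sleep as a parameter keeps the loop testable: a test can supply a no-op sleep and a scripted get_status instead of waiting on a real task.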

06 Prepare — build the manifest

Prepare takes your synthetic + cleaned rows and the project's dataset-adapter preset and produces:

prepared/
  manifest.json     # the source of truth: counts, schema, task_profile,
                    # output_schema.scoring_mode, paths, hashes
  train.jsonl       # the training split
  eval.jsonl        # the held-out evaluation split

That train.jsonl / eval.jsonl split is the platform doing your train_test_split from Track 2 — and the held-out eval.jsonl is what the eval pack will score against later.
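To make the shape of that output concrete, here is a rough sketch of a Prepare-like step. It is not BrewSLM's implementation: the manifest key layout, the hash-bucket split rule, and the 20% eval share are assumptions chosen for illustration, and the task_profile and scoring_mode values you pass in are placeholders.

```python
import hashlib
import json
from pathlib import Path


def to_jsonl(rows) -> bytes:
    """Serialize rows with sorted keys so identical rows give identical bytes."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in rows).encode("utf-8")


def is_eval(row, eval_percent=20) -> bool:
    """Bucket a row by its content hash so the split is stable across runs."""
    digest = hashlib.sha256(json.dumps(row, sort_keys=True).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100 < eval_percent


def prepare(rows, out_dir, task_profile, scoring_mode, eval_percent=20):
    """Write train.jsonl, eval.jsonl and manifest.json under out_dir (hypothetical layout)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    splits = {"train": [], "eval": []}
    for row in rows:
        splits["eval" if is_eval(row, eval_percent) else "train"].append(row)

    manifest = {
        "task_profile": task_profile,
        "output_schema": {"scoring_mode": scoring_mode},
        "counts": {},
        "paths": {},
        "hashes": {},
    }
    for name, split_rows in splits.items():
        data = to_jsonl(split_rows)
        (out / f"{name}.jsonl").write_bytes(data)
        manifest["counts"][name] = len(split_rows)
        manifest["paths"][name] = f"{name}.jsonl"
        manifest["hashes"][name] = hashlib.sha256(data).hexdigest()

    (out / "manifest.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True), encoding="utf-8"
    )
    return manifest
```

Splitting on a content hash rather than a shuffled index is one way to keep a row in the same split when rows are reordered or new rows are added. The real platform may use a different rule.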

From Track 2: where did tokenization and the loss mask go?

They didn't disappear — they moved into the task handler at train time. The manifest records the task_profile and scoring_mode; the handler for that task knows how to apply the chat template, build input_ids/labels, and mask the prompt — exactly the mechanics you wrote in lesson 2.4. You declared the what; the handler does the how.
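As a reminder of what the handler is doing on your behalf, here is a minimal sketch of prompt masking. The whitespace tokenizer and the TEMPLATE string are stand-ins invented for this example, not BrewSLM's handler or a real chat template. The -100 label value is the default ignore_index of PyTorch's CrossEntropyLoss, which is why it is the usual choice for masked positions.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

# Hypothetical prompt template for the sentiment task.
TEMPLATE = "Review: {text} Sentiment:"


def toy_tokenize(text, vocab):
    """Whitespace tokenizer that assigns ids on first sight; a stand-in for a real tokenizer."""
    return [vocab.setdefault(tok, len(vocab)) for tok in text.split()]


def build_example(text, label, vocab):
    """Build input_ids and labels, masking the prompt so only the answer is scored."""
    prompt_ids = toy_tokenize(TEMPLATE.format(text=text), vocab)
    completion_ids = toy_tokenize(label, vocab)
    return {
        "input_ids": prompt_ids + completion_ids,
        # Prompt positions get IGNORE_INDEX, so the loss skips them.
        "labels": [IGNORE_INDEX] * len(prompt_ids) + completion_ids,
    }
```

The two lists always have the same length, and the labels differ from the input_ids only on the prompt positions.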

The manifest is the contract

The single most important rule downstream: nothing reads from disk paths directly — everything reads the manifest. Train, Evaluate, and Export all consume manifest.json. That's why a BrewSLM run is reproducible in a way your script wasn't: the manifest pins exactly which rows, which schema, which scoring mode, and which hashes went into the run. Change the data and you get a new manifest, not a silently different result from the same code.
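One way to picture "everything reads the manifest" is a loader that resolves a split through manifest.json and refuses to proceed if the file no longer matches its recorded hash. This is a sketch under assumptions: the "paths" and "hashes" keys and the use of SHA-256 are a hypothetical manifest layout, not BrewSLM's documented schema.

```python
import hashlib
import json
from pathlib import Path


def load_split(manifest_path, split):
    """Load a split via the manifest, verifying its recorded hash first.

    Assumes a hypothetical manifest with "paths" and "hashes" maps keyed by
    split name, with paths relative to the manifest's own directory.
    """
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))

    data = (manifest_path.parent / manifest["paths"][split]).read_bytes()
    actual = hashlib.sha256(data).hexdigest()
    expected = manifest["hashes"][split]
    if actual != expected:
        raise ValueError(f"{split} split changed since Prepare: {actual} != {expected}")

    return [json.loads(line) for line in data.decode("utf-8").splitlines() if line]
```

With a check like this, editing train.jsonl in place fails loudly instead of quietly training on different data.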

Key idea

Prepare turns reviewed rows into a manifest plus train/eval splits. The manifest carries the task profile and scoring mode and is the source of truth for every later stage — so the tokenization and masking you did by hand become a declaration the handler executes, reproducibly.

Key terms

Clean stage: Optional processing of raw documents into cleaned text + chunks + PII/quality metadata.
manifest: prepared/manifest.json — the authoritative record of rows, schema, task_profile, scoring_mode, paths, and hashes.
train/eval split: train.jsonl and eval.jsonl produced by Prepare; the platform's equivalent of the train_test_split you ran by hand in Track 2.
task_profile: The declared task type recorded in the manifest, used to pick the handler.
scoring_mode: How outputs are scored, recorded in the manifest's output_schema.
