Creation paths for ML teams

Three ways to fine-tune a base model on custom data.

BrewSLM exposes the same small language model training workflow through a CLI, an HTTP/Python API, and a guided Wizard UI. Pick the surface that matches how your team works; the three are entry points to one feature set, not separate feature tiers.

At-a-glance

Fit by team shape

CLI

You already script your ML lifecycle. You want one command per stage, exit codes, and stdout you can grep.

First milestone: a scripted import + train + export job under CI.

HTTP / Python API

The SLM lifecycle is one piece of a larger backend. You want the same pipeline as a service call.

First milestone: a service-triggered import + run flow with task-id polling.

Wizard UI

You're new to fine-tuning, don't work in a shell, or just want to see what's happening at each step. You get an inline column rundown, ranked mapping hypotheses, bulk-drop for rejected rows, and an audit log.

First milestone: a complete import in three clicks, no flags.

CLI

Shell-first, scriptable, no hidden state

The subcommand surface

$ python -m app.cli.dataset_import sources
csv hf jsonl kaggle

$ python -m app.cli.dataset_import mappers
bio_to_spans → task_profile=structured_extraction
chat_messages_passthrough → task_profile=chat_sft
kv_to_structured → task_profile=structured_extraction
label_to_classification → task_profile=classification
preference_pair → task_profile=dpo
qa_pair_passthrough → task_profile=qa
rag_passthrough → task_profile=rag_qa
text_only → task_profile=language_modeling

$ python -m app.cli.dataset_import introspect \
    --locator hf:imdb:train

$ python -m app.cli.dataset_import run \
    --locator hf:imdb:train --project 1 --auto --limit 5000

Flag highlights

  • --auto picks the mapper from the introspector's top proposal (gated at 0.80 confidence).
  • --force overrides the confidence gate. Pairs with --auto.
  • --map K=V and --map-json '{…}' override the auto-suggested field map.
  • --drop REASON bulk-drops a rejection category (counts stay in the audit).
  • --limit N stops after N source rows.
  • --llm-assist (opt-in) lets the teacher model propose a mapping when confidence is low.
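
The interaction between --auto, --force, and the 0.80 gate can be sketched in a few lines of Python. This is an illustration only, not BrewSLM's implementation: the Proposal type and the pick_mapper function are hypothetical names, and the sketch assumes the gate is inclusive (a confidence of exactly 0.80 passes). Only the threshold and the mapper ids come from this page.

```python
from dataclasses import dataclass
from typing import List, Optional

CONFIDENCE_GATE = 0.80  # threshold quoted for --auto on this page


@dataclass
class Proposal:
    """Hypothetical shape of one ranked introspector proposal."""
    mapper_id: str
    confidence: float


def pick_mapper(proposals: List[Proposal], force: bool = False) -> Optional[str]:
    """Return the top proposal's mapper id, or None if the gate blocks it."""
    if not proposals:
        return None
    top = max(proposals, key=lambda p: p.confidence)
    # Assumption: the gate is inclusive; force mirrors the --force flag.
    if top.confidence >= CONFIDENCE_GATE or force:
        return top.mapper_id
    return None  # caller would choose a mapper explicitly or opt in to --llm-assist


proposals = [
    Proposal("label_to_classification", 0.72),
    Proposal("text_only", 0.41),
]
print(pick_mapper(proposals))              # gate blocks the pick: prints None
print(pick_mapper(proposals, force=True))  # prints label_to_classification
```

Returning None rather than raising keeps the decision with the caller, which matches the idea that a low-confidence proposal is a prompt for a human or an opt-in assist step, not an error.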

Every command is also exposed via HTTP. The CLI is a thin wrapper over the same service functions the API hits.
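
For a CI job, one way to drive the CLI is to assemble the argument list in Python and hand it to subprocess. The helper below is a sketch under stated assumptions: build_import_command and run_import are hypothetical names, and the sketch assumes --map and --drop may be repeated once per entry, which this page does not confirm. The flags themselves are the ones listed above.

```python
import subprocess
import sys
from typing import Dict, List, Optional, Sequence


def build_import_command(
    locator: str,
    project: int,
    auto: bool = True,
    force: bool = False,
    limit: Optional[int] = None,
    field_map: Optional[Dict[str, str]] = None,
    drop: Sequence[str] = (),
) -> List[str]:
    """Assemble argv for `dataset_import run` from the flags listed above."""
    cmd = [sys.executable, "-m", "app.cli.dataset_import", "run",
           "--locator", locator, "--project", str(project)]
    if auto:
        cmd.append("--auto")
    if force:
        cmd.append("--force")
    if limit is not None:
        cmd += ["--limit", str(limit)]
    for key, value in (field_map or {}).items():
        cmd += ["--map", f"{key}={value}"]  # assumption: --map is repeatable
    for reason in drop:
        cmd += ["--drop", reason]           # assumption: --drop is repeatable
    return cmd


def run_import(cmd: List[str]) -> None:
    """Run the import; a non-zero exit code raises and fails the CI step."""
    subprocess.run(cmd, check=True)


cmd = build_import_command("hf:imdb:train", project=1, limit=5000)
print(" ".join(cmd[1:]))
# prints: -m app.cli.dataset_import run --locator hf:imdb:train --project 1 --auto --limit 5000
```

Passing a list instead of a shell string avoids quoting problems when a locator or field name contains spaces or colons.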

HTTP / Python API

Same pipeline, callable from any service

Import: introspect → preview → run

POST /api/dataset-import/introspect
  { "locator": "hf:imdb:train", "sample_size": 20 }

POST /api/projects/1/dataset-import/preview
  { "locator": "hf:imdb:train",
    "mapper_id": "label_to_classification",
    "field_map": { "text_field": "text", "label_field": "label" },
    "sample_cap": 5 }

POST /api/projects/1/dataset-import/run
  { "locator": "hf:imdb:train",
    "mapper_id": "label_to_classification",
    "field_map": { "text_field": "text", "label_field": "label" },
    "drop_reasons": ["missing_text"] }
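
From a Python service, the same three calls might look like the sketch below, using only the standard library. The paths and request bodies mirror the examples above; import_requests and post_json are hypothetical helper names, the base URL assumes plain HTTP on the localhost:8000 host used elsewhere on this page, and the response shapes are not documented here, so the sketch does not interpret them.

```python
import json
import urllib.request
from typing import Any, Dict, List, Sequence, Tuple

BASE_URL = "http://localhost:8000"  # assumption: local dev server over plain HTTP


def import_requests(
    project_id: int,
    locator: str,
    mapper_id: str,
    field_map: Dict[str, str],
    drop_reasons: Sequence[str] = (),
) -> List[Tuple[str, Dict[str, Any]]]:
    """Build the (path, body) pairs for introspect, preview, and run, in order."""
    shared = {"locator": locator, "mapper_id": mapper_id, "field_map": field_map}
    project = f"/api/projects/{project_id}/dataset-import"
    return [
        ("/api/dataset-import/introspect", {"locator": locator, "sample_size": 20}),
        (f"{project}/preview", {**shared, "sample_cap": 5}),
        (f"{project}/run", {**shared, "drop_reasons": list(drop_reasons)}),
    ]


def post_json(path: str, body: Dict[str, Any]) -> Any:
    """POST a JSON body and decode the JSON reply, whatever shape it has."""
    req = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


steps = import_requests(
    1, "hf:imdb:train", "label_to_classification",
    {"text_field": "text", "label_field": "label"}, ["missing_text"],
)
for path, body in steps:
    print("POST", path, json.dumps(body))
    # post_json(path, body)  # uncomment when a server is running
```

Building the requests separately from sending them keeps the payloads easy to unit-test without a live server.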

Save once, re-run forever

# 1. Save the mapping after the first import lands.
POST /api/projects/1/dataset-import/configs
  { "name": "weekly-pii-refresh",
    "locator": "kaggle:competition:pii-detection-…",
    "mapper_id": "bio_to_spans",
    "field_map": {…} }

# 2. Re-run anytime against the (refreshed) source.
POST /api/projects/1/dataset-import/configs/12/run

# 3. Read the audit stream.
GET /api/projects/1/run-events?stage=ingestion
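
A scheduled job that re-runs a saved mapping only needs the project id and the config id. The sketch below composes the three paths shown above; config_paths is a hypothetical helper, and it assumes you already know the config id (12 in the example), since this page does not show how the save call returns it.

```python
from typing import Dict
from urllib.parse import urlencode


def config_paths(project_id: int, config_id: int, stage: str = "ingestion") -> Dict[str, str]:
    """Compose the save, re-run, and audit paths for one saved import config."""
    base = f"/api/projects/{project_id}"
    query = urlencode({"stage": stage})  # escapes the stage value safely
    return {
        "save": f"{base}/dataset-import/configs",                     # POST
        "rerun": f"{base}/dataset-import/configs/{config_id}/run",    # POST
        "audit": f"{base}/run-events?{query}",                        # GET
    }


paths = config_paths(project_id=1, config_id=12)
for name, path in paths.items():
    print(name, path)
```

Keeping the ids as parameters means the same job definition can be reused across projects without editing URLs by hand.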

Wizard UI

Three clicks. Same pipeline. No flags.

Step 1 — Source

  • Source dropdown (jsonl / csv / hf / kaggle).
  • Per-source helper text + auth banner for gated datasets.
  • Click Introspect.

Step 2 — Map

  • Column-signatures table (detected type + confidence per column).
  • Ranked-hypothesis dropdown pre-selected to the top proposal.
  • JSON field-map editor for overrides.
  • Low-confidence proposals show a red gate banner until you tick "proceed anyway."

Step 3 — Preview & Confirm

  • Summary cards (accepted, rejected, mapper, target profile).
  • Rejected rows grouped by reason; tick to bulk-drop.
  • "Save this mapping" persists the config for one-click re-runs.
  • Final Import commits the accepted rows and emits the audit RunEvent.
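
The bulk-drop step in Step 3 amounts to grouping rejected rows by reason and removing whole groups while keeping their counts for the audit. A minimal Python sketch of that idea follows. The row shape and the bulk_drop function are hypothetical, and "unknown_label" is an invented reason used only for illustration; "missing_text" is the reason used in the API example on this page.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def bulk_drop(
    rejected: Iterable[Dict[str, str]],
    drop_reasons: Set[str],
) -> Tuple[List[Dict[str, str]], Counter]:
    """Split rejected rows into those still needing review and per-reason counts."""
    audit_counts: Counter = Counter()
    remaining: List[Dict[str, str]] = []
    for row in rejected:
        audit_counts[row["reason"]] += 1  # every rejection is counted, dropped or not
        if row["reason"] not in drop_reasons:
            remaining.append(row)
    return remaining, audit_counts


rejected = [
    {"row": "17", "reason": "missing_text"},
    {"row": "42", "reason": "missing_text"},
    {"row": "58", "reason": "unknown_label"},  # hypothetical reason
]
remaining, counts = bulk_drop(rejected, {"missing_text"})
print(dict(counts))  # prints {'missing_text': 2, 'unknown_label': 1}
print(remaining)     # prints [{'row': '58', 'reason': 'unknown_label'}]
```

Counting before filtering is what lets dropped categories stay visible in the audit even though their rows never reach the dataset.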

Learn The Concept

The lessons behind the three surfaces

The CLI, API, and Wizard all wrap the same fine-tuning mechanics. These lessons explain what your team is actually deciding when it picks a dataset format, adapts a base model, or moves from a notebook workflow onto a platform.

Pick a door

All three lead to the same project

You can switch surfaces mid-project. A wizard import is indistinguishable from a CLI import in the audit log; a saved mapping created from the UI is callable from the API; everything composes.

# CLI
$ python -m app.cli.dataset_import run --locator hf:imdb:train --project 1 --auto

# API
$ curl -X POST localhost:8000/api/projects/1/dataset-import/run \
    -H 'Content-Type: application/json' \
    -d '{"locator":"hf:imdb:train","mapper_id":"label_to_classification","field_map":{}}'

# UI
# Pipeline → Data → "Import dataset (auto-mapping)" → Introspect → Map → Confirm