What is the success criterion for this capstone?

The tuned model clearly beats the base on the held-out gold set, in the right format

The tuned model shows no lift over the base. Where do you look first?

The data: too few, wrongly labeled, or unrepresentative examples

The model rambles instead of emitting just the label. Likely cause?

Chat template / loss-mask issue

Why move on to a platform (Track 3) after doing this by hand?

Doing the loop many times — with imports, gates, eval, deployment — is toil worth automating

Track 2 · Hands-on · Lesson 9

Capstone A: fine-tune SmolLM2 end-to-end, by hand

After this capstone you can run a complete fine-tune from scratch in one script, judge success against the base on a gold set, and know exactly which earlier lesson to revisit when something falls short.

Level: intermediate · Read time: ~10 min · Prerequisites: Merge the adapter, run inference, ship an artifact

This capstone assembles Lessons 2.1–2.8 into one script you can run top to bottom. Nothing here is new — it's the whole loop in one place, so you can see how few moving parts a real fine-tune has once you understand each one.

The whole pipeline, one script

import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          TrainingArguments, Trainer, DataCollatorForSeq2Seq)
from peft import LoraConfig, get_peft_model, PeftModel

MODEL_ID, MAX_LEN = "HuggingFaceTB/SmolLM2-135M-Instruct", 256
device = "cuda" if torch.cuda.is_available() else "cpu"

# 1) data --------------------------------------------------------------
raw = [  # in practice: hundreds of clean, balanced, deduplicated rows
    {"prompt": "Classify the sentiment as positive or negative: I loved this movie.", "completion": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: A complete waste of time.", "completion": "negative"},
    # ...
]
ds = Dataset.from_list(raw).train_test_split(test_size=0.2, seed=42)

# 2) model + tokenizer -------------------------------------------------
tok = AutoTokenizer.from_pretrained(MODEL_ID)
if tok.pad_token is None: tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype=torch.bfloat16).to(device)
model.config.pad_token_id = tok.pad_token_id

# 3) tokenize with a loss mask ----------------------------------------
def tokenize(row):
    full = tok.apply_chat_template(
        [{"role":"user","content":row["prompt"]},{"role":"assistant","content":row["completion"]}], tokenize=False)
    prompt = tok.apply_chat_template(
        [{"role":"user","content":row["prompt"]}], tokenize=False, add_generation_prompt=True)
    ids = tok(full, truncation=True, max_length=MAX_LEN, add_special_tokens=False)["input_ids"]
    p = len(tok(prompt, add_special_tokens=False)["input_ids"])
    labels = [-100]*min(p, len(ids)) + ids[min(p, len(ids)):]
    return {"input_ids": ids, "attention_mask":[1]*len(ids), "labels": labels}
train_tok = ds["train"].map(tokenize, remove_columns=ds["train"].column_names)
val_tok   = ds["test"].map(tokenize, remove_columns=ds["test"].column_names)

# 4) LoRA + train ------------------------------------------------------
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj","v_proj"], task_type="CAUSAL_LM"))
model.print_trainable_parameters()
Trainer(
    model=model,
    args=TrainingArguments(output_dir="sft-out", per_device_train_batch_size=8,
        gradient_accumulation_steps=2, learning_rate=2e-4, num_train_epochs=3,
        lr_scheduler_type="cosine", warmup_ratio=0.03, bf16=True,
        logging_steps=5, eval_strategy="epoch", save_strategy="epoch", report_to=[]),
    train_dataset=train_tok, eval_dataset=val_tok,
    data_collator=DataCollatorForSeq2Seq(tok, model=model), processing_class=tok,
).train()
model.save_pretrained("sft-out/adapter")

# 5) evaluate vs base on a gold set -----------------------------------
@torch.no_grad()
def predict(m, prompt):
    text = tok.apply_chat_template([{"role":"user","content":prompt}], tokenize=False, add_generation_prompt=True)
    ids = tok(text, return_tensors="pt", add_special_tokens=False).to(m.device)  # template already added special tokens, as in training
    return tok.decode(m.generate(**ids, max_new_tokens=8, do_sample=False, pad_token_id=tok.pad_token_id)[0][ids["input_ids"].shape[1]:], skip_special_tokens=True).strip().lower()

gold = [{"prompt":"Classify the sentiment as positive or negative: This made my week.","label":"positive"}]  # + more
base = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype=torch.bfloat16).to(device).eval()
acc = lambda m: sum(predict(m, g["prompt"]).startswith(g["label"]) for g in gold) / len(gold)
base_acc = acc(base)  # score the base first: PeftModel.from_pretrained injects LoRA layers into `base` in place
tuned = PeftModel.from_pretrained(base, "sft-out/adapter").eval()
print("base:", base_acc, " tuned:", acc(tuned))

# 6) ship --------------------------------------------------------------
tuned.merge_and_unload().save_pretrained("sft-out/merged")  # fold the LoRA weights into the base
tok.save_pretrained("sft-out/merged")                       # keep the tokenizer next to the weights
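
The loss mask in step 3 is the part of the script most worth checking by eye before training. Here is a minimal sketch that uses invented token ids instead of a real tokenizer. `mask_prompt` and `check_mask` are hypothetical helpers, not part of any library: the first mirrors the `labels = ...` line in `tokenize`, and the second asserts the properties that line is meant to guarantee.

```python
IGNORE = -100  # label value that the Hugging Face loss skips

def mask_prompt(ids, prompt_len):
    """Mirror of the script's label line: hide the prompt, keep the completion."""
    p = min(prompt_len, len(ids))  # truncation can leave fewer ids than prompt tokens
    return [IGNORE] * p + ids[p:]

def check_mask(ids, labels, prompt_len):
    """Return the number of supervised tokens; raise if the mask is malformed."""
    assert len(labels) == len(ids), "labels must align with input_ids"
    p = min(prompt_len, len(ids))
    assert all(l == IGNORE for l in labels[:p]), "prompt tokens must be masked"
    assert labels[p:] == ids[p:], "completion tokens must be supervised as-is"
    return len(ids) - p

# Toy example: 5 prompt tokens followed by 2 completion tokens (ids are invented).
ids = [11, 12, 13, 14, 15, 21, 22]
labels = mask_prompt(ids, prompt_len=5)
supervised = check_mask(ids, labels, prompt_len=5)
print(labels, supervised)
```

A row whose supervised count is 0 (the prompt alone filled `MAX_LEN`) contributes nothing to the loss, so it is worth filtering out before training.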

Success criteria

You've succeeded when, on your held-out gold set, the tuned model clearly beats the base and produces the format you want (here: just the label, no rambling). If you set a gate in advance (Track 1) — say "≥ 90% gold accuracy" — the run passes only when it clears it.
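
The two halves of that criterion can be scored separately. The sketch below is one possible way to do it, not the lesson's own method: `score` and `passes` are hypothetical helpers, the predictions are invented for illustration, and the `min_fmt=1.0` default is an assumption (only the 90% gate comes from the text above). Accuracy uses the same `startswith` test as the script, which forgives rambling, so a second number counts how often the output is exactly one of the labels.

```python
LABELS = {"positive", "negative"}  # the two labels used in this capstone

def score(preds, golds):
    """Return (accuracy, format_rate) for parallel lists of strings."""
    assert golds and len(preds) == len(golds), "need one prediction per gold row"
    cleaned = [p.strip().lower() for p in preds]
    correct = sum(p.startswith(g) for p, g in zip(cleaned, golds))  # lenient match
    exact = sum(p in LABELS for p in cleaned)                       # label and nothing else
    return correct / len(golds), exact / len(golds)

def passes(base_acc, tuned_acc, tuned_fmt, gate=0.90, min_fmt=1.0):
    """Pass only if tuned beats base, clears the gate, and keeps the format."""
    return tuned_acc > base_acc and tuned_acc >= gate and tuned_fmt >= min_fmt

# Invented predictions, for illustration only.
golds = ["positive", "negative", "positive", "negative"]
base_preds = ["positive, because the reviewer", "positive", "i think it is negative", "negative"]
tuned_preds = ["positive", "negative", "positive", "negative"]

base_acc, base_fmt = score(base_preds, golds)
tuned_acc, tuned_fmt = score(tuned_preds, golds)
print(base_acc, base_fmt, tuned_acc, tuned_fmt, passes(base_acc, tuned_acc, tuned_fmt))
```

With only a handful of gold rows, a strict `>` is a weak reading of "clearly beats"; a larger gold set or a minimum margin would make the comparison more trustworthy.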

If it falls short — where to look

No lift over base → data problem: too few examples, wrong labels, or unrepresentative examples (Lessons 1.7–1.8).
Rambling / wrong format → check the chat template and loss mask (Lessons 1.3–1.4, 2.4).
Eval loss rose → overfitting: use fewer epochs or more data (Lesson 1.17).
NaN / no learning → revisit the learning rate (Lesson 1.11).

Every failure maps to a lesson.
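
The symptom-to-lesson mapping above can be written down as a small lookup. This is a rough sketch under stated assumptions: `triage` is a hypothetical helper, the loss lists stand in for values you would copy from the training log, and the rules (for example, "last eval loss is above its minimum" as the sign of overfitting) are illustrative simplifications rather than fixed thresholds.

```python
import math

def triage(base_acc, tuned_acc, format_rate, eval_losses, train_losses):
    """Return one hint per symptom found; numbers in parentheses are lesson references."""
    hints = []
    if any(math.isnan(x) for x in train_losses):
        hints.append("NaN loss: revisit the learning rate (1.11)")
    elif len(train_losses) >= 2 and train_losses[-1] >= train_losses[0]:
        hints.append("no learning: revisit the learning rate (1.11)")
    if tuned_acc <= base_acc:
        hints.append("no lift over base: inspect the data (1.7-1.8)")
    if format_rate < 1.0:
        hints.append("wrong format: check chat template and loss mask (1.3-1.4, 2.4)")
    if len(eval_losses) >= 2 and eval_losses[-1] > min(eval_losses):
        hints.append("eval loss rose: fewer epochs or more data (1.17)")
    return hints

# Invented numbers: training loss falls, eval loss turns upward, no accuracy gain.
problems = triage(0.5, 0.5, 1.0, eval_losses=[0.9, 0.8, 0.85], train_losses=[1.2, 0.7, 0.4])
healthy = triage(0.5, 0.95, 1.0, eval_losses=[0.9, 0.7, 0.6], train_losses=[1.2, 0.7, 0.4])
print(problems)
print(healthy)
```

Several hints can fire at once; fixing the data first is usually the sensible order, since the other symptoms are hard to interpret on a flawed dataset.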

You can fine-tune a model

That's the whole thing. You loaded a base model, taught it a task from your own data, proved the lift on a gold set, and shipped an artifact — entirely by hand, with no platform and nothing hidden. Everything BrewSLM does, you can now do yourself; you understand each step well enough to debug it.

So why use a platform at all? Because doing this once by hand is illuminating; doing it fifty times — with imports, gates, eval packs, deployment, and monitoring — is exactly the toil worth automating. Track 3 runs this same pipeline through BrewSLM and shows what the platform buys you, now that you know precisely what it's doing underneath.

Key terms

end-to-end pipeline: The full run: data → model → tokenize → train → evaluate → ship, in one script.
success criteria: A clear bar — beat the base on the gold set and produce the right format.
quality gate: A pass threshold set in advance the run must clear (Track 1).
iteration: Mapping each failure mode back to the lesson that fixes it, then re-running.

Check yourself
