Capstone A: fine-tune SmolLM2 end-to-end, by hand
After this capstone you can run a complete fine-tune from scratch in one script, judge success against the base on a gold set, and know exactly which earlier lesson to revisit when something falls short.
This capstone assembles Lessons 2.1–2.8 into one script you can run top to bottom. Nothing here is new — it's the whole loop in one place, so you can see how few moving parts a real fine-tune has once you understand each one.
The whole pipeline, one script
import torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          TrainingArguments, Trainer, DataCollatorForSeq2Seq)
from peft import LoraConfig, get_peft_model, PeftModel
MODEL_ID, MAX_LEN = "HuggingFaceTB/SmolLM2-135M-Instruct", 256
device = "cuda" if torch.cuda.is_available() else "cpu"
# 1) data --------------------------------------------------------------
raw = [  # in practice: hundreds of clean, balanced, deduplicated rows
    {"prompt": "Classify the sentiment as positive or negative: I loved this movie.", "completion": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: A complete waste of time.", "completion": "negative"},
    # ...
]
ds = Dataset.from_list(raw).train_test_split(test_size=0.2, seed=42)
# 2) model + tokenizer -------------------------------------------------
tok = AutoTokenizer.from_pretrained(MODEL_ID)
if tok.pad_token is None:
    tok.pad_token = tok.eos_token  # reuse EOS for padding when no pad token is defined
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype=torch.bfloat16).to(device)
model.config.pad_token_id = tok.pad_token_id
# 3) tokenize with a loss mask ----------------------------------------
def tokenize(row):
    # Full conversation (prompt + answer) rendered with the model's chat template.
    full = tok.apply_chat_template(
        [{"role": "user", "content": row["prompt"]},
         {"role": "assistant", "content": row["completion"]}], tokenize=False)
    # Prompt alone, ending where the assistant's reply begins.
    prompt = tok.apply_chat_template(
        [{"role": "user", "content": row["prompt"]}], tokenize=False, add_generation_prompt=True)
    ids = tok(full, truncation=True, max_length=MAX_LEN, add_special_tokens=False)["input_ids"]
    p = min(len(tok(prompt, add_special_tokens=False)["input_ids"]), len(ids))  # prompt length, capped by truncation
    labels = [-100] * p + ids[p:]  # -100 hides the prompt tokens from the loss
    return {"input_ids": ids, "attention_mask": [1] * len(ids), "labels": labels}
train_tok = ds["train"].map(tokenize, remove_columns=ds["train"].column_names)
val_tok = ds["test"].map(tokenize, remove_columns=ds["test"].column_names)
# 4) LoRA + train ------------------------------------------------------
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))
model.print_trainable_parameters()
Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="sft-out", per_device_train_batch_size=8,
        gradient_accumulation_steps=2, learning_rate=2e-4, num_train_epochs=3,
        lr_scheduler_type="cosine", warmup_ratio=0.03, bf16=True,
        logging_steps=5, eval_strategy="epoch", save_strategy="epoch", report_to=[]),
    train_dataset=train_tok, eval_dataset=val_tok,
    data_collator=DataCollatorForSeq2Seq(tok, model=model), processing_class=tok,
).train()
model.save_pretrained("sft-out/adapter")
# 5) evaluate vs base on a gold set -----------------------------------
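@torch.no_grad()
def predict(m, prompt):
    text = tok.apply_chat_template(
        [{"role": "user", "content": prompt}], tokenize=False, add_generation_prompt=True)
    # The template already contains the special tokens, so don't add them again (matches step 3).
    ids = tok(text, return_tensors="pt", add_special_tokens=False).to(m.device)
    out = m.generate(**ids, max_new_tokens=8, do_sample=False, pad_token_id=tok.pad_token_id)
    new_tokens = out[0][ids["input_ids"].shape[1]:]  # keep only the generated continuation
    return tok.decode(new_tokens, skip_special_tokens=True).strip().lower()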
gold = [{"prompt":"Classify the sentiment as positive or negative: This made my week.","label":"positive"}] # + more
acc = lambda m: sum(predict(m, g["prompt"]).startswith(g["label"]) for g in gold) / len(gold)
base = AutoModelForCausalLM.from_pretrained(MODEL_ID, dtype=torch.bfloat16).to(device).eval()
# Score the base BEFORE attaching the adapter: PeftModel.from_pretrained injects the
# LoRA layers into `base` in place, so scoring it afterwards would measure the tuned model.
base_acc = acc(base)
tuned = PeftModel.from_pretrained(base, "sft-out/adapter").eval()
print("base:", base_acc, " tuned:", acc(tuned))
# 6) ship --------------------------------------------------------------
tuned.merge_and_unload().save_pretrained("sft-out/merged")  # fold the LoRA weights into the base
tok.save_pretrained("sft-out/merged")
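Step 3's loss mask is the easiest part of the script to get subtly wrong, so it can help to see it in isolation. The sketch below reproduces only the masking arithmetic on made-up token ids; `mask_prompt` and `IGNORE_INDEX` are illustrative names, not part of the script or of any library.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_prompt(ids, prompt_len):
    """Return labels that copy `ids` but hide the first `prompt_len` positions."""
    cut = min(prompt_len, len(ids))  # truncation can leave fewer ids than prompt tokens
    return [IGNORE_INDEX] * cut + ids[cut:]


# Hypothetical ids: four prompt tokens followed by two completion tokens.
ids = [11, 12, 13, 14, 91, 92]
labels = mask_prompt(ids, prompt_len=4)
print(labels)  # [-100, -100, -100, -100, 91, 92]
```

Only the two completion tokens contribute to the loss, which is what teaches the model to emit the label rather than to repeat the prompt.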
Success criteria
You've succeeded when, on your held-out gold set, the tuned model clearly beats the base and produces the format you want (here: just the label, no rambling). If you set a gate in advance (Track 1) — say "≥ 90% gold accuracy" — the run passes only when it clears it.
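One way to turn these criteria into a pass/fail check is sketched below. It assumes you have already collected each model's outputs on the gold set as plain strings; the helper names, the made-up predictions, and the 0.90 threshold (taken from the example gate above) are all illustrative. Strict matching requires the output to be exactly the label, so it checks format as well as correctness.

```python
def accuracy(preds, labels, strict=True):
    """Fraction of predictions that match the label.

    strict=True requires the output to be exactly the label (a format check);
    strict=False accepts any output that merely starts with the label.
    """
    hits = 0
    for pred, label in zip(preds, labels):
        pred = pred.strip().lower()
        hits += (pred == label) if strict else pred.startswith(label)
    return hits / len(labels)


def passes_gate(base_preds, tuned_preds, labels, threshold=0.90):
    """True only if the tuned model beats the base AND clears the threshold."""
    base_acc = accuracy(base_preds, labels)
    tuned_acc = accuracy(tuned_preds, labels)
    return tuned_acc > base_acc and tuned_acc >= threshold


# Made-up outputs for illustration only.
labels = ["positive", "negative", "positive", "negative"]
base_preds = ["positive because it is upbeat", "negative",
              "the sentiment is positive", "negative."]
tuned_preds = ["positive", "negative", "positive", "negative"]

print(accuracy(base_preds, labels), accuracy(tuned_preds, labels))  # 0.25 1.0
print(passes_gate(base_preds, tuned_preds, labels))                 # True
```

Fixing the threshold before the run, rather than after seeing the numbers, is what makes it a gate and not a rationalization.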
If it falls short — where to look
- No lift over base → data problem: too few examples, wrong labels, or unrepresentative data (Lessons 1.7–1.8).
- Rambling or wrong format → check the chat template and loss mask (Lessons 1.3–1.4, 2.4).
- Eval loss rose → overfitting: train for fewer epochs or add more data (Lesson 1.17).
- NaN loss or no learning → check the learning rate (Lesson 1.11).
Every failure maps to a lesson.
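The two loss-based symptoms can be read straight off the training log. The sketch below is a rough heuristic, not a definitive diagnostic: it assumes a log shaped like the Trainer's `trainer.state.log_history` (a list of dicts with `loss` or `eval_loss` keys), which you would get by binding the `Trainer(...)` in the script to a variable first. The function name and the sample history are made up.

```python
import math


def diagnose(log_history):
    """Map a loss history to a symptom: 'nan', 'no_learning', 'overfitting', or 'ok'."""
    train = [e["loss"] for e in log_history if "loss" in e]
    evals = [e["eval_loss"] for e in log_history if "eval_loss" in e]
    if any(math.isnan(x) for x in train + evals):
        return "nan"            # check the learning rate
    if len(train) >= 2 and train[-1] >= train[0]:
        return "no_learning"    # check the learning rate
    if len(evals) >= 2 and evals[-1] > min(evals):
        return "overfitting"    # fewer epochs or more data
    return "ok"


# Hypothetical three-epoch run: train loss keeps falling, eval loss turns upward.
history = [
    {"loss": 2.1}, {"loss": 1.2}, {"eval_loss": 1.3},
    {"loss": 0.6}, {"eval_loss": 1.1},
    {"loss": 0.2}, {"eval_loss": 1.4},
]
print(diagnose(history))  # overfitting
```

The other two symptoms (no lift over base, wrong format) do not show up in the loss curve at all; they only appear when you score generated outputs against the gold set.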
You can fine-tune a model
That's the whole thing. You loaded a base model, taught it a task from your own data, proved the lift on a gold set, and shipped an artifact — entirely by hand, with no platform and nothing hidden. Everything BrewSLM does, you can now do yourself; you understand each step well enough to debug it.
So why use a platform at all? Because doing this once by hand is illuminating; doing it fifty times — with imports, gates, eval packs, deployment, and monitoring — is exactly the toil worth automating. Track 3 runs this same pipeline through BrewSLM and shows what the platform buys you, now that you know precisely what it's doing underneath.
Key terms
- end-to-end pipeline: the full run (data → model → tokenize → train → evaluate → ship) in one script.
- success criteria: a clear bar, namely beating the base on the gold set and producing the right format.
- quality gate: a pass threshold, set in advance, that the run must clear (Track 1).
- iteration: mapping each failure mode back to the lesson that fixes it, then re-running.