Evaluate by hand: run the gold set, compute the metric
After this lesson you can load your fine-tuned adapter, generate predictions on the gold set, compute a metric, and compare the fine-tuned model against the base baseline to prove the lift.
Training produced an adapter. Now we answer the only question that matters: is it better than the base model on our task? We'll load the adapter, run it over the gold set, compute accuracy, and — crucially — compare to the untuned baseline.
Load the base + adapter
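LoRA keeps the base weights frozen, so inference is "base model + adapter." PeftModel.from_pretrained loads your trained adapter onto a freshly loaded base. Note that it injects the adapter layers into base in place, so after this call base itself also runs with the adapter active: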
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("sft-out/adapter")
base = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16).to(device)
tuned = PeftModel.from_pretrained(base, "sft-out/adapter").eval()
A prediction helper
We use greedy decoding (do_sample=False) so results are deterministic and reproducible — exactly what Track 1 recommended for evaluation. The completion is short (a label), so a handful of new tokens is enough.
@torch.no_grad()
def predict(model, prompt):
    # Format the prompt with the model's chat template, ending at the assistant turn.
    msgs = [{"role": "user", "content": prompt}]
    text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    ids = tok(text, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=8, do_sample=False,
                         pad_token_id=tok.pad_token_id)
    # Decode only the newly generated tokens, not the echoed prompt.
    new_tokens = out[0][ids["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True).strip().lower()
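The helper returns raw decoded text, and a small model may wrap the label in punctuation or extra words. Below is a minimal sketch of a stricter normalizer, assuming a closed label set. `LABELS` and `normalize_label` are hypothetical names for illustration, not part of the lesson's code:

```python
import re

# Hypothetical label set for the sentiment task used in this lesson.
LABELS = ("positive", "negative")

def normalize_label(text, labels=LABELS):
    """Map raw generated text to one of `labels`, or None if no label leads the text."""
    # Lowercase, then replace anything that is not a letter or whitespace with a space.
    cleaned = re.sub(r"[^a-z\s]", " ", text.lower())
    words = cleaned.split()
    # Accept only an exact label as the first word (stricter than a prefix match).
    if words and words[0] in labels:
        return words[0]
    return None

print(normalize_label("Positive."))         # positive
print(normalize_label(" negative\n"))       # negative
print(normalize_label("I think positive"))  # None
```

Returning None for anything outside the label set keeps malformed generations visible as their own failure type instead of silently counting them as a wrong label.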
Score the gold set
gold = [
    {"prompt": "Classify the sentiment as positive or negative: This made my whole week.", "label": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: Never buying from them again.", "label": "negative"},
    # ... your held-out, hand-verified examples
]
def accuracy(model):
    hits = sum(predict(model, ex["prompt"]).startswith(ex["label"]) for ex in gold)
    return hits / len(gold)
# PeftModel.from_pretrained injected the adapter into `base` in place, so calling
# accuracy(base) here would score the tuned weights. Disable the adapter instead.
with tuned.disable_adapter():
    print("base accuracy:", accuracy(tuned))   # the baseline
print("tuned accuracy:", accuracy(tuned))      # after fine-tuning
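Overall accuracy can hide a model that always answers with one label. A minimal sketch of a per-label breakdown follows, assuming the predictions have already been collected as strings. `per_label_accuracy` and the sample lists are hypothetical illustrations, not results from this lesson:

```python
from collections import defaultdict

def per_label_accuracy(preds, labels):
    """Return {label: fraction of examples with that gold label predicted correctly}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for pred, label in zip(preds, labels):
        totals[label] += 1
        # Same matching rule as the lesson's accuracy(): prefix match on the label.
        if pred.startswith(label):
            hits[label] += 1
    return {label: hits[label] / totals[label] for label in totals}

# Made-up predictions, for illustration only.
labels = ["positive", "negative", "negative", "positive"]
preds = ["positive", "positive", "negative", "positive"]
print(per_label_accuracy(preds, labels))  # {'positive': 1.0, 'negative': 0.5}
```

If one label scores far below the other, that is usually a hint about which class needs more or cleaner training examples.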
Always report the lift
A single accuracy number means little. The difference between the tuned model and the base — the lift — is what your fine-tune actually bought. If tuned ≈ base, fine-tuning didn't help; revisit the data (Track 1) before anything else.
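The lift is a subtraction, but it also helps to count which examples changed between the two models. Here is a sketch, assuming you stored per-example correctness for both models as booleans in the same gold-set order. `lift_report` and the data are hypothetical, made up for illustration:

```python
def lift_report(base_correct, tuned_correct):
    """Compare per-example correctness (lists of bools) for base vs. tuned."""
    if len(base_correct) != len(tuned_correct):
        raise ValueError("score both models on the same gold set")
    n = len(base_correct)
    base_acc = sum(base_correct) / n
    tuned_acc = sum(tuned_correct) / n
    # Examples the fine-tune fixed, and examples it broke.
    fixed = sum((not b) and t for b, t in zip(base_correct, tuned_correct))
    broken = sum(b and (not t) for b, t in zip(base_correct, tuned_correct))
    return {"base": base_acc, "tuned": tuned_acc,
            "lift": round(tuned_acc - base_acc, 4),
            "fixed": fixed, "broken": broken}

# Made-up correctness flags, for illustration only.
base_correct = [True, False, False, True, False]
tuned_correct = [True, True, False, False, True]
report = lift_report(base_correct, tuned_correct)
print(report)
```

With these flags the lift equals (fixed - broken) / n, here (2 - 1) / 5 = 0.2. A positive lift that comes with a nonzero broken count is still worth a look, because the fine-tune regressed on examples the base handled.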
Read the failures, not just the score
Honest evaluation (Track 1) means looking at what the model gets wrong, not just the headline number. Print the misses:
for ex in gold:
    pred = predict(tuned, ex["prompt"])
    if not pred.startswith(ex["label"]):
        print(f"WRONG pred={pred!r} gold={ex['label']!r} :: {ex['prompt']}")
Those misses are your to-do list: they tell you which kinds of examples to add or fix in the training data for the next iteration. That data-centric loop is the heart of getting good. Once you're satisfied with the lift, the last step is packaging the model for use — merging the adapter and shipping an artifact.
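One way to turn the misses into a to-do list is to group them by (gold, predicted) pair, so the most common confusion surfaces first. This is a sketch, assuming the misses were collected as dicts. `group_failures`, the field names, and the sample rows are hypothetical:

```python
from collections import Counter

def group_failures(misses):
    """Count misses by (gold label, predicted text), most frequent first."""
    counts = Counter((m["gold"], m["pred"]) for m in misses)
    return counts.most_common()

# Made-up misses, for illustration only.
misses = [
    {"gold": "negative", "pred": "positive", "prompt": "made-up example A"},
    {"gold": "negative", "pred": "positive", "prompt": "made-up example B"},
    {"gold": "positive", "pred": "neutral", "prompt": "made-up example C"},
]
for (gold_label, pred), count in group_failures(misses):
    print(f"{count}x gold={gold_label!r} pred={pred!r}")
```

A pair that dominates the list points at a specific kind of training example to add or fix, while a predicted value outside the label set (such as 'neutral' here) suggests a formatting problem instead.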
Key terms
- PeftModel.from_pretrained: Attaches a trained LoRA adapter onto a fresh base model for inference.
- greedy decoding: do_sample=False; deterministic generation, used for reproducible evaluation.
- accuracy: Fraction of gold examples predicted correctly (a classification metric).
- baseline comparison: Scoring the untuned base on the same gold set to measure the fine-tune's lift.
- failure analysis: Reading the wrong predictions to decide what data to fix next.