How do you run a LoRA fine-tuned model for inference?

Load the base model and attach the adapter with PeftModel.from_pretrained

Why use do_sample=False (greedy) for evaluation?

Deterministic, reproducible results

Why compute accuracy on the BASE model too?

To get the baseline so you can report the lift fine-tuning bought

What should you do with the gold-set misses?

Use them to decide which training data to add or fix next

Track 2 · Hands-on · Lesson 7

Evaluate by hand: run the gold set, compute the metric

After this lesson you can load your fine-tuned adapter, generate predictions on the gold set, compute a metric, and compare the fine-tuned model against the base baseline to prove the lift.

Level: intermediate
Read time: ~10 min
Prerequisites: Run it: read the logs, the loss, the checkpoints

Training produced an adapter. Now we answer the only question that matters: is it better than the base model on our task? We'll load the adapter, run it over the gold set, compute accuracy, and — crucially — compare to the untuned baseline.

Load the base + adapter

LoRA keeps the base frozen, so inference is "base model + adapter." PeftModel.from_pretrained attaches your trained adapter to a fresh base:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

tok = AutoTokenizer.from_pretrained("sft-out/adapter")
base = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16).to(device)
tuned = PeftModel.from_pretrained(base, "sft-out/adapter").eval()

A prediction helper

We use greedy decoding (do_sample=False) so results are deterministic and reproducible — exactly what Track 1 recommended for evaluation. The completion is short (a label), so a handful of new tokens is enough.

@torch.no_grad()
def predict(model, prompt):
    msgs = [{"role": "user", "content": prompt}]
    text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
    ids = tok(text, return_tensors="pt").to(model.device)
    out = model.generate(**ids, max_new_tokens=8, do_sample=False,
                         pad_token_id=tok.pad_token_id)
    return tok.decode(out[0][ids["input_ids"].shape[1]:], skip_special_tokens=True).strip().lower()

Score the gold set

gold = [
    {"prompt": "Classify the sentiment as positive or negative: This made my whole week.", "label": "positive"},
    {"prompt": "Classify the sentiment as positive or negative: Never buying from them again.", "label": "negative"},
    # ... your held-out, hand-verified examples
]

def accuracy(model):
    hits = sum(predict(model, ex["prompt"]).startswith(ex["label"]) for ex in gold)
    return hits / len(gold)

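# PeftModel.from_pretrained injects the LoRA layers into `base` in place, so
# accuracy(base) would score the adapter too. Switch the adapter off to get
# the true untuned baseline.
with tuned.disable_adapter():
    print("base  accuracy:", accuracy(tuned))   # the baseline
print("tuned accuracy:", accuracy(tuned))       # after fine-tuning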
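To make the metric concrete, here is a minimal sketch that scores a list of already-generated predictions, using the same prefix match as above. The helper name `accuracy_from_predictions` and the sample outputs are made up for illustration; no model is involved.

```python
def accuracy_from_predictions(predictions, labels):
    """Fraction of predictions that start with the gold label."""
    if not labels:
        raise ValueError("gold set is empty")
    hits = sum(pred.startswith(label) for pred, label in zip(predictions, labels))
    return hits / len(labels)


# Made-up model outputs, for illustration only.
preds = ["positive", "negative.", "positive", "i think negative"]
labels = ["positive", "negative", "negative", "negative"]

score = accuracy_from_predictions(preds, labels)
print(score)  # 0.5
```

Note that the prefix match accepts "negative." but rejects "i think negative", so a model that wraps its answer in extra words is counted as wrong even when the label is present.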

Always report the lift

A single accuracy number means little on its own. The difference between the tuned model and the base (the lift) is what your fine-tune actually bought. If tuned ≈ base, fine-tuning didn't help; revisit the data (Track 1) before anything else.
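As a rough sketch of the arithmetic, the lift can be reported both in absolute points and relative to the baseline. The helper name `report_lift` and the accuracies below are hypothetical; they are not results from this lesson.

```python
def report_lift(base_acc: float, tuned_acc: float) -> dict:
    """Return the absolute and relative lift of the tuned model over the base."""
    absolute = tuned_acc - base_acc
    # Relative lift is undefined when the baseline scores zero.
    relative = absolute / base_acc if base_acc > 0 else None
    return {"absolute": absolute, "relative": relative}


# Hypothetical accuracies, for illustration only.
result = report_lift(base_acc=0.5, tuned_acc=0.75)
print(result)  # {'absolute': 0.25, 'relative': 0.5}
```

Reporting both numbers helps because a jump of 0.25 means something different from a 0.5 baseline than from a 0.7 baseline.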

Read the failures, not just the score

Honest evaluation (Track 1) means looking at what the model gets wrong, not just the headline number. Print the misses:

for ex in gold:
    pred = predict(tuned, ex["prompt"])
    if not pred.startswith(ex["label"]):
        print(f"WRONG  pred={pred!r}  gold={ex['label']!r}  :: {ex['prompt']}")

Those misses are your to-do list: they tell you which kinds of examples to add or fix in the training data for the next iteration. That data-centric loop of evaluating, reading the failures, and fixing the data is the core of improving a fine-tune. Once you're satisfied with the lift, the last step is packaging the model for use: merging the adapter and shipping an artifact.
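One way to turn the misses into a to-do list is to tally them by gold label, which shows the class the model struggles with most. This is a minimal sketch under assumptions: the helper name `tally_misses` is hypothetical, and the examples and predictions are made up rather than taken from a real run.

```python
from collections import Counter


def tally_misses(examples, predictions):
    """Count wrong predictions per gold label.

    `examples` are dicts with a "label" key (same shape as the gold set);
    `predictions` are the decoded model outputs, in the same order.
    """
    misses = Counter()
    for ex, pred in zip(examples, predictions):
        if not pred.startswith(ex["label"]):
            misses[ex["label"]] += 1
    return misses


# Made-up examples and predictions, for illustration only.
examples = [
    {"prompt": "A", "label": "positive"},
    {"prompt": "B", "label": "negative"},
    {"prompt": "C", "label": "negative"},
]
predictions = ["positive", "positive", "neutral"]

miss_counts = tally_misses(examples, predictions)
print(miss_counts.most_common())  # [('negative', 2)]
```

Here both misses fall on the negative class, which would suggest adding or fixing negative examples first.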

Key terms

PeftModel.from_pretrained: Attaches a trained LoRA adapter onto a fresh base model for inference.
greedy decoding: do_sample=False — deterministic generation, used for reproducible evaluation.
accuracy: Fraction of gold examples predicted correctly (a classification metric).
baseline comparison: Scoring the untuned base on the same gold set to measure the fine-tune's lift.
failure analysis: Reading the wrong predictions to decide what data to fix next.
