Track 2 · Hands-on · Lesson 8

Merge the adapter, run inference, ship an artifact

After this lesson you can merge a LoRA adapter into the base, save and reload a standalone fine-tuned model, run inference on it, and know what export formats exist for deployment.

Level: intermediate · Read time: ~8 min · Prerequisites: Evaluate by hand: run the gold set, compute the metric

You have a fine-tuned adapter that beats the baseline. To use it in production you have two choices, both covered in Track 1: keep the base and adapter separate (handy for serving many tasks off one base), or merge the adapter into the base weights to get a single standalone model. This lesson does the merge and ships the result as an artifact.

Merge the adapter into the base

merge_and_unload folds the LoRA adapter's learned change directly into the base weights and returns an ordinary model with no adapter and no runtime overhead.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tok = AutoTokenizer.from_pretrained("sft-out/adapter")   # assumes the tokenizer was saved with the adapter
base = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)   # older transformers releases name this argument torch_dtype
tuned = PeftModel.from_pretrained(base, "sft-out/adapter")   # base model wrapped with the LoRA weights

merged = tuned.merge_and_unload()      # the plain transformers model, no PEFT wrapper
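To see why the merged model behaves the same as base + adapter, here is a minimal pure-Python sketch of the arithmetic on a single toy weight matrix. It assumes the usual LoRA formulation, where the update is B·A scaled by alpha / r; the matrices, the input, and the hyperparameter values are made up for illustration and are not taken from the trained adapter.

```python
# Pure-Python sketch of the arithmetic behind a LoRA merge (toy sizes, made-up numbers).

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add_scaled(W, D, scale):
    """Return W + scale * D, element by element."""
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, D)]

# Frozen base weight W (2x3) and a rank-1 adapter: B is 2x1, A is 1x3.
W = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 0.0]]
B = [[0.5], [1.0]]
A = [[1.0, 2.0, 0.0]]
alpha, r = 2, 1              # hypothetical LoRA hyperparameters
scale = alpha / r

x = [[1.0], [1.0], [1.0]]    # one input column vector

# Unmerged: the adapter path runs next to the base path on every forward pass.
base_out = matmul(W, x)
adapter_out = matmul(B, matmul(A, x))
unmerged = add_scaled(base_out, adapter_out, scale)

# Merged: fold the update into the weight once, then do a single matmul.
W_merged = add_scaled(W, matmul(B, A), scale)
merged = matmul(W_merged, x)

print(unmerged, merged)
```

Both paths give the same output, but the merged path needs only one matrix multiply per layer. That is the sense in which merging removes the adapter's runtime overhead.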

Save it as a standalone model

save_pretrained writes the weights in safetensors, the default format, which loads quickly and, unlike pickle-based checkpoints, cannot execute code on load. It also writes the config. Save the tokenizer alongside so the folder is self-contained.

merged.save_pretrained("sft-out/merged")     # model.safetensors + config.json
tok.save_pretrained("sft-out/merged")        # tokenizer files

import os
print(os.listdir("sft-out/merged"))
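Rather than reading the listing by eye, you could check the folder programmatically. The sketch below is one possible approach: missing_artifacts is a hypothetical helper, not part of transformers, and the three files it looks for are an assumption about what a minimal self-contained folder contains. The demo runs on a temporary folder with empty stand-in files, so it needs no real model.

```python
import os
import tempfile

def missing_artifacts(folder):
    """Return descriptions of expected pieces that are absent from folder."""
    names = set(os.listdir(folder))
    problems = []
    if "config.json" not in names:
        problems.append("config.json")
    if not any(n.endswith(".safetensors") for n in names):
        problems.append("*.safetensors weights")
    if "tokenizer_config.json" not in names:
        problems.append("tokenizer_config.json")
    return problems

# Demo on a throwaway folder with empty stand-in files (no real model needed).
with tempfile.TemporaryDirectory() as tmp:
    for name in ("config.json", "model.safetensors"):
        open(os.path.join(tmp, name), "w").close()
    report = missing_artifacts(tmp)

print(report)   # the stand-in folder has no tokenizer files
```

Pointed at sft-out/merged after both save_pretrained calls, the helper should return an empty list. A non-empty result usually means the tokenizer save was skipped.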

Reload and run — no peft needed

The merged folder is a normal model. Loading it requires only transformers; there's no adapter to attach.

m = AutoModelForCausalLM.from_pretrained("sft-out/merged", dtype=torch.bfloat16)
t = AutoTokenizer.from_pretrained("sft-out/merged")

msgs = [{"role": "user", "content": "Classify the sentiment as positive or negative: I would buy it again."}]
text = t.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
ids = t(text, return_tensors="pt")
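out = m.generate(**ids, max_new_tokens=8, do_sample=False)   # greedy decoding, at most 8 new tokens
new_tokens = out[0][ids["input_ids"].shape[1]:]              # drop the prompt; keep only the generated tokens
print(t.decode(new_tokens, skip_special_tokens=True).strip())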

Merge vs keep separate

Merge when you'll deploy this one task — simplest to serve, zero adapter overhead. Keep the adapter separate when you serve many tasks off one shared base — swap a few-megabyte adapter instead of loading a whole model per task.
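A rough back-of-the-envelope comparison makes the trade-off concrete. The sketch below assumes about 135M parameters stored at 2 bytes each (bf16) and a hypothetical 4 MB adapter; both figures are approximations chosen for illustration, and total_storage_mb is a made-up helper.

```python
def total_storage_mb(n_tasks, base_mb, adapter_mb, merged):
    """Approximate weight storage for serving n_tasks fine-tuned variants."""
    if merged:
        # One full standalone copy of the model per task.
        return n_tasks * base_mb
    # One shared base plus a small adapter per task.
    return base_mb + n_tasks * adapter_mb

# Rough figures: ~135M parameters at 2 bytes each (bf16) is about 270 MB.
BASE_MB = 135e6 * 2 / 1e6
ADAPTER_MB = 4.0     # hypothetical "few-megabyte" adapter

for n in (1, 10):
    print(n,
          total_storage_mb(n, BASE_MB, ADAPTER_MB, merged=True),
          total_storage_mb(n, BASE_MB, ADAPTER_MB, merged=False))
```

Under these assumptions, one task costs about the same either way (270 MB merged versus 274 MB separate). Ten tasks cost roughly 2,700 MB merged versus 310 MB with a shared base.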

Export formats for deployment

The safetensors folder runs anywhere transformers (or vLLM) runs — great for a GPU server. For edge / CPU deployment you'll often convert to GGUF (the llama.cpp format) and quantize, so the model runs efficiently on a laptop or phone. Conversion is a separate step with llama.cpp's tooling; the key point is that your merged model is the starting artifact for any of these paths. (BrewSLM automates export to GGUF / ONNX / vLLM — Track 3.)
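As a hedged illustration of that separate conversion step, the sketch below only builds the command line for llama.cpp's convert_hf_to_gguf.py script; it does not run it. The checkout location ("llama.cpp"), the output filename, and the gguf_convert_command helper are assumptions for illustration, and the script's options can change between releases, so check its --help output in your own checkout before relying on this.

```python
import sys

def gguf_convert_command(model_dir, out_file, llama_cpp_dir="llama.cpp", outtype="f16"):
    """Build (but do not run) the argument list for llama.cpp's HF-to-GGUF converter."""
    # Assumed location of the converter inside a local llama.cpp checkout.
    script = f"{llama_cpp_dir}/convert_hf_to_gguf.py"
    return [sys.executable, script, model_dir, "--outfile", out_file, "--outtype", outtype]

cmd = gguf_convert_command("sft-out/merged", "sft-out/merged-f16.gguf")
print(" ".join(cmd[1:]))
```

Once llama.cpp is cloned and its Python requirements are installed, the list could be passed to subprocess.run(cmd, check=True). Quantizing the resulting GGUF file is a further step that uses llama.cpp's own quantization tool.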

You've now done the entire by-hand pipeline: environment, model, data, tokenization, LoRA training, evaluation, and shipping. The final lesson stitches it into one script you can run start-to-finish.

Key terms

merge_and_unload
Folds the LoRA adapter into the base weights, returning a standalone model.
safetensors
The safe, fast default weight format save_pretrained writes.
save_pretrained / from_pretrained
Write / load a self-contained model + config (+ tokenizer).
standalone model
A merged model that needs no adapter at inference.
GGUF
The llama.cpp format for efficient CPU/edge inference (a separate conversion step).
