Merge the adapter, run inference, ship an artifact
After this lesson you can merge a LoRA adapter into the base model, save and reload a standalone fine-tuned model, run inference on it, and name the export formats available for deployment.
You have a fine-tuned adapter that beats the baseline. To use it in production you have two choices, both covered in Track 1: keep the base + adapter separate (handy for serving many tasks off one base), or merge the adapter into the weights to get a single standalone model. This lesson does the merge and ships an artifact.
Merge the adapter into the base
merge_and_unload adds the LoRA adapter's learned low-rank update directly to the base weights and returns an ordinary model with no adapter and no extra runtime overhead.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
model_id = "HuggingFaceTB/SmolLM2-135M-Instruct"
tok = AutoTokenizer.from_pretrained("sft-out/adapter")
base = AutoModelForCausalLM.from_pretrained(model_id, dtype=torch.bfloat16)
tuned = PeftModel.from_pretrained(base, "sft-out/adapter")
merged = tuned.merge_and_unload() # a plain AutoModelForCausalLM, no PEFT wrapper
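To see why a merged model behaves the same as base + adapter, here is a minimal NumPy sketch of the arithmetic behind a merge. It assumes the standard LoRA formulation, where the update is the product of two small matrices scaled by alpha / r. The sizes, the alpha value, and the random weights are made up for illustration and are not taken from the model above.

```python
import numpy as np

rng = np.random.default_rng(0)

d_in, d_out, r = 6, 4, 2      # toy sizes, chosen only for illustration
alpha = 4                     # LoRA scaling numerator (assumed value)
scale = alpha / r

W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(r, d_in))       # LoRA down-projection
B = rng.normal(size=(d_out, r))      # LoRA up-projection
x = rng.normal(size=(3, d_in))       # a batch of 3 input vectors

# Unmerged: the base path plus a separate low-rank adapter path.
y_adapter = x @ W.T + scale * (x @ A.T) @ B.T

# Merged: fold the update into the weight once, then use a single matmul.
W_merged = W + scale * (B @ A)
y_merged = x @ W_merged.T

print(np.allclose(y_adapter, y_merged))
```

The two outputs agree up to floating-point rounding, and the merged path needs one matrix multiply instead of three. That is the "no runtime overhead" part.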
Save it as a standalone model
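save_pretrained writes the weights in safetensors, the default format: it is safe because loading it executes no pickled code, and fast because the tensors can be memory-mapped. It also writes the config. Save the tokenizer alongside so the folder is self-contained.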
merged.save_pretrained("sft-out/merged") # model.safetensors + config.json
tok.save_pretrained("sft-out/merged") # tokenizer files
import os
print(os.listdir("sft-out/merged"))
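If you want more than an eyeball check of that listing, a small helper can report what's missing. This is a sketch: missing_files is a hypothetical helper, not part of transformers, and the expected file names are assumptions. Tokenizer files vary by tokenizer type, and larger models are saved as several sharded safetensors files rather than a single model.safetensors.

```python
from pathlib import Path

# Assumed file names; adjust to what your tokenizer and model actually write.
EXPECTED = ("config.json", "model.safetensors", "tokenizer_config.json")

def missing_files(folder, expected=EXPECTED):
    """Return the expected file names that are not present in `folder`."""
    present = {p.name for p in Path(folder).iterdir()}
    return [name for name in expected if name not in present]
```

Calling missing_files("sft-out/merged") after the save should return an empty list if the folder is self-contained under these assumptions.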
Reload and run — no peft needed
The merged folder is a normal model. Loading it requires only transformers; there's no adapter to attach.
m = AutoModelForCausalLM.from_pretrained("sft-out/merged", dtype=torch.bfloat16)
t = AutoTokenizer.from_pretrained("sft-out/merged")
msgs = [{"role": "user", "content": "Classify the sentiment as positive or negative: I would buy it again."}]
text = t.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
ids = t(text, return_tensors="pt")
out = m.generate(**ids, max_new_tokens=8, do_sample=False)  # greedy decoding
new_tokens = out[0][ids["input_ids"].shape[1]:]  # drop the echoed prompt tokens
print(t.decode(new_tokens, skip_special_tokens=True).strip())
Merge vs keep separate
Merge when you'll deploy this one task — simplest to serve, zero adapter overhead. Keep the adapter separate when you serve many tasks off one shared base — swap a few-megabyte adapter instead of loading a whole model per task.
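A rough back-of-the-envelope calculation makes the trade-off concrete. The sketch below counts weight storage only, at 2 bytes per parameter (bf16). The 1M-parameter adapter and the 10 tasks are hypothetical numbers picked for illustration; your adapter size depends on the rank and on which layers you adapted.

```python
def serving_bytes(n_tasks, base_params, adapter_params, bytes_per_param=2):
    """Weight storage for n_tasks under each strategy (bf16 = 2 bytes per parameter)."""
    # Merged: every task carries its own full copy of the model.
    merged = n_tasks * base_params * bytes_per_param
    # Separate: one shared base plus one small adapter per task.
    separate = (base_params + n_tasks * adapter_params) * bytes_per_param
    return merged, separate

# Illustrative numbers: a 135M-parameter base and a hypothetical 1M-parameter adapter.
merged, separate = serving_bytes(n_tasks=10, base_params=135_000_000, adapter_params=1_000_000)
print(f"merged copies:   {merged / 1e6:.0f} MB")    # 2700 MB
print(f"base + adapters: {separate / 1e6:.0f} MB")  # 290 MB
```

For a single task the difference is negligible, which is why merging is the simpler choice there. The gap grows with every task you add.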
Export formats for deployment
The safetensors folder runs anywhere transformers (or vLLM) runs — great for a GPU server. For edge / CPU deployment you'll often convert to GGUF (the llama.cpp format) and quantize, so the model runs efficiently on a laptop or phone. Conversion is a separate step with llama.cpp's tooling; the key point is that your merged model is the starting artifact for any of these paths. (BrewSLM automates export to GGUF / ONNX / vLLM — Track 3.)
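For orientation, here is roughly what the GGUF path looks like. This sketch only builds the two commands and prints them; it does not run anything. It assumes a recent llama.cpp checkout that provides the convert_hf_to_gguf.py script and the llama-quantize binary. Script names, flags, and quantization types have changed between llama.cpp versions, so check them against yours. gguf_commands and the output paths are hypothetical names used for illustration.

```python
import shlex

def gguf_commands(model_dir, out_prefix, quant="Q4_K_M"):
    """Build (but do not run) the convert and quantize commands as argv lists."""
    f16_path = f"{out_prefix}-f16.gguf"
    # Step 1: convert the merged Hugging Face folder to a 16-bit GGUF file.
    convert = ["python", "convert_hf_to_gguf.py", model_dir,
               "--outfile", f16_path, "--outtype", "f16"]
    # Step 2: quantize that file to a smaller type for CPU / edge inference.
    quantize = ["llama-quantize", f16_path, f"{out_prefix}-{quant}.gguf", quant]
    return convert, quantize

for cmd in gguf_commands("sft-out/merged", "sft-out/model"):
    print(shlex.join(cmd))
```

Note that the input is the merged folder from earlier in this lesson, not the adapter folder, which is why the merged model is the starting artifact.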
You've now done the entire by-hand pipeline: environment, model, data, tokenization, LoRA training, evaluation, and shipping. The final lesson stitches it into one script you can run start-to-finish.
Key terms
- merge_and_unload: Folds the LoRA adapter into the base weights, returning a standalone model.
- safetensors: The safe, fast default weight format save_pretrained writes.
- save_pretrained / from_pretrained: Write / load a self-contained model + config (+ tokenizer).
- standalone model: A merged model that needs no adapter at inference.
- GGUF: The llama.cpp format for efficient CPU/edge inference (a separate conversion step).