Tool-use / function-calling fine-tuning
After this lesson you can train a small LM to emit valid tool calls for a defined tool set, handle the "no tool needed" path, and evaluate the result with the two-number report: valid-tool-call rate plus argument-match accuracy.
Tool-use, also called function calling, is one of the SLM use cases where small fine-tuned models punch hardest. The task is narrow ("given a request and a tool catalogue, pick the right tool and fill in its arguments"), the output is structured JSON, and a fine-tuned 1B model can often match or beat a much bigger general model on that narrow task. This lesson shows how to train one.
Why tool-use suits an SLM
Three properties of tool routing fit small models well:
- Narrow output space. The model picks a tool from a fixed catalogue and emits arguments matching that tool's schema. A general 70B model knows the world; this task only needs the tool catalogue.
- Strict format. Output must parse as JSON and validate against a schema (Lesson 2.13's territory). Fine-tuning specifically for the format closes the gap quickly.
- Latency-critical. Tool calls run inside agent loops, so per-call latency adds up across every step of the loop. A small router fine-tune at 200 ms beats a frontier model at 2 s, every step, forever.
The data shape
Three roles, one assistant output:
- System: the tool catalogue — names, descriptions, JSON schemas of arguments.
- User: the natural-language request.
- Assistant: a JSON object, either a tool call ({"tool": "...", "arguments": {...}}) or the explicit no-tool path ({"tool": "no_tool", "reason": "..."}).
{
    "messages": [
        {
            "role": "system",
            "content": (
                "You route customer-support requests to tools. Available tools:\n"
                "- track_order(order_id: str): get the status of an order.\n"
                "- request_refund(order_id: str, reason: str): file a refund.\n"
                "- escalate(reason: str): hand off to a human agent.\n"
                "If no tool fits, respond with {\"tool\":\"no_tool\", \"reason\":\"...\"}."
            )
        },
        {"role": "user", "content": "Where's my order ORD-1234?"},
        {"role": "assistant", "content": '{"tool":"track_order","arguments":{"order_id":"ORD-1234"}}'}
    ]
}
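One way to produce rows in this shape is to keep each assistant target as a Python dict and serialize it with json.dumps, so every label is guaranteed to parse. The sketch below is a minimal illustration under that assumption; make_example and the shortened SYSTEM_PROMPT are hypothetical names, not part of the lesson's code.

```python
import json

# Hypothetical, shortened catalogue text; in practice reuse the full system prompt.
SYSTEM_PROMPT = (
    "You route customer-support requests to tools. Available tools:\n"
    "- track_order(order_id: str): get the status of an order.\n"
    'If no tool fits, respond with {"tool":"no_tool", "reason":"..."}.'
)


def make_example(user_text: str, call: dict) -> dict:
    """Build one chat-format training row whose assistant turn is compact JSON."""
    # separators=(",", ":") drops whitespace so all targets share one canonical form.
    assistant = json.dumps(call, separators=(",", ":"), ensure_ascii=False)
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": assistant},
        ]
    }


row = make_example(
    "Where's my order ORD-1234?",
    {"tool": "track_order", "arguments": {"order_id": "ORD-1234"}},
)
print(row["messages"][2]["content"])
# → {"tool":"track_order","arguments":{"order_id":"ORD-1234"}}
```

Serializing from a dict rather than hand-writing the JSON string keeps quoting and key order consistent across the whole training set.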
Define the schemas with Pydantic
Same pattern as Lesson 2.13. One model per tool, plus the dispatcher wrapper.
from pydantic import BaseModel, Field, ValidationError
from typing import Literal, Union
import json


class TrackOrder(BaseModel):
    tool: Literal["track_order"]
    arguments: dict = Field(...)

    class Args(BaseModel):
        order_id: str


class RequestRefund(BaseModel):
    tool: Literal["request_refund"]
    arguments: dict

    class Args(BaseModel):
        order_id: str
        reason: str


class Escalate(BaseModel):
    tool: Literal["escalate"]
    arguments: dict

    class Args(BaseModel):
        reason: str


class NoTool(BaseModel):
    tool: Literal["no_tool"]
    reason: str


ToolCall = Union[TrackOrder, RequestRefund, Escalate, NoTool]

# Argument validation by tool, applied after Pydantic parses the outer object.
_ARG_MODELS = {
    "track_order": TrackOrder.Args,
    "request_refund": RequestRefund.Args,
    "escalate": Escalate.Args,
}


def parse_tool_call(text: str) -> ToolCall | None:
    try:
        data = json.loads(text)
        if not isinstance(data, dict):
            return None  # valid JSON that is not an object, e.g. [] or "hi"
        tool = data.get("tool")
        if tool == "no_tool":
            return NoTool(**data)
        cls = {"track_order": TrackOrder, "request_refund": RequestRefund,
               "escalate": Escalate}.get(tool)
        if cls is None:
            return None  # tool name not in the catalogue
        obj = cls(**data)
        _ARG_MODELS[tool](**obj.arguments)  # validate arguments shape
        return obj
    except (json.JSONDecodeError, ValidationError, TypeError):
        return None
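An alternative worth knowing, sketched here under the assumption that Pydantic v2 is installed: a discriminated union lets Pydantic pick the model from the tool field and validate typed arguments in one step, which removes the hand-written dispatch. This is not the lesson's parser; the class names, AnyCall, and parse_call are hypothetical, and only two tools are shown to keep it short.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field, TypeAdapter, ValidationError


class TrackOrderArgs(BaseModel):
    order_id: str


class TrackOrderCall(BaseModel):
    tool: Literal["track_order"]
    arguments: TrackOrderArgs  # typed, so arguments validate with the outer object


class NoToolCall(BaseModel):
    tool: Literal["no_tool"]
    reason: str


# The "tool" field selects which model the payload is validated against.
AnyCall = Annotated[Union[TrackOrderCall, NoToolCall], Field(discriminator="tool")]
_ADAPTER = TypeAdapter(AnyCall)


def parse_call(text: str):
    """Return a validated call object, or None for any malformed output."""
    try:
        return _ADAPTER.validate_json(text)
    except ValidationError:  # covers invalid JSON, unknown tools, and bad arguments
        return None


ok = parse_call('{"tool":"track_order","arguments":{"order_id":"ORD-1234"}}')
missing_arg = parse_call('{"tool":"track_order","arguments":{}}')
unknown_tool = parse_call('{"tool":"delete_account","arguments":{}}')
not_json = parse_call("Sure! Let me check that order for you.")
```

Only ok comes back as an object here; the other three return None, which is the same contract as the hand-rolled parser.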
The no-tool path is training data, not an exception
The single most common tool-router failure is calling a tool every time. If your training data only contains positive examples ("user asked X, model called Y"), the model learns that every input produces a tool call. Reality has plenty of inputs where the right answer is "I need more information" or "this isn't a request I can fulfil." Include those, labelled with no_tool.
# Negative examples: vary the no-tool reason
{
    "messages": [
        {"role": "system", "content": <tool catalogue>},
        {"role": "user", "content": "Hi!"},  # ambient chat
        {"role": "assistant", "content": '{"tool":"no_tool","reason":"greeting, no action needed"}'}
    ]
},
{
    "messages": [
        {"role": "system", "content": <tool catalogue>},
        {"role": "user", "content": "What's your refund policy?"},  # info request, not action
        {"role": "assistant", "content": '{"tool":"no_tool","reason":"answer directly; no tool fits"}'}
    ]
},
{
    "messages": [
        {"role": "system", "content": <tool catalogue>},
        {"role": "user", "content": "I want a refund."},  # underspecified
        {"role": "assistant", "content": '{"tool":"no_tool","reason":"need order_id and reason before request_refund"}'}
    ]
}
A useful ratio is roughly 10–20% no-tool examples in the training set. Less and the bias toward "always call something" wins; more and the model starts hedging on real tool calls.
Train: same SFTTrainer pattern
The training loop is Lesson 2.10 with this data shape. Use greedy decoding at inference (Lesson 1.19); variety isn't a virtue in a router.
from datasets import Dataset
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

# `model`, `tok`, and `raw_routing_data` are assumed to be defined earlier.
ds = Dataset.from_list(raw_routing_data)  # ~80% positives, ~20% no_tool

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Argument names vary across TRL releases (newer versions name max_seq_length
# max_length, for example); check SFTConfig for your installed version.
trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="tool-router",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        learning_rate=2e-4,
        num_train_epochs=3,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        logging_steps=5,
        save_strategy="epoch",
        max_seq_length=1024,
        completion_only_loss=True,
        report_to=[],
    ),
    train_dataset=ds,
    processing_class=tok,
    peft_config=lora,
)
trainer.train()
trainer.save_model("tool-router/adapter")
Evaluate: the two-number report (tool-flavoured)
Same discipline as Lesson 2.13's structured-output eval, retargeted. The two numbers that matter:
- Valid-tool-call rate: fraction of model outputs that parse as JSON, name a tool from the catalogue (including no_tool), and have arguments matching that tool's schema.
- Argument-match accuracy on valid calls: fraction of valid calls where the arguments match the gold for the same request.
def eval_router(model, tok, gold: list[dict]) -> dict:
    """gold rows: {"prompt": "...", "target": {"tool": "...", "arguments": {...}}}"""
    valid_calls = 0
    arg_matches = 0
    tool_matches = 0
    no_tool_calls_wrongly_made = 0  # false-positive routing
    for ex in gold:
        # predict(): greedy-decoding helper, assumed to be defined elsewhere
        text = predict(model, tok, ex["prompt"])
        parsed = parse_tool_call(text)
        if parsed is None:
            continue  # invalid output
        valid_calls += 1
        target = ex["target"]
        if parsed.tool == target["tool"]:
            tool_matches += 1
            # A correct no_tool has no arguments to compare, so it counts as a
            # match; otherwise correct refusals would drag the accuracy down.
            if parsed.tool == "no_tool" or parsed.arguments == target["arguments"]:
                arg_matches += 1
        elif target["tool"] == "no_tool":
            no_tool_calls_wrongly_made += 1
    n = max(1, len(gold))  # guard against an empty gold set
    return {
        "valid_tool_call_rate": valid_calls / n,
        "tool_name_accuracy": tool_matches / n,
        "argument_match_accuracy": arg_matches / max(1, valid_calls),
        "false_tool_call_rate": no_tool_calls_wrongly_made / n,
    }
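Report all four: the two headline numbers plus tool-name accuracy and false-tool-call rate. Valid-tool-call rate and tool-name accuracy tell you whether the router routes; argument-match accuracy tells you whether the arguments are right; false-tool-call rate tells you whether the no-tool path is working.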
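To make the arithmetic concrete, here is a small worked example on five hand-written cases. It is a self-contained sketch rather than the lesson's eval_router: score is a hypothetical helper that takes already-parsed predictions (None for invalid output) and, as an assumption, counts a correct no_tool as an argument match.

```python
def score(preds: list, golds: list[dict]) -> dict:
    """preds: parsed dicts, or None for outputs that failed parsing/validation."""
    valid = tool_ok = args_ok = false_calls = 0
    for pred, gold in zip(preds, golds):
        if pred is None:
            continue  # invalid output counts against valid_tool_call_rate only
        valid += 1
        if pred["tool"] == gold["tool"]:
            tool_ok += 1
            # no_tool rows have no "arguments" on either side, so None == None.
            if pred.get("arguments") == gold.get("arguments"):
                args_ok += 1
        elif gold["tool"] == "no_tool":
            false_calls += 1  # model called a real tool when it should not have
    n = max(1, len(golds))
    return {
        "valid_tool_call_rate": valid / n,
        "tool_name_accuracy": tool_ok / n,
        "argument_match_accuracy": args_ok / max(1, valid),
        "false_tool_call_rate": false_calls / n,
    }


golds = [
    {"tool": "track_order", "arguments": {"order_id": "ORD-1"}},
    {"tool": "request_refund", "arguments": {"order_id": "ORD-2", "reason": "damaged"}},
    {"tool": "no_tool", "reason": "greeting"},
    {"tool": "no_tool", "reason": "policy question"},
    {"tool": "track_order", "arguments": {"order_id": "ORD-5"}},
]
preds = [
    {"tool": "track_order", "arguments": {"order_id": "ORD-1"}},                     # exact match
    {"tool": "request_refund", "arguments": {"order_id": "ORD-2", "reason": "late"}},  # wrong argument
    {"tool": "escalate", "arguments": {"reason": "greeting"}},                       # false tool call
    {"tool": "no_tool", "reason": "no action needed"},                               # correct refusal
    None,                                                                            # unparseable output
]

report = score(preds, golds)
print(report)
```

On these five cases the report is 0.8 valid, 0.6 tool-name accuracy, 0.5 argument match (2 of 4 valid calls), and 0.2 false tool calls.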
Honest beat — the right kind of failure
A router that fails by emitting invalid JSON is a bug your client (the agent loop) can detect and retry. A router that fails by emitting valid JSON for the wrong tool with plausible arguments is a silent corruption that calls real systems with real consequences. Prefer parse failures to wrong calls. If you have to trade, train and decode toward the more conservative behaviour: greedy or near-zero-temperature decoding, repetition_penalty left at or near 1.0 (a stronger penalty can discourage the repeated quotes, braces, and copied IDs that a valid call needs), and explicit no_tool examples for ambiguous prompts.
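The "detect and retry" half of that argument can be sketched as a small wrapper around the router. Everything here is illustrative: route_with_retry, fake_generate, and simple_parse are hypothetical stand-ins, and the attempt index is passed to the generator on the assumption that a retry changes something (a stricter prompt, a fallback model), since repeating an identical greedy decode would repeat the same output.

```python
import json
from typing import Callable, Optional


def route_with_retry(
    generate_fn: Callable[[str, int], str],
    parse_fn: Callable[[str], Optional[dict]],
    request: str,
    max_attempts: int = 2,
) -> Optional[dict]:
    """Return the first output that parses, or None so the caller can fall back."""
    for attempt in range(max_attempts):
        parsed = parse_fn(generate_fn(request, attempt))
        if parsed is not None:
            return parsed
    return None


def fake_generate(request: str, attempt: int) -> str:
    # Stand-in for the model: prose on the first try, valid JSON on the second.
    if attempt == 0:
        return "Sure, let me look that up!"
    return '{"tool":"track_order","arguments":{"order_id":"ORD-1234"}}'


def simple_parse(text: str) -> Optional[dict]:
    # Stand-in for a real validator: accept any JSON object with a "tool" key.
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) and "tool" in data else None


call = route_with_retry(fake_generate, simple_parse, "Where's my order ORD-1234?")
gave_up = route_with_retry(fake_generate, simple_parse, "Where's my order ORD-1234?", max_attempts=1)
```

A wrong-but-valid call never reaches this wrapper's failure branch, which is why the retry loop cannot compensate for a missing no-tool path.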
Inference — parse, validate, call
def route(model, tok, user_request: str, system_prompt: str) -> ToolCall | None:
    msgs = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    # generate(): chat-template + decoding helper, assumed to be defined elsewhere
    text = generate(model, tok, msgs, max_new_tokens=128, do_sample=False)
    return parse_tool_call(text)


def handle_request(user_request: str):
    # Wrapped in a function so the early returns below are valid Python.
    call = route(model, tok, user_request, SYSTEM_PROMPT)
    if call is None:
        log_invalid_output()  # bug / training gap
        return ask_user_to_rephrase()
    if isinstance(call, NoTool):
        return answer_directly(call.reason)
    # At this point call is validated; dispatch to the real tool.
    return TOOLS[call.tool](**call.arguments)


result = handle_request("Where's my order ORD-1234?")
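The dispatch step above assumes a TOOLS mapping from tool name to a callable. A minimal sketch of what that could look like follows; the three handler bodies are stubs invented for illustration, and dispatch is a hypothetical helper that fails loudly on an unknown name instead of relying on a bare dictionary lookup.

```python
def track_order(order_id: str) -> str:
    return f"status lookup for {order_id}"  # stub: a real handler would query the order system


def request_refund(order_id: str, reason: str) -> str:
    return f"refund filed for {order_id}: {reason}"  # stub


def escalate(reason: str) -> str:
    return f"handed off to a human agent: {reason}"  # stub


# Keys must match the tool names used in the system prompt and the schemas.
TOOLS = {
    "track_order": track_order,
    "request_refund": request_refund,
    "escalate": escalate,
}


def dispatch(tool: str, arguments: dict) -> str:
    """Call the handler for a validated tool call."""
    handler = TOOLS.get(tool)
    if handler is None:
        raise KeyError(f"unknown tool: {tool}")
    # Keyword expansion means argument names must match the handler signature.
    return handler(**arguments)


result = dispatch("track_order", {"order_id": "ORD-1234"})
print(result)  # → status lookup for ORD-1234
```

Keeping handler parameter names identical to the schema field names is what lets the validated arguments dict be passed straight through.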
Key idea
Tool-use fine-tuning is a structured-output SFT with three deliberate features: tool schemas in the system message, explicit no_tool negatives (10–20% of the training set), and a two-number eval (valid-tool-call rate + argument-match accuracy on the valid calls). A small router fine-tune beats a much bigger general model on latency and cost, every call, forever — provided you teach the no-tool path.
That ends the v2 expansion of Track 4. You can now build a production-grade SLM workflow end-to-end: from picking a base, through SFT and the by-hand pipeline, through the platform, into distillation, preference tuning, quantization, multi-task, serving, observability with a feedback loop, and tool-use routing. Pick a project and ship it.
Key terms
- Tool call
- A structured JSON output naming a tool (function) from a fixed catalogue and supplying validated arguments for it.
- Function calling
- Synonym for tool-use; the term used by OpenAI and other vendors for their structured tool-invocation APIs.
- Tool schema
- A description (name + argument schema) of a callable function the model can target; Pydantic models are a clean way to declare them.
- No-tool path
- An explicit assistant output ({"tool":"no_tool","reason":"..."}) that says "no available tool fits"; trained on negative examples in the SFT data.
- Valid-tool-call rate
- Fraction of outputs that parse as JSON, name a real tool, and have arguments matching that tool's schema.
- Argument-match accuracy
- On the parses that succeed, the fraction with arguments matching the gold; the second of the two-number report.
- False-tool-call rate
- Fraction of inputs where the gold says no_tool but the model called a real tool; the most damaging routing failure mode.