Track 4 · Advanced · Lesson 12

Tool-use / function-calling fine-tuning

After this lesson you can train a small LM to emit valid tool calls for a defined tool set, handle the "no tool needed" path, and evaluate the result with the two-number report: valid-tool-call rate plus argument-match accuracy.

Level: advanced · Read time: ~10 min · Prerequisites: Structured outputs with pydantic

Tool-use, also called function calling, is one of the SLM use cases where small fine-tuned models punch hardest. The task is narrow ("given a request and a tool catalogue, pick the right tool and fill in its arguments"), the output is structured JSON, and a fine-tuned 1B model often beats a much bigger general model on that narrow task. This lesson shows how to train one.

Why tool-use suits an SLM

Three properties of tool routing fit small models well:

The data shape

Three roles, one assistant output:

{
  "messages": [
    {
      "role": "system",
      "content": (
        "You route customer-support requests to tools. Available tools:\n"
        "- track_order(order_id: str): get the status of an order.\n"
        "- request_refund(order_id: str, reason: str): file a refund.\n"
        "- escalate(reason: str): hand off to a human agent.\n"
        "If no tool fits, respond with {\"tool\":\"no_tool\", \"reason\":\"...\"}."
      )
    },
    {"role": "user",      "content": "Where's my order ORD-1234?"},
    {"role": "assistant", "content": '{"tool":"track_order","arguments":{"order_id":"ORD-1234"}}'}
  ]
}
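One way to keep the assistant turn machine-valid is to build each row from Python objects and serialise the call with json.dumps instead of hand-writing the string. The sketch below is illustrative only: make_example is a hypothetical helper, and SYSTEM_PROMPT is an abbreviated stand-in for the full catalogue shown above.

```python
import json

# Hypothetical, abbreviated stand-in for the tool catalogue in the system message.
SYSTEM_PROMPT = "You route customer-support requests to tools. Available tools: (catalogue here)"

def make_example(user_text: str, tool: str, arguments: dict) -> dict:
    """Build one chat-format training row whose assistant turn is a JSON tool call."""
    call = {"tool": tool, "arguments": arguments}
    return {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_text},
            # Compact separators reproduce the '{"tool":"track_order",...}' style.
            {"role": "assistant", "content": json.dumps(call, separators=(",", ":"))},
        ]
    }

row = make_example("Where's my order ORD-1234?", "track_order", {"order_id": "ORD-1234"})
print(row["messages"][-1]["content"])
# {"tool":"track_order","arguments":{"order_id":"ORD-1234"}}
```

Serialising from a dict means a typo in the gold data shows up as a Python error at build time, not as an invalid target the model learns to imitate.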

Define the schemas with Pydantic

Same pattern as Lesson 2.13. One model per tool, plus the dispatcher wrapper.

from pydantic import BaseModel, Field, ValidationError
from typing import Literal, Union
import json

class TrackOrder(BaseModel):
    tool: Literal["track_order"]
    arguments: dict = Field(...)
    class Args(BaseModel):
        order_id: str

class RequestRefund(BaseModel):
    tool: Literal["request_refund"]
    arguments: dict
    class Args(BaseModel):
        order_id: str
        reason: str

class Escalate(BaseModel):
    tool: Literal["escalate"]
    arguments: dict
    class Args(BaseModel):
        reason: str

class NoTool(BaseModel):
    tool: Literal["no_tool"]
    reason: str

ToolCall = Union[TrackOrder, RequestRefund, Escalate, NoTool]

# Argument validation by tool — applied after Pydantic parses the outer object.
_ARG_MODELS = {
    "track_order":    TrackOrder.Args,
    "request_refund": RequestRefund.Args,
    "escalate":       Escalate.Args,
}

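def parse_tool_call(text: str) -> ToolCall | None:
    """Return a validated tool call, or None if the output is not one."""
    try:
        data = json.loads(text)
        if not isinstance(data, dict):            # valid JSON, but a bare list/string/number
            return None
        tool = data.get("tool")
        if tool == "no_tool":
            return NoTool(**data)
        cls = {"track_order": TrackOrder, "request_refund": RequestRefund,
               "escalate":    Escalate}.get(tool)
        if cls is None:
            return None
        obj = cls(**data)
        # Note: Pydantic models ignore unknown keys by default, so extra
        # arguments pass this check unless the Args models forbid them.
        _ARG_MODELS[tool](**obj.arguments)        # validate arguments shape
        return obj
    except (json.JSONDecodeError, ValidationError, TypeError):
        return None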

The no-tool path is training data, not an exception

The single most common tool-router failure is calling a tool every time. If your training data only contains positive examples ("user asked X, model called Y"), the model learns that every input produces a tool call. Reality has plenty of inputs where the right answer is "I need more information" or "this isn't a request I can fulfil." Include those, labelled with no_tool.

# Negative examples: vary the no-tool reason.
# SYSTEM_PROMPT holds the same tool catalogue as the system message above.
negative_examples = [
  {
    "messages": [
      {"role": "system",    "content": SYSTEM_PROMPT},
      {"role": "user",      "content": "Hi!"},                          # ambient chat
      {"role": "assistant", "content": '{"tool":"no_tool","reason":"greeting, no action needed"}'}
    ]
  },
  {
    "messages": [
      {"role": "system",    "content": SYSTEM_PROMPT},
      {"role": "user",      "content": "What's your refund policy?"},   # info request, not action
      {"role": "assistant", "content": '{"tool":"no_tool","reason":"answer directly; no tool fits"}'}
    ]
  },
  {
    "messages": [
      {"role": "system",    "content": SYSTEM_PROMPT},
      {"role": "user",      "content": "I want a refund."},             # underspecified
      {"role": "assistant", "content": '{"tool":"no_tool","reason":"need order_id and reason before request_refund"}'}
    ]
  },
]

A useful ratio is roughly 10–20% no-tool examples in the training set. Less and the bias toward "always call something" wins; more and the model starts hedging on real tool calls.
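The ratio can be enforced when the dataset is assembled. The sketch below is one possible approach, not the lesson's own tooling: mix_with_no_tool is a hypothetical helper that picks enough no_tool rows to reach a target share, and the toy rows stand in for real chat examples.

```python
import random

def mix_with_no_tool(positives, negatives, no_tool_frac=0.15, seed=0):
    """Return a shuffled list in which about `no_tool_frac` of the rows are no_tool examples."""
    if not 0 < no_tool_frac < 1:
        raise ValueError("no_tool_frac must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Solve k / (len(positives) + k) = no_tool_frac for k.
    k = round(no_tool_frac * len(positives) / (1 - no_tool_frac))
    k = min(k, len(negatives))  # cannot use more negatives than are available
    mixed = list(positives) + rng.sample(list(negatives), k)
    rng.shuffle(mixed)
    return mixed

# Toy rows, for illustration only.
positives = [{"kind": "tool"}] * 85
negatives = [{"kind": "no_tool"}] * 40
mixed = mix_with_no_tool(positives, negatives, no_tool_frac=0.15)
share = sum(r["kind"] == "no_tool" for r in mixed) / len(mixed)
print(len(mixed), round(share, 2))
# 100 0.15
```

If there are too few negatives to reach the target, the helper uses all of them and the share falls short, which is a signal to write more no_tool examples and not to drop positives.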

Train: same SFTTrainer pattern

The training loop is Lesson 2.10 with this data shape. Greedy decoding at inference (Lesson 1.19) — variety isn't a virtue in a router.

from datasets import Dataset
from trl import SFTConfig, SFTTrainer
from peft import LoraConfig

ds = Dataset.from_list(raw_routing_data)          # ~80% positives, ~20% no_tool

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="tool-router",
        per_device_train_batch_size=8,
        gradient_accumulation_steps=2,
        learning_rate=2e-4,
        num_train_epochs=3,
        lr_scheduler_type="cosine",
        warmup_ratio=0.03,
        bf16=True,
        logging_steps=5,
        save_strategy="epoch",
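        max_seq_length=1024,              # renamed max_length in recent TRL releases
        completion_only_loss=True,        # applies to prompt/completion rows; for "messages"
                                          # rows recent TRL offers assistant_only_loss (check your version)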
        report_to=[],
    ),
    train_dataset=ds,
    processing_class=tok,
    peft_config=lora,
)
trainer.train()
trainer.save_model("tool-router/adapter")

Evaluate: the two-number report (tool-flavoured)

Same discipline as Lesson 2.13's structured-output eval, retargeted. The two numbers that matter:

  1. Valid-tool-call rate: fraction of model outputs that parse as JSON, name a tool from the catalogue (including no_tool), and have arguments matching that tool's schema.
  2. Argument-match accuracy on valid calls: fraction of valid calls where the arguments match the gold for the same request.

def eval_router(model, tok, gold: list[dict]) -> dict:
    """gold rows: {"prompt": "...", "target": {"tool": "...", "arguments": {...}}}

    no_tool targets have no "arguments" key.
    """
    valid_calls = 0
    arg_matches = 0
    tool_matches = 0
    no_tool_calls_wrongly_made = 0       # false-positive routing
    for ex in gold:
        text = predict(model, tok, ex["prompt"])   # generation helper, assumed defined elsewhere
        parsed = parse_tool_call(text)
        if parsed is None:
            continue                      # invalid output
        valid_calls += 1
        target = ex["target"]
        if parsed.tool == target["tool"]:
            tool_matches += 1
            # A correct no_tool has no arguments to get wrong, so it counts as a match;
            # otherwise every correct no_tool would drag the accuracy down.
            if parsed.tool == "no_tool" or parsed.arguments == target["arguments"]:
                arg_matches += 1
        elif target["tool"] == "no_tool":
            no_tool_calls_wrongly_made += 1

    n = max(1, len(gold))                 # avoid dividing by zero on an empty gold set
    return {
        "valid_tool_call_rate":     valid_calls / n,
        "tool_name_accuracy":       tool_matches / n,
        "argument_match_accuracy":  arg_matches / max(1, valid_calls),
        "false_tool_call_rate":     no_tool_calls_wrongly_made / n,
    }

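Report all four values the function returns: the two headline numbers plus tool-name accuracy and the false-tool-call rate. The first two entries tell you whether the router routes; the third tells you whether the arguments are right; the fourth tells you whether the no-tool path is working.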
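The aggregate rates say how often routing goes wrong; a per-tool breakdown says where. The sketch below is an optional companion to the report, not part of it: routing_confusion is a hypothetical helper that counts (gold tool, predicted tool) pairs, and the toy results are invented for illustration.

```python
from collections import Counter

def routing_confusion(pairs):
    """Count (gold_tool, predicted_tool) pairs; a None prediction means invalid output."""
    counts = Counter()
    for gold_tool, predicted_tool in pairs:
        label = predicted_tool if predicted_tool is not None else "<invalid>"
        counts[(gold_tool, label)] += 1
    return counts

# Toy results, invented for illustration only.
pairs = [
    ("track_order", "track_order"),
    ("track_order", "track_order"),
    ("request_refund", "escalate"),   # wrong tool
    ("no_tool", "request_refund"),    # false tool call
    ("no_tool", "no_tool"),
    ("escalate", None),               # unparseable output
]
confusion = routing_confusion(pairs)
for (gold, pred), n in sorted(confusion.items()):
    print(f"{gold:15s} -> {pred:15s} {n}")
```

Off-diagonal rows with a gold of no_tool are the false tool calls; rows ending in the invalid label are the parse failures the agent loop can catch.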

Honest beat — the right kind of failure

A router that fails by emitting invalid JSON is a bug your client (the agent loop) can detect and retry. A router that fails by emitting valid JSON for the wrong tool with plausible arguments is a silent corruption that calls real systems with real consequences. Prefer parse failures to wrong calls. If you have to trade, train and decode toward the more conservative behaviour: greedy or near-zero-temperature decoding, repetition_penalty left at or near its neutral value of 1.0 (JSON legitimately repeats quotes, braces, and keys, so a strong penalty can corrupt it), and explicit no_tool examples for ambiguous prompts.
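The detect-and-retry side of that trade could look like the sketch below. It is an assumption-laden illustration: route_with_retry, fake_generate, and fake_parse are hypothetical names, and the corrective hint is one possible way to change the input, since greedy decoding would return the same invalid output for an identical prompt. Whether such a hint helps a narrowly fine-tuned router is an empirical question.

```python
import json
from typing import Callable, Optional

def route_with_retry(
    generate_fn: Callable[[str], str],
    parse_fn: Callable[[str], Optional[dict]],
    user_request: str,
    max_attempts: int = 2,
) -> Optional[dict]:
    """Call the router; on an unparseable output, retry with a corrective hint appended."""
    prompt = user_request
    for _ in range(max_attempts):
        parsed = parse_fn(generate_fn(prompt))
        if parsed is not None:
            return parsed
        # Greedy decoding is deterministic, so the retry must change the prompt.
        prompt = user_request + "\n(Reply with a single JSON tool call only.)"
    return None  # caller falls back, e.g. asks the user to rephrase

# Stand-ins for the model and the parser, for illustration only.
def fake_generate(prompt: str) -> str:
    if "JSON tool call only" in prompt:
        return '{"tool":"track_order","arguments":{"order_id":"ORD-1234"}}'
    return "Sure! Let me check that for you."

def fake_parse(text: str) -> Optional[dict]:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) and "tool" in data else None

result = route_with_retry(fake_generate, fake_parse, "Where's my order ORD-1234?")
print(result)
```

The retry only covers the detectable failure. A valid call to the wrong tool passes straight through, which is why the no_tool training examples matter more than any client-side guard.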

Inference — parse, validate, call

def route(model, tok, user_request: str, system_prompt: str) -> ToolCall | None:
    msgs = [
        {"role": "system", "content": system_prompt},
        {"role": "user",   "content": user_request},
    ]
    text = generate(model, tok, msgs, max_new_tokens=128, do_sample=False)
    return parse_tool_call(text)

def handle_request(user_request: str):
    call = route(model, tok, user_request, SYSTEM_PROMPT)
    if call is None:
        log_invalid_output()                          # bug / training gap
        return ask_user_to_rephrase()
    if isinstance(call, NoTool):
        return answer_directly(call.reason)
    # At this point call is validated; dispatch to the real tool.
    result = TOOLS[call.tool](**call.arguments)
    return result

handle_request("Where's my order ORD-1234?")
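The TOOLS lookup used for dispatch can be a plain dictionary from tool name to callable. The sketch below assumes hypothetical tool implementations that only return strings; real ones would call your order and ticketing systems.

```python
from typing import Any, Callable, Dict

# Hypothetical tool implementations, for illustration only.
def track_order(order_id: str) -> str:
    return f"status lookup for {order_id}"

def request_refund(order_id: str, reason: str) -> str:
    return f"refund filed for {order_id}: {reason}"

def escalate(reason: str) -> str:
    return f"escalated to a human: {reason}"

# Registry keyed by the same names the model emits in the "tool" field.
TOOLS: Dict[str, Callable[..., Any]] = {
    "track_order": track_order,
    "request_refund": request_refund,
    "escalate": escalate,
}

def dispatch(tool: str, arguments: Dict[str, Any]) -> Any:
    """Look up a validated tool name and call it with keyword arguments."""
    if tool not in TOOLS:
        raise KeyError(f"unknown tool: {tool}")
    return TOOLS[tool](**arguments)

print(dispatch("track_order", {"order_id": "ORD-1234"}))
# status lookup for ORD-1234
```

Keeping the registry keys identical to the catalogue names in the system prompt means a renamed tool fails loudly at dispatch and does not silently route elsewhere.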

Key idea

Tool-use fine-tuning is a structured-output SFT with three deliberate features: tool schemas in the system message, explicit no_tool negatives (10–20% of the training set), and a two-number eval (valid-tool-call rate + argument-match accuracy on the valid calls). A small router fine-tune beats a much bigger general model on latency and cost, every call, forever — provided you teach the no-tool path.

That ends the v2 expansion of Track 4. You can now build a production-grade SLM workflow end-to-end: from picking a base, through SFT and the by-hand pipeline, through the platform, into distillation, preference tuning, quantization, multi-task, serving, observability with a feedback loop, and tool-use routing. Pick a project and ship it.

Key terms

Tool call
A structured JSON output naming a tool (function) from a fixed catalogue and supplying validated arguments for it.
Function calling
Synonym for tool-use; the term OpenAI and other vendors use for the API feature in which the model returns a structured tool invocation.
Tool schema
A description (name + argument schema) of a callable function the model can target; Pydantic models are a clean way to declare them.
No-tool path
An explicit assistant output ({"tool":"no_tool","reason":"..."}) that says "no available tool fits"; trained on negative examples in the SFT data.
Valid-tool-call rate
Fraction of outputs that parse as JSON, name a real tool, and have arguments matching that tool's schema.
Argument-match accuracy
On the parses that succeed, the fraction with arguments matching the gold; the second of the two-number report.
False-tool-call rate
Fraction of inputs where the gold says no_tool but the model called a real tool; the most damaging routing failure mode.
