Structured outputs with Pydantic
After this lesson you can validate a fine-tuned model's JSON outputs against a Pydantic schema, compute the valid-JSON rate, measure per-field accuracy on the parses that succeed, and report both — the honest two-number summary for any structured-output model.
A huge fraction of useful SLM work emits JSON: extracting fields from an invoice, classifying intent with a confidence and a routed tool, parsing structured details from free text. For those models the question "is the output right?" splits into two: did it parse? and do the fields match? A model with 95% parse rate and 60% field accuracy on what does parse is, in production, broken — and reporting only one of those two numbers hides that.
Why "valid JSON" alone isn't enough
Three things can go wrong with a structured output:
- The string isn't valid JSON (missing comma, trailing text, hallucinated commentary).
- The JSON is valid but the schema is wrong (missing required fields, wrong types).
- The schema is right but the values are wrong (parses fine, but the invoice_number is hallucinated).
You need a validator that catches the first two and a comparison that scores the third. Pydantic (together with json.loads) gives you the validator; you build the comparison.
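To make the three failure modes concrete, here is a minimal sketch using only the standard library. The sample strings and the helper name `is_valid_json` are invented for illustration; the point is that a bare JSON check flags only the first mode and lets the other two through.

```python
import json

# Hypothetical model outputs, one per failure mode.
outputs = {
    # 1. Not valid JSON: leading prose and a missing closing brace.
    "bad_json": 'Sure! Here is the invoice: {"invoice_number": "INV-7", "total": 40.0',
    # 2. Valid JSON, wrong schema: no date, and total is not a number.
    "bad_schema": '{"invoice_number": "INV-7", "total": "forty"}',
    # 3. Valid JSON, right shape, but suppose the true invoice number is INV-7.
    "bad_value": '{"invoice_number": "INV-9", "date": "2024-03-01", "total": 40.0}',
}

def is_valid_json(text: str) -> bool:
    """Return True if the string parses as JSON, regardless of its shape."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

json_ok = {name: is_valid_json(text) for name, text in outputs.items()}
print(json_ok)
```

Only `bad_json` is rejected here. Catching `bad_schema` takes schema validation, and catching `bad_value` takes a comparison against gold data.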
Define the schema with Pydantic
A Pydantic model describes the shape: field names, types, required vs optional, nested structures. Validation is one call.
from pydantic import BaseModel, ValidationError
from typing import List, Optional

class LineItem(BaseModel):
    description: str
    amount: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    total: float
    line_items: List[LineItem] = []
    customer: Optional[str] = None
Required fields have no default. Optional fields have a default (often None). Types are checked at parse time. Nested types compose.
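The rules above can be checked directly. The following is a small sketch, assuming Pydantic v2 (it uses `model_validate` and the `errors()` list on `ValidationError`); the invoice values are made up. It repeats the schema so the block runs on its own.

```python
from typing import List, Optional
from pydantic import BaseModel, ValidationError

class LineItem(BaseModel):
    description: str
    amount: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    total: float
    line_items: List[LineItem] = []
    customer: Optional[str] = None

# Complete input: the optional fields fall back to their defaults.
ok = Invoice.model_validate(
    {"invoice_number": "INV-001", "date": "2024-03-01", "total": 12.5}
)

# Missing required fields: collect (field path, error type) pairs.
try:
    Invoice.model_validate({"invoice_number": "INV-002"})
    missing_errors = []
except ValidationError as exc:
    missing_errors = [(e["loc"], e["type"]) for e in exc.errors()]

# Wrong type inside a nested item: the error path points into the list.
try:
    Invoice.model_validate({
        "invoice_number": "INV-003", "date": "2024-03-01", "total": 5.0,
        "line_items": [{"description": "widget", "amount": "n/a"}],
    })
    nested_errors = []
except ValidationError as exc:
    nested_errors = [(e["loc"], e["type"]) for e in exc.errors()]

print(ok.customer, ok.line_items)
print(missing_errors)
print(nested_errors)
```

The error paths (`loc`) are useful later: when schema failures pile up, counting them by field shows which part of the shape the model has learned wrong.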
Parse with validation
Wrap parsing in a single function that returns a result holding either the validated object or None plus the reason it failed. The two failure modes (bad JSON, bad schema) raise distinct exceptions; record which one it was.
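import json
from dataclasses import dataclass

@dataclass
class ParseResult:
    obj: Optional[Invoice]
    failure: Optional[str]  # "json", "schema", or None on success

def parse(text: str) -> ParseResult:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return ParseResult(None, "json")
    try:
        # model_validate (Pydantic v2) raises ValidationError even when the JSON
        # is not an object (a list, a bare string); Invoice(**data) would raise
        # TypeError there and escape this handler
        return ParseResult(Invoice.model_validate(data), None)
    except ValidationError:
        return ParseResult(None, "schema")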
Tracking which failure happened tells you where to focus. Lots of "json" failures = a decoding / format problem (Lesson 1.19). Lots of "schema" failures = a training problem (the model has learned a slightly different shape).
Compute the valid-JSON rate + failure breakdown
# Run the model on the gold set (predict() from Lesson 2.7)
results = [parse(predict(tuned, ex["prompt"])) for ex in gold]
valid = [r for r in results if r.obj is not None]
n_json_fail = sum(1 for r in results if r.failure == "json")
n_schema_fail = sum(1 for r in results if r.failure == "schema")
print(f"valid-JSON rate : {len(valid)/len(results):.0%}")
print(f"json failures : {n_json_fail} ({n_json_fail/len(results):.0%})")
print(f"schema failures : {n_schema_fail} ({n_schema_fail/len(results):.0%})")
Per-field accuracy on the parses that succeed
Of the rows that did parse, how often does each field match the truth? This is where you discover that "valid JSON" can hide a model that just makes up plausible numbers.
from collections import Counter

# gold entries have a "target" dict that matches the Invoice schema
hits = Counter()
totals = Counter()
for ex, r in zip(gold, results):
    if r.obj is None:
        continue
    truth = Invoice(**ex["target"])
    for field in Invoice.model_fields:
        totals[field] += 1
        if getattr(r.obj, field) == getattr(truth, field):
            hits[field] += 1

print(f"\nper-field accuracy (over {len(valid)} valid parses):")
for field in Invoice.model_fields:
    if totals[field]:
        print(f"  {field:18s} {hits[field]/totals[field]:.0%}")
Sample output:
valid-JSON rate : 95%
json failures : 3 (3%)
schema failures : 2 (2%)
per-field accuracy (over 95 valid parses):
invoice_number 91%
date 88%
total 61%
line_items 100%
customer 96%
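One caveat about the comparison in the loop above: it uses exact equality, which is strict for float fields such as total, where a value that differs only by floating-point rounding counts as a miss. Whether that matters for your data is something to check, not something this lesson's numbers show. If it does, a tolerant comparison is one option. The helper name `fields_match` and the half-cent tolerance below are assumptions chosen for illustration.

```python
import math

def fields_match(pred, truth, rel_tol=1e-9, abs_tol=0.005):
    """Compare two field values, tolerating tiny differences between floats.

    Non-float values (strings, lists, None) fall back to exact equality.
    """
    if isinstance(pred, float) and isinstance(truth, float):
        return math.isclose(pred, truth, rel_tol=rel_tol, abs_tol=abs_tol)
    return pred == truth

print(fields_match(0.1 + 0.2, 0.3))    # rounding noise only
print(fields_match(40.0, 40.5))        # a real difference
print(fields_match("INV-1", "INV-1"))  # non-float: exact equality
```

To use it, swap the `==` in the scoring loop for a call to `fields_match`. Keep the tolerance small enough that a genuinely wrong amount is never scored as correct.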
Honest beat — the two-number report
The honest summary of this run is "95% valid-JSON, with 61% accuracy on total among valid parses." Not "95% accurate." Anyone who reads the headline "95% valid-JSON" and assumes the values are right will deploy a model that hallucinates totals in 39% of the outputs that parse. Always report the valid-JSON rate and per-field accuracy, side by side. If you only have room for one number, make it macro per-field accuracy across the full gold set, counting every field of a failed parse as wrong: parse failures and wrong values both pull that number down, so neither can hide behind the other.
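The single-number fallback can be sketched as follows. This is an illustrative version that works on plain dicts, not Pydantic objects; the function name `macro_field_accuracy` and the toy data are assumptions. A failed parse is represented as None and scores zero on every field, and the gold set is assumed to be non-empty.

```python
def macro_field_accuracy(predictions, targets, fields):
    """Per-field accuracy over the FULL gold set, plus the macro average.

    predictions: parsed dicts, or None where parsing or validation failed.
    targets:     gold dicts, same order and length as predictions.
    """
    n = len(targets)
    per_field = {}
    for field in fields:
        hits = sum(
            1
            for pred, gold in zip(predictions, targets)
            if pred is not None and pred.get(field) == gold.get(field)
        )
        # Divide by every gold example, not only the ones that parsed.
        per_field[field] = hits / n
    macro = sum(per_field.values()) / len(fields)
    return per_field, macro

# Toy data: four gold invoices, one failed parse, one wrong total.
targets = [
    {"invoice_number": "A1", "total": 10.0},
    {"invoice_number": "A2", "total": 20.0},
    {"invoice_number": "A3", "total": 30.0},
    {"invoice_number": "A4", "total": 40.0},
]
predictions = [
    {"invoice_number": "A1", "total": 10.0},
    {"invoice_number": "A2", "total": 99.0},  # wrong value
    None,                                     # failed parse
    {"invoice_number": "A4", "total": 40.0},
]

per_field, macro = macro_field_accuracy(
    predictions, targets, ["invoice_number", "total"]
)
print(per_field)  # {'invoice_number': 0.75, 'total': 0.5}
print(macro)      # 0.625
```

Here invoice_number is right in 3 of 4 examples and total in 2 of 4, so the macro average is (0.75 + 0.5) / 2 = 0.625. The failed parse costs both fields a point, which the valid-parses-only table would not show.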
Repairing failures vs retraining
When the model's outputs have JSON failures, before retraining check the cheap things:
- Decoding (Lesson 1.19): lower temperature, set a stop token, cap max_new_tokens. A model emitting prose after the JSON is often a stop-token / max-tokens issue.
- Trailing commas / partial JSON: a small "repair on parse fail" step (regex strip prose, try parse again) can rescue 50% of json failures cheaply.
- Schema drift: if many parses succeed but a specific field is consistently wrong, it's a training-data problem. Audit the training examples for that field.
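A "repair on parse fail" step along the lines described above might look like this. It is a best-effort sketch using only the standard library, and the function name `repair_and_load` is an assumption. It drops any prose before the first `{`, removes commas that sit directly before a closing brace or bracket, and lets `json.JSONDecoder.raw_decode` ignore whatever follows the first complete JSON value.

```python
import json
import re

# A comma followed by optional whitespace and a closing brace or bracket.
_TRAILING_COMMA = re.compile(r",\s*([}\]])")

def repair_and_load(text: str):
    """Try to recover a JSON object from noisy model output.

    Returns the parsed value, or None if the text still does not parse.
    Caveat: the regex also touches ',}' sequences inside string values,
    so this is a heuristic and not a general-purpose JSON fixer.
    """
    start = text.find("{")
    if start == -1:
        return None
    candidate = _TRAILING_COMMA.sub(r"\1", text[start:])
    try:
        # raw_decode parses one JSON value and ignores trailing text.
        obj, _end = json.JSONDecoder().raw_decode(candidate)
    except json.JSONDecodeError:
        return None
    return obj

noisy = 'Here you go: {"total": 40.0, "items": [1, 2,],} Hope that helps!'
print(repair_and_load(noisy))              # {'total': 40.0, 'items': [1, 2]}
print(repair_and_load('{"total": 40.0'))   # None: truncated, not repairable here
```

Run it only on outputs that already failed with "json", pass the repaired value through the same schema validation as everything else, and report how many outputs needed repair so the valid-JSON rate stays honest.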
Key idea
For structured-output models, report two numbers: the valid-JSON rate (did the parse succeed?) and per-field accuracy on the parses that did. One alone is misleading. Use Pydantic for type-checked, schema-validated parsing; track json vs schema failure modes separately so you know whether to fix decoding or data.
You now have an honest evaluation for both classification (Lesson 2.12) and structured-output (this lesson) models. Next: multi-turn chat fine-tuning, where the loss mask needs to cover every assistant turn, not just one.
Key terms
- Pydantic: Python library for type-validated data models; BaseModel + type hints become a schema and a parser in one.
- Valid-JSON rate: Fraction of model outputs that both parse as JSON and match the schema; the first half of the structured-output report.
- Schema validation: Checking that parsed JSON has the required fields with the expected types; distinct from "parseable JSON."
- Per-field accuracy: Fraction of times each schema field has the correct value, measured on the parses that succeeded.
- JSON vs schema failure: Two distinct failure modes: the string isn't valid JSON, or it parses but doesn't match the schema. Track them separately to know what to fix.
- Repair on parse fail: Cheap regex-strip-and-retry step that rescues a meaningful fraction of JSON failures without retraining.