Track 2 · Hands-on · Lesson 13

Structured outputs with pydantic

After this lesson you can validate a fine-tuned model's JSON outputs against a Pydantic schema, compute the valid-JSON rate, measure per-field accuracy on the parses that succeed, and report both — the honest two-number summary for any structured-output model.

Level: intermediate · Read time: ~9 min · Prerequisites: Real metrics with sklearn & HF evaluate

A huge fraction of useful SLM work emits JSON: extracting fields from an invoice, classifying intent with a confidence and a routed tool, parsing structured details from free text. For those models the question "is the output right?" splits into two: did it parse? and do the fields match? A model with 95% parse rate and 60% field accuracy on what does parse is, in production, broken — and reporting only one of those two numbers hides that.

Why "valid JSON" alone isn't enough

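Three things can go wrong with a structured output: the text is not valid JSON at all; it parses as JSON but does not match the schema (a missing field, a wrong type); or it matches the schema but the values are wrong.

You need a validator that catches the first two and a comparison that scores the third. Pydantic gives you the validator; you build the comparison.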

Define the schema with Pydantic

A Pydantic model describes the shape: field names, types, required vs optional, nested structures. Validation is one call.

from pydantic import BaseModel, ValidationError
from typing import List, Optional

class LineItem(BaseModel):
    description: str
    amount: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    total: float
    line_items: List[LineItem] = []
    customer: Optional[str] = None

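Required fields have no default. Optional fields have a default (often None). Types are checked at parse time; in Pydantic's default (lax) mode, a compatible value such as the string "12.50" is coerced to the declared float rather than rejected. Nested types compose.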
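To see those rules in action, here is a minimal sketch that feeds three hand-written payloads (hypothetical, not taken from the gold set) to the schema above. It assumes Pydantic v2, where model_validate and ValidationError.errors() are available.

```python
from typing import List, Optional
from pydantic import BaseModel, ValidationError

class LineItem(BaseModel):
    description: str
    amount: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    total: float
    line_items: List[LineItem] = []
    customer: Optional[str] = None

# 1. Only the required fields: optional ones fall back to their defaults.
ok = Invoice.model_validate(
    {"invoice_number": "INV-001", "date": "2024-03-01", "total": 42.5}
)

# 2. Lax mode coerces a numeric string to float instead of rejecting it.
coerced = Invoice.model_validate(
    {"invoice_number": "INV-002", "date": "2024-03-02", "total": "19.99"}
)

# 3. A missing required field and a bad nested value are both reported,
#    each with the path ("loc") of the offending field.
try:
    Invoice.model_validate(
        {"invoice_number": "INV-003", "date": "2024-03-03",
         "line_items": [{"description": "widget", "amount": "n/a"}]}
    )
    bad_locs = []
except ValidationError as e:
    bad_locs = sorted(err["loc"] for err in e.errors())

print(ok.customer, ok.line_items)   # defaults filled in
print(coerced.total)                # a float, not a string
print(bad_locs)                     # paths of the two failing fields
```

Note that date is declared as str, so any string passes validation there; a malformed date only shows up later, in per-field accuracy.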

Parse with validation

Wrap parsing in a single function that returns a result holding either the validated object or the reason parsing failed. The two failure modes (bad JSON, bad schema) raise distinct exceptions; record which one it was.

import json
from dataclasses import dataclass

@dataclass
class ParseResult:
    obj: Optional[Invoice]
    failure: Optional[str]    # "json", "schema", or None on success

def parse(text: str) -> ParseResult:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return ParseResult(None, "json")
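    try:
        # model_validate raises ValidationError for non-object JSON too
        # (e.g. a list or null); Invoice(**data) would raise TypeError there.
        return ParseResult(Invoice.model_validate(data), None)
    except ValidationError:
        return ParseResult(None, "schema")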

Tracking which failure happened tells you where to focus. Lots of "json" failures = a decoding / format problem (Lesson 1.19). Lots of "schema" failures = a training problem (the model has learned a slightly different shape).
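As a quick check of how outputs get sorted into the two buckets, the sketch below runs a self-contained copy of parse over four made-up strings. The reduced two-field Invoice and the sample strings are illustrative assumptions, and the code assumes Pydantic v2 (it uses model_validate so that a top-level JSON list is reported as a schema failure rather than crashing).

```python
import json
from dataclasses import dataclass
from typing import Optional
from pydantic import BaseModel, ValidationError

class Invoice(BaseModel):          # reduced schema, for illustration only
    invoice_number: str
    total: float

@dataclass
class ParseResult:
    obj: Optional[Invoice]
    failure: Optional[str]         # "json", "schema", or None on success

def parse(text: str) -> ParseResult:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return ParseResult(None, "json")
    try:
        return ParseResult(Invoice.model_validate(data), None)
    except ValidationError:
        return ParseResult(None, "schema")

samples = {
    "good":       '{"invoice_number": "INV-7", "total": 12.0}',
    "truncated":  '{"invoice_number": "INV-7", "total": 12.',   # cut off mid-number
    "wrong_key":  '{"invoice_no": "INV-7", "total": 12.0}',     # required field missing
    "not_object": '["INV-7", 12.0]',                            # valid JSON, wrong shape
}
outcomes = {name: parse(text).failure for name, text in samples.items()}
print(outcomes)
```

Only the truncated string counts as a "json" failure; the other two bad outputs are well-formed JSON that the schema rejects.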

Compute the valid-JSON rate + failure breakdown

# Run the model on the gold set (predict() from Lesson 2.7)
results = [parse(predict(tuned, ex["prompt"])) for ex in gold]

valid = [r for r in results if r.obj is not None]
n_json_fail   = sum(1 for r in results if r.failure == "json")
n_schema_fail = sum(1 for r in results if r.failure == "schema")

print(f"valid-JSON rate    : {len(valid)/len(results):.0%}")
print(f"json failures      : {n_json_fail}  ({n_json_fail/len(results):.0%})")
print(f"schema failures    : {n_schema_fail}  ({n_schema_fail/len(results):.0%})")

Per-field accuracy on the parses that succeed

Of the rows that did parse, how often does each field match the truth? This is where you discover that "valid JSON" can hide a model that just makes up plausible numbers.

from collections import Counter

# gold entries have a "target" dict that matches the Invoice schema
hits = Counter()
totals = Counter()

for ex, r in zip(gold, results):
    if r.obj is None:
        continue
    truth = Invoice(**ex["target"])
    for field in Invoice.model_fields:
        totals[field] += 1
        if getattr(r.obj, field) == getattr(truth, field):
            hits[field] += 1

print(f"\nper-field accuracy (over {len(valid)} valid parses):")
for field in Invoice.model_fields:
    if totals[field]:
        print(f"  {field:18s} {hits[field]/totals[field]:.0%}")

Sample output:

valid-JSON rate    : 95%
json failures      : 3  (3%)
schema failures    : 2  (2%)

per-field accuracy (over 95 valid parses):
  invoice_number     91%
  date               88%
  total              61%
  line_items        100%
  customer           96%
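One caveat about the comparison loop: it uses ==, so two floats that differ only by rounding noise count as a miss. Whether that matters depends on the task. If it does, a tolerance-aware comparison can be swapped in for the == check. The helper name fields_match and the 0.005 tolerance below are assumptions for illustration, not part of the lesson's code.

```python
import math

def fields_match(pred, truth, abs_tol=0.005):
    """Compare two field values, allowing a small absolute tolerance for floats."""
    if isinstance(pred, float) and isinstance(truth, float):
        return math.isclose(pred, truth, rel_tol=0.0, abs_tol=abs_tol)
    # Everything else (strings, None, lists of nested models) uses plain equality,
    # so floats nested inside line_items are still compared exactly.
    return pred == truth

print(fields_match(0.1 + 0.2, 0.3))      # rounding noise is tolerated
print(fields_match(19.99, 20.99))        # a real difference is still a miss
print(fields_match("INV-1", "INV-1"))    # non-floats fall back to ==
```

Pick the tolerance from the domain (half a cent here) and state it next to the reported accuracy, since a loose tolerance inflates the number.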

Honest beat — the two-number report

The honest summary of this run is "95% valid-JSON, with 61% accuracy on total among valid parses." Not "95% accurate." Anyone who reads the headline "95% valid-JSON" and assumes the values are right will deploy a model that gets total wrong on 39% of the outputs that parse. Always report the valid-JSON rate and per-field accuracy, side by side. If you only have room for one number, make it macro per-field accuracy across the full gold set (counting failed parses as zero): that single number drops when parses fail and when values are wrong, so neither problem can hide behind the other.
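The single-number fallback can be sketched as follows. This is a minimal version that works on plain dicts, with None standing in for a failed parse; the function name macro_field_accuracy and the toy data are assumptions for illustration.

```python
from typing import List, Optional

def macro_field_accuracy(
    preds: List[Optional[dict]],   # None marks a failed parse
    truths: List[dict],
    fields: List[str],
) -> float:
    """Mean over fields of (correct values / all gold examples)."""
    if not truths or not fields:
        raise ValueError("need at least one gold example and one field")
    per_field = []
    for field in fields:
        hits = sum(
            1 for p, t in zip(preds, truths)
            if p is not None and p.get(field) == t.get(field)
        )
        per_field.append(hits / len(truths))   # failed parses count as misses
    return sum(per_field) / len(per_field)

truths = [
    {"invoice_number": "A1", "total": 10.0},
    {"invoice_number": "A2", "total": 20.0},
    {"invoice_number": "A3", "total": 30.0},
    {"invoice_number": "A4", "total": 40.0},
]
preds = [
    {"invoice_number": "A1", "total": 10.0},   # both fields right
    {"invoice_number": "A2", "total": 99.0},   # total wrong
    None,                                      # failed parse
    {"invoice_number": "A4", "total": 40.0},   # both fields right
]
score = macro_field_accuracy(preds, truths, ["invoice_number", "total"])
print(f"macro per-field accuracy (full gold set): {score:.1%}")
```

Here invoice_number scores 3/4 and total scores 2/4, so the macro average is 62.5%. With the lesson's objects, one way to build preds would be r.obj.model_dump() for successful parses and None otherwise (assuming Pydantic v2). Because the average still smooths over a single weak field, it complements the per-field table rather than replacing it.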

Repairing failures vs retraining

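When the model's outputs have JSON failures, check the cheap things before retraining. One of them is repair on parse fail: strip the text around the JSON object with a regex and retry the parse, which can rescue a meaningful fraction of JSON failures.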
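A strip-and-retry repair step might look like the sketch below. It is one possible version under stated assumptions: the function name repair_and_load is hypothetical, and the only repair attempted is cutting the text down to the span between the first "{" and the last "}", which handles markdown fences and chatty text around an otherwise valid object.

```python
import json
import re
from typing import Any, Optional

def repair_and_load(text: str) -> Optional[Any]:
    """Try json.loads; on failure, strip surrounding text and retry once.

    Returns None when nothing could be recovered. A literal JSON null also
    yields None, which is treated as a failure here.
    """
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Keep only the span from the first "{" to the last "}".
    match = re.search(r"\{.*\}", text, flags=re.DOTALL)
    if match is None:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

chatty = 'Sure! Here is the invoice: {"total": 1.5} Let me know if you need more.'
print(repair_and_load(chatty))            # object recovered from chatty output
print(repair_and_load('{"total": 1.'))    # truncated output stays a failure
```

If a step like this is used, it is worth reporting the valid-JSON rate both before and after repair, so the raw model behavior stays visible.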

Key idea

For structured-output models, report two numbers: the valid-JSON rate (did the parse succeed?) and per-field accuracy on the parses that did. One alone is misleading. Use Pydantic for type-checked, schema-validated parsing; track json vs schema failure modes separately so you know whether to fix decoding or data.

You now have an honest evaluation for both classification (Lesson 2.12) and structured-output (this lesson) models. Next: multi-turn chat fine-tuning, where the loss mask needs to cover every assistant turn, not just one.

Key terms

Pydantic
Python library for type-validated data models; BaseModel + type hints become a schema and a parser in one.
Valid-JSON rate
Fraction of model outputs that both parse as JSON and match the schema; the first half of the structured-output report.
Schema validation
Checking that parsed JSON has the required fields with the expected types; distinct from "parseable JSON."
Per-field accuracy
Fraction of times each schema field has the correct value, measured on the parses that succeeded.
JSON vs schema failure
Two distinct failure modes — the string isn't valid JSON, or it parses but doesn't match the schema; track them separately to know what to fix.
Repair on parse fail
Cheap regex-strip-and-retry step that rescues a meaningful fraction of JSON failures without retraining.
