What two numbers report a structured-output model honestly?

Valid-JSON rate (how often parses succeed) and per-field accuracy on the parses that do succeed. Either alone is misleading.

Why use Pydantic over a plain json.loads?

Pydantic validates types and required fields too — 'parseable JSON' isn't the same as 'matches the schema'.

Track 2 · Hands-on · Lesson 13

Structured outputs with Pydantic

After this lesson you can validate a fine-tuned model's JSON outputs against a Pydantic schema, compute the valid-JSON rate, measure per-field accuracy on the parses that succeed, and report both — the honest two-number summary for any structured-output model.

Level: intermediate · Read time: ~9 min · Prerequisites: Real metrics with sklearn & HF evaluate

A huge fraction of useful SLM work emits JSON: extracting fields from an invoice, classifying intent with a confidence and a routed tool, parsing structured details from free text. For those models the question "is the output right?" splits into two: did it parse? and do the fields match? A model with 95% parse rate and 60% field accuracy on what does parse is, in production, broken — and reporting only one of those two numbers hides that.

Why "valid JSON" alone isn't enough

Three things can go wrong with a structured output:

The string isn't valid JSON (missing comma, trailing text, hallucinated commentary).
The JSON is valid but the schema is wrong (missing required fields, wrong types).
The schema is right but the values are wrong (parses fine, but the invoice_number is hallucinated).

You need a validator that catches the first two and a comparison that scores the third. json.loads plus Pydantic gives you the validator; you build the comparison.

Define the schema with Pydantic

A Pydantic model describes the shape: field names, types, required vs optional, nested structures. Validation is one call.

from pydantic import BaseModel, ValidationError
from typing import List, Optional

class LineItem(BaseModel):
    description: str
    amount: float

class Invoice(BaseModel):
    invoice_number: str
    date: str
    total: float
    line_items: List[LineItem] = []
    customer: Optional[str] = None

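Required fields have no default. Optional fields have a default (often None). Types are checked at parse time; by default Pydantic also coerces compatible inputs, so the string "12.50" is accepted for a float field and becomes 12.5. Nested types compose.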
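To make those rules concrete, here is a minimal sketch that validates one good and one bad input against the schema above. The sample values are invented for illustration, and the error locations assume Pydantic's default (non-strict) settings.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    description: str
    amount: float


class Invoice(BaseModel):
    invoice_number: str
    date: str
    total: float
    line_items: List[LineItem] = []
    customer: Optional[str] = None


# Only the required fields are given; the optional ones fall back to defaults.
# The string "12.50" is coerced to a float under the default settings.
inv = Invoice(invoice_number="INV-1", date="2024-01-05", total="12.50")

# Invalid input: "total" is missing and a nested amount is not a number.
try:
    Invoice(
        invoice_number="INV-2",
        date="2024-01-06",
        line_items=[{"description": "widget", "amount": "abc"}],
    )
    error_locs = []
except ValidationError as err:
    # Each error carries a "loc" tuple that points at the offending field,
    # including the list index for nested items.
    error_locs = sorted(e["loc"] for e in err.errors())

print(inv.total, inv.line_items, inv.customer)
print(error_locs)
```

The loc tuples are what make nested schemas debuggable: a failure inside the first line item is reported at its exact path rather than as a generic "invalid invoice".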

Parse with validation

Wrap parsing in a single function that returns the validated object on success, or None plus the reason if anything went wrong. The two failure modes (bad JSON, bad schema) raise distinct exceptions; record which one it was.

import json
from dataclasses import dataclass

@dataclass
class ParseResult:
    obj: Optional[Invoice]
    failure: Optional[str]    # "json", "schema", or None on success

def parse(text: str) -> ParseResult:
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return ParseResult(None, "json")
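    try:
        # model_validate raises ValidationError even when the JSON is not an
        # object (a list, a bare number); Invoice(**data) would raise TypeError.
        return ParseResult(Invoice.model_validate(data), None)
    except ValidationError:
        return ParseResult(None, "schema")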

Tracking which failure happened tells you where to focus. Lots of "json" failures = a decoding / format problem (Lesson 1.19). Lots of "schema" failures = a training problem (the model has learned a slightly different shape).
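If schema failures dominate, it helps to know which field is drifting. The sketch below is one possible way to do that, not part of the lesson's pipeline: it counts the top-level field named in each validation error. The helper name schema_error_fields and the three raw outputs are hypothetical.

```python
import json
from collections import Counter
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    description: str
    amount: float


class Invoice(BaseModel):
    invoice_number: str
    date: str
    total: float
    line_items: List[LineItem] = []
    customer: Optional[str] = None


def schema_error_fields(text: str) -> List[str]:
    """Top-level fields that failed validation; [] if valid or not JSON at all."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return []  # a "json" failure, counted elsewhere
    if not isinstance(data, dict):
        return ["<root>"]  # valid JSON, but not an object
    try:
        Invoice(**data)
    except ValidationError as err:
        return sorted({str(e["loc"][0]) for e in err.errors()})
    return []


# Hypothetical raw model outputs: one valid, two with a renamed key.
outputs = [
    '{"invoice_number": "INV-1", "date": "2024-01-05", "total": 10.0}',
    '{"invoice_no": "INV-2", "date": "2024-01-06", "total": 20.0}',
    '{"invoice_no": "INV-3", "date": "2024-01-07", "total": "n/a"}',
]

field_failures = Counter(f for out in outputs for f in schema_error_fields(out))
print(field_failures.most_common())
```

A single field dominating this count (here invoice_number, because the model writes invoice_no instead) is the "slightly different shape" signature, and it tells you which training examples to audit first.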

Compute the valid-JSON rate + failure breakdown

# Run the model on the gold set (predict() from Lesson 2.7)
results = [parse(predict(tuned, ex["prompt"])) for ex in gold]

valid = [r for r in results if r.obj is not None]
n_json_fail   = sum(1 for r in results if r.failure == "json")
n_schema_fail = sum(1 for r in results if r.failure == "schema")

print(f"valid-JSON rate    : {len(valid)/len(results):.0%}")
print(f"json failures      : {n_json_fail}  ({n_json_fail/len(results):.0%})")
print(f"schema failures    : {n_schema_fail}  ({n_schema_fail/len(results):.0%})")

Per-field accuracy on the parses that succeed

Of the rows that did parse, how often does each field match the truth? This is where you discover that "valid JSON" can hide a model that just makes up plausible numbers.

from collections import Counter

# gold entries have a "target" dict that matches the Invoice schema
hits = Counter()
totals = Counter()

for ex, r in zip(gold, results):
    if r.obj is None:
        continue
    truth = Invoice(**ex["target"])
    for field in Invoice.model_fields:
        totals[field] += 1
        if getattr(r.obj, field) == getattr(truth, field):
            hits[field] += 1

print(f"\nper-field accuracy (over {len(valid)} valid parses):")
for field in Invoice.model_fields:
    if totals[field]:
        print(f"  {field:18s} {hits[field]/totals[field]:.0%}")

Sample output:

valid-JSON rate    : 95%
json failures      : 3  (3%)
schema failures    : 2  (2%)

per-field accuracy (over 95 valid parses):
  invoice_number     91%
  date               88%
  total              61%
  line_items        100%
  customer           96%
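One caveat about the comparison loop: it uses ==, which is exact for floats and strings. Depending on your data, that can count 12.5 vs 12.500000001, or "INV-7" vs "INV-7 ", as misses. The comparator below is a hypothetical drop-in for the == check; the tolerance of half a cent and the whitespace stripping are assumptions to adjust for your own fields.

```python
import math
from typing import Any


def fields_match(pred: Any, truth: Any, abs_tol: float = 0.005) -> bool:
    """Hypothetical comparator: tolerant for floats, whitespace-insensitive for strings."""
    if isinstance(pred, float) and isinstance(truth, float):
        # Treat amounts within half a cent of each other as equal.
        return math.isclose(pred, truth, rel_tol=0.0, abs_tol=abs_tol)
    if isinstance(pred, str) and isinstance(truth, str):
        # Ignore leading/trailing whitespace only; case still matters.
        return pred.strip() == truth.strip()
    # Everything else (None, lists, nested models) falls back to ==.
    return pred == truth


print(fields_match(0.1 + 0.2, 0.3))
print(fields_match(61.0, 16.0))
print(fields_match(" INV-7 ", "INV-7"))
```

Whichever rule you pick, state it next to the numbers: a lenient comparator and a strict one can produce different per-field accuracies on the same outputs.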

Honest beat — the two-number report

The honest summary of this run is "95% valid-JSON, with 61% accuracy on total among valid parses." Not "95% accurate." Anyone who reads the headline "95% valid-JSON" and assumes the values are right will deploy a model that gets the total wrong on 39% of the invoices it parses. Always report the valid-JSON rate and per-field accuracy side by side. If you only have room for one number, make it macro per-field accuracy across the full gold set, counting every field of a failed parse as wrong: that single number drops when either parsing or field values get worse, so neither failure can hide behind the other.
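As a sketch of that single number: divide each field's hit count by the full gold-set size instead of by the number of valid parses, then average across fields. The hit counts below are illustrative, chosen so that they round to the percentages in the sample output; they are not real results.

```python
from collections import Counter

# Illustrative counts that round to the sample output (not real results).
n_total = 100  # gold examples
n_valid = 95   # outputs that parsed and validated
hits = Counter(invoice_number=86, date=84, total=58, line_items=95, customer=91)

# Accuracy among valid parses only (what the per-field table reports).
valid_only = {f: h / n_valid for f, h in hits.items()}

# End-to-end accuracy: a failed parse contributes zero hits for every field,
# so the denominator is the full gold set.
end_to_end = {f: h / n_total for f, h in hits.items()}

macro_valid_only = sum(valid_only.values()) / len(valid_only)
macro_end_to_end = sum(end_to_end.values()) / len(end_to_end)

print(f"macro accuracy, valid parses only: {macro_valid_only:.1%}")
print(f"macro accuracy, full gold set    : {macro_end_to_end:.1%}")
```

With these counts the valid-only macro is about 87% and the full-set macro about 83%. The gap between the two is exactly the cost of the failed parses.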

Repairing failures vs retraining

When the model's outputs have JSON failures, before retraining check the cheap things:

Decoding (Lesson 1.19): lower temperature, set a stop token, cap max_new_tokens. A model emitting prose after the JSON is often a stop-token / max-tokens issue.
Trailing commas / partial JSON: a small "repair on parse fail" step (regex strip prose, try parse again) can rescue a meaningful fraction of json failures cheaply. Measure how many on your own outputs.
Schema drift: if many outputs parse as JSON but one specific field is consistently missing, misnamed, or wrong, it's a training-data problem. Audit the training examples for that field.
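A "repair on parse fail" step might look like the sketch below. It is deliberately minimal and makes assumptions: the output contains a single JSON object, prose appears only before or after it, and no string value contains a comma directly followed by a closing bracket. The function name try_repair is hypothetical.

```python
import json
import re
from typing import Any, Optional


def try_repair(text: str) -> Optional[Any]:
    """Parse text as JSON, applying cheap repairs on failure; None if still invalid."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Repair 1: keep only the span from the first "{" to the last "}".
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None  # nothing that looks like an object (e.g. truncated output)
    candidate = text[start:end + 1]
    # Repair 2: drop trailing commas before a closing brace or bracket.
    candidate = re.sub(r",\s*([}\]])", r"\1", candidate)
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


raw = 'Sure! Here is the invoice: {"invoice_number": "INV-7", "total": 12.5,} Hope that helps.'
print(try_repair(raw))
print(try_repair('{"total": 12.5'))
```

Repaired outputs still need schema validation afterwards, and it is worth reporting the valid-JSON rate both with and without the repair step so the raw failure rate stays visible.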

Key idea

For structured-output models, report two numbers: the valid-JSON rate (did the parse succeed?) and per-field accuracy on the parses that did. One alone is misleading. Use Pydantic for type-checked, schema-validated parsing; track json vs schema failure modes separately so you know whether to fix decoding or data.

You now have an honest evaluation for both classification (Lesson 2.12) and structured-output (this lesson) models. Next: multi-turn chat fine-tuning, where the loss mask needs to cover every assistant turn, not just one.

Key terms

Pydantic: Python library for type-validated data models; BaseModel + type hints become a schema and a parser in one.
Valid-JSON rate: Fraction of model outputs that both parse as JSON and match the schema; the first half of the structured-output report.
Schema validation: Checking that parsed JSON has the required fields with the expected types; distinct from "parseable JSON."
Per-field accuracy: Fraction of times each schema field has the correct value, measured on the parses that succeeded.
JSON vs schema failure: Two distinct failure modes — the string isn't valid JSON, or it parses but doesn't match the schema; track them separately to know what to fix.
Repair on parse fail: Cheap regex-strip-and-retry step that rescues a meaningful fraction of JSON failures without retraining.
