Engineering · Evaluation
Inside the task-aware eval handler dispatcher
A single "pass rate" metric is the eval equivalent of measuring car safety by average vehicle weight. Useful in exactly one bucket; misleading everywhere else. BrewSLM ships nine task handlers so the metric fits the task.
The motivating bug
A user trained a PII span-extraction model. Eval reported F1 ≈ 0.04. The model looked fine in spot-checks. What was wrong?
The eval was running classification-style F1 against the whole prediction string vs the whole reference string. The reference was a JSON entity list with three spans. The prediction was a JSON entity list with the same three spans plus one hallucinated street address. Char-level F1 was decimating the score because every extra character in the hallucinated address counted as a false positive. The model had ~92% recall on the real entities; the eval was reading 4%.
Different task, different metric. The fix is structural: pick the scoring shape based on what the task is, not on a one-size-fits-all string-comparison default.
The dispatcher
Every project carries a task_profile on its prepared manifest: classification, qa, structured_extraction, rag_qa, dpo, seq2seq, chat_sft, language_modeling, vision_language, audio_transcript, safety. Plus aliases (extraction, preference, image_captioning, etc.) that map to the same handlers.
The eval entry point resolves the profile to a handler via a registry-and-dispatcher pattern. Each handler implements two methods:
from typing import Any, Protocol

class TaskHandler(Protocol):
    profile_id: str

    def build_prompts(
        self, rows: list[dict], ctx: EvalContext,
    ) -> list[BuiltPrompt]:
        """Row → prompt + reference + extras."""
        ...

    def score(
        self, predictions: list[dict], ctx: EvalContext,
    ) -> dict[str, Any]:
        """Predictions → metric dict."""
        ...
That's the entire contract. Everything else is per-handler.
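The other half is a dict lookup. A minimal sketch of the registry side, assuming a module-level table and an alias map (resolve_handler and _ALIASES are illustrative names, not necessarily the real ones):

_REGISTRY: dict[str, type] = {}

# Aliases collapse onto a canonical profile id before lookup.
_ALIASES = {
    "extraction": "structured_extraction",
    "preference": "dpo",
    "image_captioning": "vision_language",
}

def register_handler(profile_id: str, handler_cls: type) -> None:
    _REGISTRY[profile_id] = handler_cls

def resolve_handler(task_profile: str):
    """Map a manifest task_profile (or alias) to a handler instance."""
    canonical = _ALIASES.get(task_profile, task_profile)
    try:
        return _REGISTRY[canonical]()
    except KeyError:
        raise ValueError(f"no eval handler registered for {task_profile!r}") from None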
What each handler does that the others can't
StructuredExtractionHandler · span_set mode
The fix for the PII bug. The handler reads the prediction as JSON, parses out the entity list, and computes per-class P/R/F1 + micro / macro aggregates against the reference entity list. True positive requires identical (type, start, end); off-by-one boundaries count as miss + hallucination. Same handler runs field_match mode for invoice / form extraction where the entities are field-name keyed rather than character-offset keyed.
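A sketch of the span_set core, assuming entities arrive as (type, start, end) triples after JSON parsing; the span_set_prf helper is illustrative, not the handler's actual code:

from collections import Counter

def span_set_prf(pred_spans, gold_spans):
    """Micro P/R/F1 over exact (type, start, end) matches."""
    pred, gold = Counter(pred_spans), Counter(gold_spans)
    tp = sum((pred & gold).values())   # identical triples only
    fp = sum(pred.values()) - tp       # hallucinated or boundary-shifted spans
    fn = sum(gold.values()) - tp       # missed spans
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return {"precision": p, "recall": r, "f1": f1}

# The hallucinated address is one false positive, not hundreds of
# false-positive characters:
gold = [("NAME", 0, 9), ("EMAIL", 24, 45), ("PHONE", 60, 72)]
pred = gold + [("ADDRESS", 80, 112)]
print(span_set_prf(pred, gold))  # recall 1.0, precision 0.75, f1 ≈ 0.86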
RAGHandler
Three signals because three things can be wrong: answer quality (EM/F1), faithfulness (fraction of prediction tokens grounded in the retrieved context), context recall (fraction of gold-answer tokens present in the context). Low faithfulness + high context recall = generator hallucinating. High faithfulness + low context recall = retriever missing. Both low = both broken. Both high = working. The decomposition turns "RAG is bad" into a specific Jira ticket.
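A token-overlap sketch of the two extra signals, assuming simple whitespace tokenization (the real handler likely normalizes more carefully):

def _tokens(text: str) -> set[str]:
    return set(text.lower().split())

def faithfulness(prediction: str, context: str) -> float:
    """Fraction of prediction tokens grounded in the retrieved context."""
    pred = _tokens(prediction)
    return len(pred & _tokens(context)) / len(pred) if pred else 0.0

def context_recall(reference: str, context: str) -> float:
    """Fraction of gold-answer tokens the retriever actually surfaced."""
    gold = _tokens(reference)
    return len(gold & _tokens(context)) / len(gold) if gold else 0.0

# faithfulness low + context_recall high -> generator hallucinating
# faithfulness high + context_recall low -> retriever missing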
AlignmentHandler (DPO / ORPO)
F1 against the chosen completion and F1 against the rejected completion. A row is "preference correct" if F1(prediction, chosen) > F1(prediction, rejected). Mean alignment margin across rows tells you whether DPO actually moved the model. Falls back to plain EM/F1 against the chosen completion when the rejected column is absent, so a project that transitions from SFT to DPO doesn't need an eval pack rewrite.
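A sketch of the per-row check, assuming a SQuAD-style token_f1 helper; the names are illustrative:

from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred, ref = prediction.split(), reference.split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if not overlap:
        return 0.0
    p, r = overlap / len(pred), overlap / len(ref)
    return 2 * p * r / (p + r)

def score_preference_row(prediction: str, chosen: str, rejected: str | None) -> dict:
    f1_chosen = token_f1(prediction, chosen)
    if rejected is None:                  # SFT-era pack: plain F1 fallback
        return {"f1": f1_chosen}
    f1_rejected = token_f1(prediction, rejected)
    return {
        "preference_correct": f1_chosen > f1_rejected,
        "alignment_margin": f1_chosen - f1_rejected,
    }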
QAHandler · CoT span extraction
CoT-trained models emit "…reasoning…Therefore: Paris." The handler scans for end-of-reasoning markers (Final answer:, Answer:, Therefore:, The answer is …) and scores the extracted span. Without this, SQuAD F1 against a one-word reference reports near-zero on predictions that are correct. The handler annotates each prediction with answer_span + span_marker so the UI shows what got extracted, not just the score.
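A regex sketch of the marker scan; the marker list is the documented one, the pattern itself is illustrative:

import re

# Last match wins, so the model's final restatement beats any
# intermediate "Answer:" inside the reasoning chain.
_MARKERS = re.compile(
    r"(final answer|the answer is|answer|therefore)\s*[:\-]?\s*",
    re.IGNORECASE,
)

def extract_answer_span(prediction: str) -> dict:
    matches = list(_MARKERS.finditer(prediction))
    if not matches:
        return {"answer_span": prediction.strip(), "span_marker": None}
    last = matches[-1]
    span = prediction[last.end():].strip().rstrip(".")
    return {"answer_span": span, "span_marker": last.group(1)}

extract_answer_span("…reasoning…Therefore: Paris.")
# {'answer_span': 'Paris', 'span_marker': 'Therefore'}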
Seq2SeqHandler
Sub-task-aware. Translation: BLEU + chrF (via sacrebleu). Summarization: ROUGE-1 / ROUGE-2 / ROUGE-L. Paraphrase: both. Plus a length_ratio on every sub-task so you can spot over- or under-generation independent of content quality. Legacy exact_match / f1 aliases keep gate policies working without a pack migration.
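A sketch of the translation branch plus the shared length_ratio, assuming sacrebleu is installed; the function shapes are illustrative:

import sacrebleu

def length_ratio(predictions: list[str], references: list[str]) -> float:
    """Total prediction length over total reference length, in words."""
    pred_len = sum(len(p.split()) for p in predictions)
    ref_len = sum(len(r.split()) for r in references)
    return pred_len / ref_len if ref_len else 0.0

def score_translation(predictions: list[str], references: list[str]) -> dict:
    # sacrebleu expects references as a list of reference sets.
    bleu = sacrebleu.corpus_bleu(predictions, [references]).score
    chrf = sacrebleu.corpus_chrf(predictions, [references]).score
    return {
        "bleu": bleu,
        "chrf": chrf,
        "length_ratio": length_ratio(predictions, references),
    }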
VisionLanguage / AudioTranscript / SafetyHandler
Captioning + VQA get CIDEr / METEOR / SPICE; speech-to-text gets WER / CER; safety gets refusal pass-rate against an injection-prompt suite. Each is a small file (~200 lines) but the modular handler boundary lets us extend without touching the eval runner.
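For the transcription piece, WER is just word-level edit distance over reference length; a minimal version (the real handler may lean on a library instead):

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Standard dynamic-programming edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution
            ))
        prev = cur
    return prev[-1] / len(ref) if ref else 0.0

word_error_rate("send the eval report", "send eval report")  # 0.25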
Adding a new handler is one file
The registry pattern keeps the surface small:
class MyHandler:
    profile_id: str = "my_task"

    def build_prompts(self, rows, ctx): ...
    def score(self, predictions, ctx): ...

register_handler("my_task", MyHandler)
Pick a stable profile id, implement the two methods, register at import time. Set the prepared manifest's task_profile to your id and the dispatcher picks it up. No edits to the eval runner, no edits to the gate engine.
The bug stayed fixed
After we shipped the span-set handler, the PII model's eval F1 went from 0.04 to 0.91 — matching what the spot-checks had been suggesting all along. The model didn't change. The metric did.
The general lesson: every time you find yourself reaching for a "weighted average" metric across heterogeneous tasks, you're probably about to invent the next "F1 decimated by hallucinated address" bug. The fix is structural separation, not a smarter average.