Tutorial 6 · End-to-end · Code review

Build a code review nitpicker with the code-review recipe

By the end of this tutorial you'll have a small language model that reads a code diff and writes 1-3 review comments in your team's voice — catching the style nits, the dropped-null-check, the missing tests, the wrong logger call, the convention drift. Not replacing human review; lifting the floor. The boring stuff gets caught automatically so your senior engineers focus on architecture and intent. Runs on a single small GPU, deployable as a GitHub Action that posts draft suggestions on every PR.

Level: intermediate
Time: ~3 hours total (most of it gold-set curation from your own PR archive)
Prerequisites: Tutorial 0 (Setup BrewSLM). Optional context: Tutorial 1 for the workflow shape, Task shapes for why instruction-SFT fits code review.

Before you start

This tutorial assumes BrewSLM is running locally at http://localhost:5173 with an admin user signed in. If you haven't done that yet, complete Tutorial 0 — Set up BrewSLM and your first project first. It takes ~15 minutes and is the prerequisite for every tutorial in this track.

You'll also want ~200 (diff, review) pairs mined from your team's GitHub PR archive; the merged PRs from the last six months are the highest-signal source you have. Public datasets work as warmup, but your own archive is what teaches the model your voice.

Terms you'll see in this tutorial

Recipe: The training-plan template you pick when creating a project. For this tutorial: code-review. Defines the base model + adapter + eval pack defaults for diff-to-review-comment generation.
Adapter: The mapping layer that converts your CSV/JSONL rows into training-ready fields. For code-review, the adapter is qa-pair — it reads (diff, review) pairs treating the diff as the input and the review comment as the output.
Gold set: Your trusted reference rows — the "exemplary" reviews from your senior engineers. Aim for 100+ rows; the recipe's min_rows_recommended is 100 but more is better when the target is voice.
Synth playbook: A platform-provided generator that expands your gold set with controlled variations. Code-review ships three: paraphrase (voice drill), hard negatives (anti-patterns), and cluster-targeted (failure-mode patching).
LLM judge: A teacher model that scores whether the candidate review hits the same critique theme as the reference review, even if the wording differs. The primary eval gate for code-review because exact-match doesn't apply when two valid reviews of the same diff use different words.
Field-match scoring: The scoring mode the code-review recipe runs in. Compares the candidate output to the reference using token-level metrics (F1, exact match) AND the LLM judge — but the judge does the heavy lifting; F1 is a sanity check.
Critique theme: What the review is fundamentally about — "missing null check", "wrong logger level", "test coverage gap", "off-spec naming convention". The LLM judge scores whether the model hit the same theme as the gold review, not whether it used the same words.
Draft suggestion mode: GitHub PR review feature where comments are posted as suggestions the reviewer can accept-with-one-click or reject. The recommended shipping pattern for a code-review model: surface the suggestion, let the human decide.

This is the BrewSLM workflow for any project where the model needs to read code and write a comment about it in someone's voice — PR review, internal lint-explanation, style-guide enforcement, security-review note generation, deprecation notice writing. The recipe stays the same; only the gold set's voice changes. Train one model per team to inherit that team's review culture.

The end state is the model you'd actually deploy alongside your existing CI: small enough to run on a self-hosted runner, fast enough to comment within seconds of the PR opening, and tuned to your team's conventions in a way no off-the-shelf tool can match.

What you'll build

A code-review model that takes a unified-diff hunk as input and produces 1-3 review comments as output. Concretely:

Input:

@@ -42,6 +42,11 @@ def handle_request(req):
     if not req.user:
         return error_response(401)

+    if req.payload == None:
+        return error_response(400)
+
+    logger.info(f"Processing request for {req.user.email}")
+
     result = process(req.payload)
     return success_response(result)

Output:

1. Use `is None` instead of `== None` for the null check — `==` calls __eq__ which can be overridden and is slower.
2. Log line leaks user email into INFO-level logs. Either drop the email or move to DEBUG to stay GDPR-compliant.
3. Consider whether a 400 is the right code here — empty payload may indicate a client retry; 422 reads better.

The model is a fine-tuned LoRA adapter on top of Qwen/Qwen2.5-Coder-1.5B-Instruct (the recipe's suggested base). Latency: 1-3 seconds per diff on a single GPU; the bigger Qwen2.5-Coder-3B-Instruct alternative reads more context and writes more nuanced reviews at 3-6 seconds. Deployable as a GitHub Action that runs on PR open / PR update.

Key idea

The code-review recipe trains the model on voice, not on the universal truth of what makes a review correct. Two reviewers writing valid critiques of the same diff will say different things in different words; the goal isn't to predict the canonical review, it's to write reviews that sound like your team. That's why the eval pack leans on the LLM judge — it scores whether the candidate hits the same critique theme as the reference, regardless of phrasing.

Why a small model (not GPT-4, not regex)

Three options exist for automated PR review. Use this comparison:

Approach	Quality	Latency in CI	Cost at scale	Privacy	Voice match
Lint / regex (eslint, ruff, custom plugins)	Catches the textbook patterns	<1s	Free	Self-hosted	Generic — every team gets the same comments
Frontier LLM via API (GPT-4, Claude)	Strong but verbose; tends to invent issues	5-15s per diff	$0.10-0.50 per PR (at 200 PRs/week, about 870 PRs/month, that's roughly $90-430/month)	Every diff leaves your network	Generic LLM voice — sounds like ChatGPT
Small fine-tuned model (this tutorial)	Strong on patterns your team flags; weaker on novel issues	1-3s	~$30/month on one GPU	Self-hosted; your code never leaves the network	Trained on your team's archive — sounds like your team

Lint catches what you can encode as rules; nobody has a regex that captures "this looks like a feature flag check but the flag is already removed from config" or "the test name implies behaviour the test doesn't verify". Frontier models are good at language but their latency makes them awkward in CI, their cost compounds at high PR volume, and shipping your proprietary code to a third-party API is a non-starter for many enterprise deployments. Small fine-tuned models occupy the gap: fast enough for CI, private enough for proprietary code, and — crucially — voice-tunable in a way no off-the-shelf tool is.

Choose your dataset

You need (diff, review) pairs. The diff is a unified-diff hunk (the bit between @@ -start,len +start,len @@ markers); the review is the comment text a human reviewer would write. Four sources, in order of value:

Your own internal PR review archive (by far the highest value): Mine 6 months of merged PRs from your team's GitHub via gh api repos/<org>/<repo>/pulls?state=closed + /pulls/<n>/comments, keeping only PRs whose merged_at is set (state=closed also returns PRs that were closed without merging). Filter for inline comments on diffs (not top-level PR comments). This captures voice + conventions a public dataset never will. A 50-engineer team typically produces 2-5k inline review comments per month, so 6 months gives you 12-30k candidate rows, easily enough.
The CodeReviewer dataset (warmup / pretraining): Microsoft Research's CodeReviewer paper released a corpus of millions of GitHub PR reviews across languages. Useful as a warmup pass before fine-tuning on your own data — gets the model speaking "review language" generically before you teach it your voice.
CodeSearchNet (unlabeled diffs): The CodeSearchNet code corpus is unlabeled function-level code in 6 languages. Useful if you need additional diff context for the synthetic playbooks but don't have enough archive of your own. Pair with a teacher model to generate review-like comments as a bootstrap step.
Open-source merged PRs (for solo devs / new teams): If your team is too new to have a meaningful archive, mine merged PRs from large open-source projects (Linux kernel, Postgres, React, Django, Rust compiler). Public, plentiful, and the reviews are typically high-quality. Trade-off: you inherit the open-source project's voice, not yours.
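The archive-mining step for the first source can be sketched as a pure transformation over the JSON returned by the pull-request review-comment endpoint. This is a minimal sketch, assuming each comment object carries `diff_hunk`, `body`, and `user.login` as in the GitHub REST API; the function names and the extra `author` key are illustrative choices, not part of BrewSLM.

```python
import json

def comments_to_rows(comments):
    """Convert GitHub review-comment objects into (diff, review) rows.

    `comments` is the parsed JSON list from
    `gh api repos/<org>/<repo>/pulls/<n>/comments`.
    """
    rows = []
    for c in comments:
        diff = (c.get("diff_hunk") or "").strip("\n")
        review = (c.get("body") or "").strip()
        if not diff or not review:
            continue  # no hunk or no text: not usable as a training pair
        rows.append({
            "diff": diff,
            "review": review,
            # Kept so bots can be filtered and splits keyed by author later.
            "author": (c.get("user") or {}).get("login", ""),
        })
    return rows

def to_jsonl(rows):
    """Serialise rows as JSONL, one object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in rows)

# Hand-written sample objects standing in for a real API response.
sample = [
    {"diff_hunk": "@@ -1,2 +1,3 @@\n a = 1\n+b = None", "body": "Why None here?",
     "user": {"login": "alice"}},
    {"diff_hunk": "", "body": "Top-level note", "user": {"login": "bob"}},
]
rows = comments_to_rows(sample)
print(to_jsonl(rows))
```

Keeping the author on each row costs nothing at this stage and makes the later bot filtering and key-based splitting straightforward.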

How many rows do you need?

100 high-quality rows is the recipe's minimum. 300-500 is comfortable. Past 1000 you hit diminishing returns — voice converges around 500 rows. Filter aggressively before training: half your archive will be "LGTM" or "nit: fix this typo" — those are signal-weak. The valuable rows are the 1-3 sentence critiques that explain a real concern.

Ingest and map

In BrewSLM, create a new project: Projects → New Project → code-review recipe. The recipe pre-fills the adapter pick (qa-pair), task profile (instruction_sft), scoring mode (field_match), and points the eval pack scaffold at the code-review template (LLM-judge headline gate, F1 as a sanity check).

Open Data Studio → Import. Drop your CSV or JSONL. The expected shape is two required fields plus an optional rationale:

{
  "diff": "+ if (user.id == null) {\n+     return;\n+ }",
  "review": "Use `=== null` if only null should bail out here; `== null` also matches undefined, so keep it only if that is intended.",
  "rationale": "JS equality nit; loose equality against null matches both null and undefined."
}

The mapping picker will scan the columns and match them against the recipe's shape signature (input columns named like diff/patch/code/file/snippet, output columns like review/comment/feedback/suggestion). Click Apply mapping when the preview looks right.
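Before importing, it can help to check the file against the expected shape yourself. The sketch below is a hypothetical pre-import check, not a BrewSLM feature: it verifies the two required fields and uses a rough heuristic (lines starting with `+`, `-`, or `@@`) to spot a swapped diff/review pair.

```python
REQUIRED = ("diff", "review")

def looks_like_diff(text):
    """Heuristic: at least one line starts with '+', '-', or '@@'."""
    return any(line.startswith(("+", "-", "@@")) for line in text.splitlines())

def check_row(row):
    """Return a list of problems for one row; an empty list means it looks fine."""
    problems = []
    for field in REQUIRED:
        value = row.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty '{field}'")
    if problems:
        return problems
    if not looks_like_diff(row["diff"]):
        problems.append("'diff' has no +/-/@@ lines")
        if looks_like_diff(row["review"]):
            problems.append("'diff' and 'review' may be swapped")
    return problems

good = {"diff": "+ if (user.id == null) {\n+     return;\n+ }", "review": "Use `===`."}
swapped = {"diff": good["review"], "review": good["diff"]}
for row in (good, swapped, {"diff": "+x"}):
    print(check_row(row))
```

The heuristic will misfire on reviews written as `-` bullet lists, so treat a "may be swapped" hit as a prompt to look at the row, not as a verdict.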

✓ Checkpoint: the Data Studio Overview now shows your imported row count, and the Sources panel lists your import with a green status badge. The goal ledger's data_ready row turns green once row count crosses 100; below that it shows amber with a "more data recommended" hint. If the mapping is wrong, the preview rows will be empty or mis-shaped — click Edit mapping on the panel and pick the correct columns.

Don't strip the diff context

It's tempting to feed just the changed lines (the +/- rows) to the model. Resist that. The hunk header (@@ -42,6 +42,11 @@ def handle_request(req):) and the surrounding unchanged lines are critical context — a critique like "this null check is redundant because line 41 already validated req.user" requires the model to see line 41. Keep at least 3 lines of context above and below; the standard git diff -U3 format is the right default.
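A quick way to find rows that lost their context is to count the unchanged lines around the change. This is an illustrative sketch with hypothetical helper names; it treats every non-header line that does not start with `+` or `-` as context, and it will flag hunks at the very start or end of a file, which legitimately have fewer context lines.

```python
def context_line_counts(hunk):
    """Count context lines before the first and after the last changed line."""
    body = [l for l in hunk.splitlines() if not l.startswith("@@")]
    changed = [i for i, l in enumerate(body) if l.startswith(("+", "-"))]
    if not changed:
        return (len(body), 0)
    before = changed[0]
    after = len(body) - 1 - changed[-1]
    return (before, after)

def has_enough_context(hunk, minimum=3):
    """True if the hunk keeps its @@ header and `minimum` context lines each side."""
    has_header = hunk.lstrip().startswith("@@")
    before, after = context_line_counts(hunk)
    return has_header and before >= minimum and after >= minimum

full = "@@ -1,7 +1,8 @@\n a\n b\n c\n+new\n d\n e\n f"
stripped = "+new"
print(has_enough_context(full), has_enough_context(stripped))
```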

Cleanup

Open Data Studio's Quality & Safety panel. For code-review data, the cleanup pass is mostly about noise removal:

Strip auto-generated comments. Your archive will include CI-bot comments ("✅ Build passed", "Codecov: -0.2%"), Dependabot dependency-bump notices, Snyk security-scan summaries, and Copilot suggestions accepted as comments. None of these are reviews; all of them are noise. Filter by author handle (drop rows where the commenter's handle ends in [bot]) and by content heuristics (drop rows starting with markdown badge images).
Drop LGTM-only rows. "LGTM", "ship it", "👍", "approve" — these are signal-zero for training a model to write reviews. They're useful as hard negatives (see the synth section below) but should not be in your positive gold set.
Normalise diff whitespace. Mixed line endings (CRLF vs LF), trailing whitespace on diff lines, smart-quotes in review text. The platform's dedup signal in Data Studio will catch many of these as "near-duplicates" — review one cluster at a time.
Deduplicate near-identical reviews. The same engineer often writes the same review on similar diffs ("add a docstring", "extract this magic number"). Keep one canonical instance per pattern — train/test leakage on duplicate reviews inflates F1 scores artificially.
PII in commit history. Reviews sometimes quote internal Slack messages, customer names, or production credentials accidentally pasted in. Walk the Quality & Safety panel's PII flags and decide per-row.

Skip nothing here. The code-review model inherits whatever signal you don't filter — if half your training rows are bot comments, the model will learn to write bot comments.
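Several of the cleanup rules above can be applied in a script before import. The following is a rough sketch under stated assumptions: rows carry `author`, `diff`, and `review` keys (the `author` key is hypothetical), the low-signal phrase list is a starting point you would extend, and the duplicate check is a crude exact match on lower-cased review text, not the platform's near-duplicate signal.

```python
LOW_SIGNAL = {"lgtm", "ship it", "approve", "approved", "👍", "+1"}
SMART_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})

def is_bot(author):
    return author.endswith("[bot]")

def is_low_signal(review):
    return review.strip().strip(".!").strip().lower() in LOW_SIGNAL

def is_badge(review):
    # Markdown badge images look like ![alt](url) or [![alt](url)](link).
    return review.lstrip().startswith(("![", "[!["))

def normalise(row):
    """Unify line endings, strip trailing whitespace, replace smart quotes."""
    diff = row["diff"].replace("\r\n", "\n").replace("\r", "\n")
    diff = "\n".join(line.rstrip() for line in diff.split("\n"))
    review = row["review"].translate(SMART_QUOTES).strip()
    return {**row, "diff": diff, "review": review}

def clean(rows):
    seen, kept = set(), []
    for row in rows:
        if is_bot(row["author"]) or is_low_signal(row["review"]) or is_badge(row["review"]):
            continue
        row = normalise(row)
        key = row["review"].lower()
        if key in seen:
            continue  # keep one instance per identical review text
        seen.add(key)
        kept.append(row)
    return kept

rows = [
    {"author": "dependabot[bot]", "diff": "+x", "review": "Bump foo"},
    {"author": "alice", "diff": "+x", "review": "LGTM!"},
    {"author": "bob", "diff": "+x = 1  \r\n+y = 2", "review": "Extract this \u201cmagic\u201d number."},
    {"author": "carol", "diff": "+z = 3", "review": "extract this \"magic\" number."},
    {"author": "ci", "diff": "+x", "review": "[![cov](https://example.invalid/b.svg)](https://example.invalid)"},
]
cleaned = clean(rows)
print(cleaned)
```

Rows dropped as low-signal here are worth saving to a separate file, since the hard-negatives playbook later pairs empty-praise reviews with problematic diffs.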

Pick the recipe: code-review or qa-sft or classification?

BrewSLM ships several recipes that could plausibly handle review-style tasks. Use this decision table:

Question	code-review	qa-sft	classification
Output is free-text 1-3 sentences?	✓ (the recipe's primary shape)	✓ (works but no diff-specific scaffold)	✗ (classification emits a label, not text)
"Correct" review varies between reviewers?	✓ (LLM-judge eval handles many-valid-answers)	(F1-led eval penalises paraphrase)	(forced into a label vocabulary)
Need to capture team voice / conventions?	✓ (gold seeded from your archive)	(possible but no voice-targeted playbooks)	✗ (label vocabulary can't hold voice)
Diff is the input shape?	✓ (recipe defaults to `diff` column)	(generic input)	(generic input)
Decision is "block / approve / nit"?	(model produces text; classification fits decisions)	(awkward)	✓ (3-class label)
Just want a yes/no "is this code risky"?	✗ (overkill)	✗	✓ (binary classifier)

For the canonical "comment on diffs in your team's voice" use case: code-review wins because (a) the LLM-judge eval handles the "many valid reviews" problem that breaks F1-led recipes, (b) the scaffold expects a diff input + review output without forcing you to invent mapping, and (c) the three code-review playbooks target voice, anti-patterns, and failure clusters specifically. Sticking with code-review for the rest of this tutorial.

Domain packs (the language-specific gap)

BrewSLM doesn't ship a code-review domain pack today — the platform's curated packs (legal, support, ecommerce, healthcare) are content domains, not engineering domains. For code-review you're operating on the recipe defaults, which is fine for v1.

A worthwhile follow-up project is to build a custom pack that bundles your team's style guide as prompt context — the synth playbooks accept extra context, so a style_guide.md excerpt injected into the playbook prompt teaches the synth-generated reviews to align with your conventions. The pack would also bundle: cleaning rules tuned to your CI bot author handles, an LLM-judge rubric that calls out your specific conventions ("does the review match our naming convention for boolean flags?"), and language-specific severity ladders ("nit / suggestion / blocker" for JS vs "info / warn / error" for Go).

If you're shipping this to a security team or to multiple language verticals (one model per language), packaging the conventions as a domain pack pays back fast. The platform's pack framework is general-purpose; the recipe is one input among many.

Build the gold set — exemplary reviews + LLM-assisted promotion

The gold set is where the model's voice comes from. Two complementary paths, both worth running:

Path A — manual seeding from "exemplary" reviews

Pick your 3-5 senior engineers (the ones whose reviews other engineers learn from). Walk their last 50 PR reviews each. Grab the inline comments where:

The reviewer raised a real concern with a concrete suggestion.
The PR author acknowledged the comment (replied "good catch" / "fixed in next push" / made the change).
The comment is 1-3 sentences — long enough to explain, short enough to read in a glance.

Open the Gold Set workbench (Data Studio → Gold Set). For each row paste the diff context (4-6 lines around the comment line) into diff, paste the comment verbatim into review, and add a one-line rationale tag (style-nit, bug-suggestion, test-gap, convention, security). Spend 60 minutes here. These ~50-100 rows are the most important data in the project — they imprint voice on the rest.

Path B — LLM-assisted promotion from your full archive

Your full PR archive is too large to hand-curate. For the bulk pass, an out-of-band scoring step works well — the platform doesn't bundle a "rank by quality" UI, so you'll script this against your teacher backend directly:

Bulk-import your filtered PR review archive (post-cleanup) into the project.
Write a short script that POSTs each (diff, review) pair to your teacher (Ollama / OpenAI / Anthropic — whichever you've set up as a synth backend). Prompt the teacher to score the pair on three axes (concrete suggestion present, voice matches team norm, comment is actionable) and return a 0-3 score per axis. Aggregate the scores per row.
Sort by aggregate score; promote the top 20% into the gold set by importing them as a curated subset. The rest stay in the project's raw rows but are skipped at training time. Promotion is still an explicit user decision via the gold-set workbench — the platform won't auto-promote teacher-scored rows on its own.

This compresses 8 hours of manual gold-curation into ~90 minutes of scripting + review. The trade-off: the teacher's score is a proxy, not a verdict. Spot-check 30 of the promoted rows manually before training; if more than 20% of those feel off-voice, tighten the teacher prompt and re-rank.
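The aggregate-and-promote part of Path B can be sketched as follows. The teacher call is deliberately left as a parameter (`score_fn`), because how you reach your teacher backend depends on your setup; the axis names and the `fake_teacher` stand-in are hypothetical and only exist to make the sketch runnable.

```python
import math

AXES = ("concrete_suggestion", "voice_match", "actionable")

def aggregate(scores):
    """Sum the per-axis scores, rejecting anything outside the 0-3 range."""
    for axis in AXES:
        if not 0 <= scores[axis] <= 3:
            raise ValueError(f"{axis} score out of range: {scores[axis]}")
    return sum(scores[axis] for axis in AXES)

def promote_top(rows, score_fn, fraction=0.20):
    """Return the top `fraction` of rows by aggregate teacher score."""
    scored = [(aggregate(score_fn(row)), i, row) for i, row in enumerate(rows)]
    # Highest score first; ties keep their original order.
    scored.sort(key=lambda t: (-t[0], t[1]))
    keep = math.ceil(len(rows) * fraction)
    return [row for _, _, row in scored[:keep]]

def fake_teacher(row):
    """Stand-in for a real teacher call: longer reviews score higher."""
    n = min(3, len(row["review"].split()) // 3)
    return {axis: n for axis in AXES}

rows = [{"diff": "+x", "review": " ".join(["word"] * i)} for i in range(1, 11)]
promoted = promote_top(rows, fake_teacher)
print(len(promoted), [len(r["review"].split()) for r in promoted])
```

The output of `promote_top` is a candidate list for the spot-check described above; promotion into the gold set still happens through the workbench.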

✓ Checkpoint: the Data Studio Overview's Gold Set ready row should show ≥ 100 rows (green) or 50-99 rows (amber, "100 recommended"). Below 50 the goal ledger flags this as a blocker on the gold-set component. If the row is grey, nothing got promoted — check the Gold Set workbench page to confirm rows actually landed there.

Don't skip the rationale tags

The rationale field is optional in the schema but valuable in practice. Tagging each gold row by category (style-nit / bug-suggestion / test-gap / convention / security / perf) lets the FailureClustersPanel group eval failures by category later — "the model is strong on style nits but misses test-coverage gaps". That diagnostic is impossible without the tags.

Splitting train, validation, test — by repo or author, not random

This is where the code-review recipe diverges most sharply from Tutorials 1 and 2. Don't use a random split. Reviews by the same engineer are stylistically correlated; when the same engineer's reviews appear in both train and test, eval scores are inflated because the model can essentially "recognise" the reviewer's voice rather than generalising the critique reasoning.

The platform's Prepare Dataset surface today produces a random split off the canonical splits config (train_ratio / val_ratio / test_ratio / seed). For code-review specifically you want stronger guarantees, which means doing the partitioning before import: split your archive into three CSV/JSONL files yourself (train.jsonl / val.jsonl / test.jsonl) using a deterministic key, and import the three files as three separate datasets in the project. Three useful keys for code-review:

Split by repository. All rows from repo A go to train; all rows from repo B go to validation; all rows from repo C go to test. Best when your team works across many repos with overlapping author sets — the model has to generalise across codebases. Use when you have ≥ 5 source repos.
Split by author. Engineer A's reviews go to train; engineer B's to validation; engineer C's to test. Best when one repo dominates your archive — the model has to generalise across reviewers. Use when you have ≥ 5 distinct reviewers contributing comments.
Split by time. Reviews from Jan-Apr go to train; May to validation; June to test. The most realistic for measuring deployment readiness because production traffic is "later than training data". Use as a complementary check even when you've already done one of the above.

For a 400-row gold set, 80/10/10 produces 320 train / 40 val / 40 test. With a key-based split the per-split counts may be uneven — that's fine, the eval pack handles small validation sets gracefully and the LLM-judge scoring works at any size.
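A key-based partition can be done in a few lines before import. The sketch below is one possible approach, not the platform's method: it keeps every group (author or repo) whole and assigns groups greedily, largest first, to whichever split is furthest below its target share. The `author` key is an assumed column name.

```python
from collections import defaultdict

RATIOS = (("train", 0.8), ("val", 0.1), ("test", 0.1))

def split_by_key(rows, key, ratios=RATIOS):
    """Assign whole groups of rows (same `key` value) to train/val/test."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    # Largest groups first, ties broken by key, so the result is deterministic.
    ordered = sorted(groups.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))
    total = len(rows)
    splits = {name: [] for name, _ in ratios}
    for _, members in ordered:
        # Pick the split with the largest gap between target and current share.
        name = max(ratios, key=lambda nr: nr[1] - len(splits[nr[0]]) / total)[0]
        splits[name].extend(members)
    return splits

sizes = {"a": 40, "b": 30, "c": 15, "d": 10, "e": 5}
rows = [{"author": a, "diff": "+x", "review": f"r{i}"} for a, n in sizes.items() for i in range(n)]
splits = split_by_key(rows, "author")
print({name: len(part) for name, part in splits.items()})
```

With only a handful of groups the realised ratios drift from 80/10/10, which matches the uneven counts described above. Write each split to its own file (train.jsonl / val.jsonl / test.jsonl) and import them separately.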

Random split = inflated F1 + LLM-judge scores

If you skip this step and use the default random split, expect to see eval scores 10-15 points higher than what you'll observe in production. That gap is the leakage signal: your model "knows" the reviewer rather than the critique. Worse, you won't notice the inflation until you ship and the model starts writing reviews that don't sound like anyone on your team. Spend the extra few minutes partitioning your archive by key before import.

Generate synthetic drills

The code-review recipe ships three synthetic playbooks. Each drills a different aspect of review-quality:

code_review_paraphrase (POSITIVES_PARAPHRASE) — voice drill: Vary the wording of a review while keeping the underlying critique intact. "Use is None not == None" becomes "Prefer is None here; == goes through __eq__, which can be overridden and is slower." Same critique theme, different phrasing. Goal: the model learns that the same diff can elicit many valid review wordings, all of which should hit the same theme. Generate ~80 rows from your gold set.
code_review_hard_negatives (HARD_NEGATIVES) — what NOT to emit: Two flavours for code-review specifically: (a) "LGTM" reviews paired with diffs that have real issues — the model needs to learn that empty-praise reviews are wrong; (b) reviews that MISS the actual bug while commenting on irrelevant style nits. Both train the model on the failure modes you want to avoid: silence when speaking up matters, and pedantry that misses the real concern. Generate ~50 rows.
code_review_cluster_targeted (CLUSTER_TARGETED) — failure-mode patching: After a first training run + eval, the FailureClustersPanel groups failures into themes ("missed null checks", "wrong logger level", "off-voice"). This playbook seeds new training rows specifically targeting the weakest clusters — you point it at a cluster, it generates diffs that exercise the same pattern, and a teacher writes reviews. Skip on the first pass; run after the first eval to close the gap.

Open Data Studio → Synthetic → Playbook Center. The code-review recipe surfaces three playbook cards. Click code_review_paraphrase first, set target count to 80, pick a backend (Ollama is the free default; OpenAI / Anthropic give noticeably better paraphrases at code-review's voice). Generation runs as a background Job; the notification bell tracks progress.

Don't run cluster-targeted first

The cluster-targeted playbook needs a baseline model + an eval run to know which clusters are weak. Run paraphrase + hard-negatives first; train + eval the v1 model; then use the FailureClustersPanel's "Generate targeted drills" button to feed the weakest cluster back into cluster-targeted. This is iteration 2, not iteration 1.

Review the synth queue

Every generated row lands in the Synthetic Review Queue with review_status="pending". The queue groups rows by source playbook so you can review one category at a time. For code-review specifically, expect 30-50% soft-reject rate on the first pass — teacher models are stronger at code than at voice, so paraphrased reviews drift toward "ChatGPT voice" easily.

Per-row actions and reason tags:

Accept — the row joins training on the next prepare-dataset run. For paraphrase rows: only accept if the new wording still sounds like something your team would say. "Sounds like a teacher model" is a reason to reject.
Soft-reject with a reason tag:
- not-our-voice — too formal, too verbose, too GPT-ey. The most common rejection on paraphrase.
- misses-the-bug — review is grammatically fine but doesn't catch the real concern in the diff. Common on hard-negatives gone wrong.
- too-pedantic — review flags something true but trivial; your team wouldn't bother commenting.
- wrong-language — review references a syntax that doesn't apply to the diff's language (Python advice on JS code, etc.).
- invented-issue — review describes a problem that isn't actually in the diff; the model would learn to hallucinate concerns.
Purge by reason — group-level action that deletes all rows tagged with one reason. Periodic cleanup; the platform never auto-purges.

The review queue is also where the platform's per-row confidence score surfaces. Paraphrase rows scoring below 50% confidence almost always correlate with voice drift — highlight them and review those first.
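If you export the queue for offline triage, the ordering described above is easy to reproduce. This is a sketch over an assumed export format: `review_status` comes from the queue as described, while the `confidence` and `reason` keys and the `"soft_rejected"` status value are hypothetical names used for illustration.

```python
from collections import Counter

def triage(queue, threshold=0.5):
    """Order pending rows so low-confidence ones are reviewed first."""
    pending = [r for r in queue if r["review_status"] == "pending"]
    risky = sorted((r for r in pending if r["confidence"] < threshold),
                   key=lambda r: r["confidence"])
    rest = [r for r in pending if r["confidence"] >= threshold]
    return risky + rest

def reject_reasons(queue):
    """Count soft-rejected rows per reason tag."""
    return Counter(r["reason"] for r in queue if r["review_status"] == "soft_rejected")

queue = [
    {"id": 1, "review_status": "pending", "confidence": 0.82},
    {"id": 2, "review_status": "pending", "confidence": 0.31},
    {"id": 3, "review_status": "soft_rejected", "confidence": 0.40, "reason": "not-our-voice"},
    {"id": 4, "review_status": "pending", "confidence": 0.47},
    {"id": 5, "review_status": "soft_rejected", "confidence": 0.66, "reason": "not-our-voice"},
]
ordered = triage(queue)
reasons = reject_reasons(queue)
print([r["id"] for r in ordered], dict(reasons))
```

A reason tag that dominates the counts is a hint to adjust the playbook prompt or backend before generating another round.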

Training configuration

Open Training → New Experiment. The recipe defaults are sensible:

Base model: Qwen/Qwen2.5-Coder-1.5B-Instruct. Pre-trained on code; instruction-tuned; good fit for the diff-to-comment shape. The 1.5B size fits comfortably on an 8 GB GPU. Alternative: Qwen/Qwen2.5-Coder-3B-Instruct for teams that want longer / more nuanced reviews and have a bigger GPU (12 GB+) — noticeably better at multi-issue diffs where the model needs to spot 2-3 issues in one diff. HuggingFaceTB/SmolLM2-1.7B-Instruct is the non-coder alternative if you don't want a code-pretrained base for some reason.
Adapter: LoRA, rank 16, alpha 32, target modules q_proj,k_proj,v_proj,o_proj. Standard for instruction-SFT.
Learning rate: 2e-4. Same as the QA-shape recipes; LoRA's small parameter count tolerates the higher rate.
Epochs: 3. For 300-500 row training sets, 3 epochs hits the plateau; 5+ overfits the voice. The training panel's live-loss sparkline will plateau by step 200-300 if the dataset is healthy.
Batch size + gradient accumulation: Batch 2, accumulate 8 → effective batch 16. Diffs can be long (up to the 2-3k token context window); the smaller per-step batch keeps memory in check.

Expected runtime: 15-40 minutes on a single GPU (RTX 3060+ for the 1.5B, RTX 3090+ for the 3B), 60-90 minutes on CPU for the 1.5B (CPU is impractical for the 3B). Watch the live signals panel: if loss isn't dropping by step 100, kill the run via the experiment row's kill switch and check your gold set — the input data is probably misshapen (most common cause: diff column accidentally swapped with review column during mapping).
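To sanity-check step numbers against your own dataset size, the batch arithmetic can be worked out directly. This sketch assumes partial batches are kept (rounded up); whether the sparkline counts micro-batch steps or optimizer steps is not stated here, so both are computed.

```python
import math

def step_counts(rows, batch_size=2, grad_accum=8, epochs=3):
    """Effective batch size and total step counts for a training run."""
    micro_per_epoch = math.ceil(rows / batch_size)
    optim_per_epoch = math.ceil(micro_per_epoch / grad_accum)
    return {
        "effective_batch": batch_size * grad_accum,
        "micro_steps": micro_per_epoch * epochs,
        "optimizer_steps": optim_per_epoch * epochs,
    }

# 400 training rows with the recipe defaults (batch 2, accumulate 8, 3 epochs).
print(step_counts(400))
```

For 400 rows this gives an effective batch of 16, 600 micro-batch steps, and 75 optimizer steps, so a run that is counted in optimizer steps will finish well before "step 100".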

✓ Checkpoint: in the Training tab, your experiment row shows a live sparkline that drops from ~2.5-3.5 in the first few steps down to ~0.4-0.7 by the end. The bell shows a "training" notification with a percentage. When complete, the experiment row turns green and the experiment detail page surfaces the final loss + a "Run evaluation" button.

Read the trainability forecast

Before kicking off the training run, the platform pre-computes a trainability forecast on the goal ledger's predicted_pass row. For code-review projects the forecast is noisier than for classification or QA — the LLM-judge gate has higher inherent variance than exact-match metrics because it depends on a teacher model's scoring of theme overlap, and theme overlap is fuzzier than token overlap.

For a healthy code-review project you want:

Predicted pass probability ≥ 55%. Lower than the rag-protocol / classification thresholds because of the LLM-judge variance — a 55% prediction here is roughly equivalent to 70% on a classification project.
Gold set readiness ≥ 100% (i.e. ≥100 gold rows). Below this the forecast becomes too noisy to trust at all.
Data ready = met (training rows + mapping + key-based split all green).

If the forecast is below 40%, training will likely fail the LLM-judge gate. Add more gold + run another paraphrase round — the goal ledger's blockers panel will tell you which component is weakest.

Why the forecast is noisier for LLM-judge tasks

Trainability forecasts are calibrated on the gate type. F1 and exact-match are deterministic functions of (prediction, reference) — the forecast has a tight error band. LLM-judge introduces a second stochastic process (the judge model's output), and that stochasticity stacks with the model-under-eval's stochasticity. The forecast is still useful as a directional signal ("we're nowhere near ready" vs "we're close enough to try") but treat the absolute number with a wider error band than you'd treat a classification forecast.

Evaluation: LLM-judge as the primary gate

After training, the platform automatically evaluates against the project's eval pack. For code-review projects the scaffold gates three behaviours:

LLM-judge pass rate ≥ 0.70 (required): The headline gate. A teacher model scores each candidate review against the reference review along three axes: (a) does the candidate identify the same critique theme as the reference, (b) is the suggestion actionable and concrete, (c) does the candidate's voice match the team norm. Pass = aggregate score crosses the threshold. This is the gate you actually care about.
F1 ≥ 0.40 (not required): Token-level overlap between candidate and reference. Low because reviews paraphrase — two valid reviews of the same diff might share 30% of their tokens. Don't expect classification-style accuracy here. A model with F1 = 0.45 and LLM-judge pass = 0.78 is shipping-ready; a model with F1 = 0.75 and LLM-judge pass = 0.55 is over-memorising the gold and not generalising voice.
Safety pass rate ≥ 0.93 (not required): Catches reviews that leak PII, suggest unsafe code, or use language inappropriate for your team's communication norms. Mostly defensive; should pass easily if your gold set is clean.

The goal ledger expands the eval_pass_rate row into a per-gate breakdown — "LLM-judge 0.72/≥0.70 passed, F1 0.43/≥0.40 passed, safety 0.96/≥0.93 passed." If LLM-judge passes but F1 is high (≥ 0.70), the model is likely over-memorising your gold; a too-perfect F1 on a generative task is a warning sign, not a success.
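The gate logic above reads naturally as a small decision function. This is an illustrative sketch using the thresholds quoted in this section; the dictionary layout and function name are hypothetical and do not reflect how the platform stores its eval pack.

```python
GATES = {
    "llm_judge": {"threshold": 0.70, "required": True},
    "f1": {"threshold": 0.40, "required": False},
    "safety": {"threshold": 0.93, "required": False},
}

def evaluate(metrics, f1_warning=0.70):
    """Check each gate; only required gates decide the overall result."""
    gates = {name: metrics[name] >= g["threshold"] for name, g in GATES.items()}
    passed = all(gates[name] for name, g in GATES.items() if g["required"])
    warnings = []
    if metrics["f1"] >= f1_warning:
        # A very high F1 on a generative task suggests memorised wording.
        warnings.append("F1 is high for a generative task; check for over-memorised gold")
    return {"passed": passed, "gates": gates, "warnings": warnings}

healthy = evaluate({"llm_judge": 0.72, "f1": 0.43, "safety": 0.96})
memorised = evaluate({"llm_judge": 0.55, "f1": 0.75, "safety": 0.96})
print(healthy)
print(memorised)
```

The second case shows why F1 is only a sanity check: it clears its own threshold comfortably while the required LLM-judge gate fails and the over-memorisation warning fires.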

The FailureClustersPanel groups eval failures by theme for code-review specifically. Common cluster headers:

"missed the bug" — model commented on style while the diff had a real issue. The hard-negatives playbook targets exactly this.
"off-voice" — model wrote a syntactically-fine review that doesn't sound like your team. Add more gold from your senior engineers' archive.
"too verbose" — model wrote a 6-sentence essay when 2 sentences would do. Tighten gold to shorter exemplars; run paraphrase to compress.
"invented issue" — model flagged a concern that isn't in the diff. The most worrying cluster — train on hard-negatives that include diffs with no real issues paired with appropriate-silence reviews.

When the eval fails

Common code-review-specific failure patterns and the fix for each:

Symptom	Root cause	Fix
LLM-judge pass < 0.50, model always says "LGTM"	Hard-negatives playbook didn't run — model learned that approving is the safe default	Run code_review_hard_negatives for 50+ rows targeting LGTM-when-bug-exists. Retrain.
LLM-judge pass 0.60-0.65, FailureClustersPanel surfaces "invented issue" as dominant cluster	Model is hallucinating bugs that aren't there	Add gold rows that are diffs the team explicitly did NOT comment on — silence-is-correct exemplars. Reduce paraphrase rows that may have drifted toward invented issues.
LLM-judge pass > 0.70 but production reviews "don't sound like us"	Voice drift in synth rows you accepted	Audit the last paraphrase round — the synth queue keeps soft-rejected rows on disk. Look for patterns in `not-our-voice` rejections you might have missed. Re-curate gold.
F1 unusually high (> 0.75) but LLM-judge plateaus at 0.55	Model memorised gold's wording without learning to generalise — over-fit on small set	Reduce epochs (try 2 instead of 3). Add more diversity to gold — every gold row should be from a distinct PR.
Model adopts wrong voice register (formal where team is casual, or vice versa)	Gold set is mixed-register; teacher imprinted on the loudest signal	Filter gold to ONE register (pick the dominant team voice and stick with it). Re-train. Voice consistency matters more than coverage on the first pass.
Model misses domain-specific anti-patterns (e.g. uses incorrect logger call, doesn't flag a SQL N+1)	Anti-patterns are under-represented in gold	Use code_review_cluster_targeted seeded from the FailureClustersPanel's specific weak cluster. Generate 30+ diffs that exercise the missed pattern. Retrain.
LLM-judge pass varies 10+ points between runs without retraining	LLM-judge variance (the gate is inherently noisier than F1)	Run eval 3 times; report median. If variance persists, switch the judge backend to a stronger model (Anthropic/OpenAI vs Ollama) for more stable scoring.
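The "run eval 3 times; report median" fix in the last row can be wrapped in a small helper. In this sketch `run_eval` is a hypothetical callable that triggers one evaluation and returns the LLM-judge pass rate; the demo feeds it fixed numbers.

```python
from statistics import median

def stable_pass_rate(run_eval, runs=3):
    """Run the eval several times and report the median and the spread."""
    scores = [run_eval() for _ in range(runs)]
    return {"median": median(scores), "spread": max(scores) - min(scores), "scores": scores}

# Fixed scores standing in for three real eval runs.
fake_runs = iter([0.62, 0.74, 0.68])
result = stable_pass_rate(lambda: next(fake_runs))
print(result)
```

A spread of ten points or more across runs is the cue to try a stronger judge backend, as the table suggests.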

The honest move

Code-review is one of the harder use cases to evaluate because "correct" is fuzzy. If you've iterated three times and the LLM-judge pass rate is stuck around 0.60-0.65, the bottleneck is probably the gold set, not the training config. Adding 50 more high-quality gold rows from your senior engineers' archive will move the metric more than three more rounds of synth + retrain. Don't grind on epochs when the answer is in the data.

Ship the model

Three deployment patterns for code-review models, from least to most assertive:

Draft suggestion mode (recommended first): Deploy as a GitHub Action that posts each model-generated comment as a suggestion the human reviewer can accept-with-one-click or dismiss. This is GitHub's native review-suggestion mechanic (the syntax with ```suggestion blocks in PR comments). The model is offering input; the human is the decider. Zero false-positive cost — bad suggestions get dismissed and stay invisible to the PR author. This is where you should START.
Tee mode (review-assist): Model posts comments visibly on the PR but tagged as [bot] with a "from code-review-bot — accept or dismiss" footer. Human reviewers see the suggestions alongside their own pass. Useful once draft-suggestion-mode acceptance rates climb above ~30% and the team trusts the model's signal.
Blocking review mode (after months of tuning): Model's comments are posted as standard review comments equivalent to a human review. The PR is held until either the author addresses the comment or a human reviewer dismisses it. Don't do this until you have at least 3 months of draft-suggestion deployment data showing acceptance rate ≥ 50%. The costs are asymmetric: a bad blocking comment is expensive (it slows the team), while a bad draft suggestion that gets dismissed costs nothing.
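
The promotion thresholds in the three patterns above can be written down as a small gate, so the decision to move up a mode is explicit rather than a gut call. This is a sketch under stated assumptions: the function and mode names are hypothetical, and how you measure acceptance rate (accepted suggestions divided by posted suggestions) is up to your own tracking.

```python
def recommended_mode(acceptance_rate, months_in_draft):
    """Pick the most assertive deployment mode the evidence supports.

    acceptance_rate: fraction (0.0-1.0) of draft suggestions reviewers accepted.
    months_in_draft: months of draft-suggestion deployment data collected.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be between 0.0 and 1.0")
    # Blocking needs both a long track record and a high acceptance rate.
    if months_in_draft >= 3 and acceptance_rate >= 0.50:
        return "blocking"
    # Tee mode once acceptance climbs above roughly 30%.
    if acceptance_rate > 0.30:
        return "tee"
    # Everything else stays in draft-suggestion mode.
    return "draft"


print(recommended_mode(acceptance_rate=0.35, months_in_draft=1))
```

Treat the output as an upper bound: the gate says what the data permits, and the team still decides whether it wants to move.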

Export and deploy via the recipe's target_profile (defaults to vllm_server):

cd data/projects/<id>/exports/run-2026-06-05
./deploy-vllm.sh
# Loads Qwen2.5-Coder-1.5B + LoRA adapter on localhost:8000.
# POST /review with { "diff": "..." } returns { "review": "1. ...\n2. ..." }
# Latency: 1-3s on a single GPU.

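Wire it as a GitHub Action that runs on PR open / synchronize. The Action pulls the diff via gh pr diff, posts it to the model's /review endpoint, and writes the response back to the PR. The simplest write-back is gh pr review --comment, which submits the whole response as one top-level review comment; line-anchored suggestion blocks need GitHub's pull request reviews REST endpoint instead (callable through gh api). Self-hosted runners with GPU access keep the round-trip private; the model never sees the internet, and your diffs never leave your network.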
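
The middle step of that Action (send the diff, get comments back) might look like the sketch below. It assumes the endpoint and JSON shape described in the deploy script's comments (POST /review with a "diff" field, a "review" field in the response); the function names, the timeout, and the numbered-list parsing are illustrative assumptions, so check them against what your export actually serves.

```python
import json
import re
import urllib.request

# URL taken from the deploy script's comment; adjust for your host.
REVIEW_URL = "http://localhost:8000/review"


def request_review(diff, url=REVIEW_URL, timeout=30):
    """POST the diff and return the raw review text from the response."""
    payload = json.dumps({"diff": diff}).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["review"]


def split_comments(review):
    """Split a '1. ...\\n2. ...' review into one string per comment."""
    comments = []
    for line in review.splitlines():
        match = re.match(r"^\s*\d+\.\s+(.*\S)\s*$", line)
        if match:
            comments.append(match.group(1))
        elif comments and line.strip():
            # A wrapped line belongs to the previous numbered comment.
            comments[-1] += " " + line.strip()
    return comments
```

In the Action, `split_comments(request_review(diff_text))` would yield one string per comment, where `diff_text` is the output of gh pr diff. Keeping the parser separate from the HTTP call makes it easy to test without a running model.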

Floor-lifting, not ceiling-raising

This model exists to catch the boring stuff so humans focus on the interesting stuff. It is not a substitute for human review — a 1.5B parameter model running on a self-hosted GPU is going to miss architectural concerns, design intent issues, and anything that requires understanding the wider codebase. Keep your senior engineers in the loop. The deployment plan that turns off human review because "the model handles it now" is the deployment plan that ends up causing the outage that gets your team a meeting with the VP.

What's next

You have a deployed code-review nitpicker. Three obvious next moves:

Language-specific variants: One model handles all your languages OK; one model per language (Python, Go, TypeScript) handles each one noticeably better. Same recipe + workflow; segment your gold set by language and train per-segment. Routing happens at deploy time — the GitHub Action inspects the diff's file extensions and picks the right model.
Pull style-guide context via auto-RAG: Your team's style guide is a stable document. Index it as an auto-RAG corpus and prepend retrieved chunks to the diff at inference time. The model writes reviews informed by the exact style-guide rules — "according to section 3.2 of the team's Python guide, use pathlib not os.path for new code." Works without retraining the model.
Active learning from acceptance signal: Every draft-suggestion the human reviewer accepts is a high-signal positive example. Every dismissed suggestion is a soft-negative. Capture both via webhook from your PR platform; promote the accepted ones to gold; retrain when gold grows by ~100 rows. Over a quarter, the model converges on the suggestions your team actually finds useful — a continuous-improvement loop powered by the review process itself.
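
For the language-specific variant, deploy-time routing can be as simple as counting file extensions in the diff. The sketch below assumes a unified diff as produced by gh pr diff; the extension map, the model names, and the "general" fallback are hypothetical placeholders for whatever per-language models you train.

```python
import re
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a per-language model name.
EXTENSION_TO_MODEL = {
    ".py": "python",
    ".go": "go",
    ".ts": "typescript",
    ".tsx": "typescript",
}
DEFAULT_MODEL = "general"  # fallback when no known extension appears


def changed_files(diff):
    """Return post-change paths from the '+++ b/<path>' headers of a unified diff."""
    return re.findall(r"^\+\+\+ b/(.+)$", diff, flags=re.MULTILINE)


def pick_model(diff):
    """Route to the model whose language covers the most changed files."""
    counts = Counter()
    for path in changed_files(diff):
        model = EXTENSION_TO_MODEL.get(PurePosixPath(path).suffix.lower())
        if model:
            counts[model] += 1
    if not counts:
        return DEFAULT_MODEL
    return counts.most_common(1)[0][0]


example_diff = (
    "diff --git a/app/main.py b/app/main.py\n"
    "--- a/app/main.py\n"
    "+++ b/app/main.py\n"
    "@@ -1 +1,2 @@\n"
    " import os\n"
    "+import sys\n"
)
print(pick_model(example_diff))
```

Majority vote is one design choice among several; a mixed-language PR could instead be split per file and each part sent to its own model.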

Curious where to go after this? The tutorials hub covers other end-to-end recipes — span-extraction for structured field extraction, classification for binary security flags, rag-protocol for grounded-answer chatbots. Each one is the same workflow shape on a different recipe.

Key terms

code-review recipe: BrewSLM recipe that trains a small model to comment on code diffs in a team's voice. Uses the qa-pair adapter, instruction_sft task profile, and an LLM-judge-led eval pack scaffold.
LLM judge: A teacher model that scores whether a candidate review hits the same critique theme as the reference review, even if the wording differs. The primary eval gate for code-review because exact-match doesn't apply when many wordings are valid.
Critique theme: What the review is fundamentally about — "missing null check", "wrong logger level", "off-spec naming convention". The thing the LLM judge scores theme-overlap on.
Split by repo / author / time: Group-aware splitting strategies that prevent reviewer-voice leakage between train and test by keeping each repo, author, or time window entirely on one side of the split. A random split inflates eval scores 10-15 points by letting the model recognise reviewers rather than generalise critique reasoning.
Hard negatives (code-review flavour): LGTM reviews on diffs that have real issues, plus reviews that flag style nits while missing the actual bug. Trains the model on the failure modes (silence when speaking up matters, pedantry that misses the point) you want to avoid.
Draft suggestion mode: GitHub's PR-review feature where comments are posted as suggestions the reviewer can accept-with-one-click or dismiss. The recommended starting deployment pattern — zero false-positive cost because bad suggestions never reach the PR author.
code_review_cluster_targeted: Synthetic playbook that generates training rows targeting the weakest cluster surfaced by the FailureClustersPanel after a baseline eval. Iteration-2 tool, not iteration-1.
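
To make the "Split by repo / author / time" entry concrete, here is a minimal sketch of a group-aware split that keeps every row from one author (or repo) on the same side. It is an illustration, not BrewSLM's own splitter: the row field names ("author", "diff", "review") and the 20% test fraction are assumptions.

```python
import hashlib


def assign_split(group_key, test_fraction=0.2):
    """Deterministically map a group (author, repo, ...) to 'train' or 'test'."""
    digest = hashlib.sha256(group_key.encode("utf-8")).digest()
    # Turn the first 8 bytes of the hash into a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def split_by_group(rows, key="author", test_fraction=0.2):
    """Split rows so that no value of `key` appears in both train and test."""
    train, test = [], []
    for row in rows:
        side = assign_split(str(row[key]), test_fraction)
        (test if side == "test" else train).append(row)
    return train, test


# Hypothetical rows; real ones would come from your PR archive.
rows = [
    {"author": f"reviewer-{i % 10}", "diff": f"diff {i}", "review": f"review {i}"}
    for i in range(50)
]
train, test = split_by_group(rows)
print(len(train), len(test))
```

Hashing the group key makes the assignment stable across runs, so a reviewer never migrates between train and test as the archive grows. With few groups the realised test share can land far from the target fraction, so check the sizes before training.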
