Track 4 · Advanced · Lesson 5

Preference tuning: DPO and ORPO

After this lesson you can explain what preference data looks like, the intuition behind DPO's chosen-over-rejected objective, how ORPO combines SFT and preference in one stage, and when preference tuning helps over plain SFT.

Level: advanced · Read time: ~11 min · Prerequisites: Knowledge distillation III: did it work? Quality retained

SFT optimizes for one right answer per prompt. But many qualities you want — helpfulness, tone, safety, conciseness — aren't a single correct string; they're a preference between candidate responses. Preference tuning optimizes that directly.

Preference data: chosen vs rejected

Instead of (prompt → completion), preference data is (prompt → chosen, rejected): two responses to the same prompt, with a human (or a judge) saying which is better.

{ "prompt":   "Explain why the sky is blue to a five-year-old.",
  "chosen":   "The sky looks blue because sunlight is made of colors, and the air...",
  "rejected": "Rayleigh scattering causes shorter wavelengths to scatter more..." }

Neither response is "wrong" — but for the audience, one is clearly better. SFT can't express that; preference tuning is built for it.
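
A minimal sketch of checking rows in this shape before training, assuming they arrive as JSON lines. The `PreferencePair` class and `parse_preference_row` helper are illustrative names, not part of the platform.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def parse_preference_row(line: str) -> PreferencePair:
    """Parse one JSON line into a PreferencePair, rejecting malformed rows."""
    row = json.loads(line)
    # All three fields must be present and non-empty.
    missing = [k for k in ("prompt", "chosen", "rejected") if not row.get(k)]
    if missing:
        raise ValueError(f"missing or empty fields: {missing}")
    # Identical responses carry no preference signal.
    if row["chosen"] == row["rejected"]:
        raise ValueError("chosen and rejected are identical")
    return PreferencePair(row["prompt"], row["chosen"], row["rejected"])


example = json.dumps({
    "prompt": "Explain why the sky is blue to a five-year-old.",
    "chosen": "The sky looks blue because sunlight is made of colors...",
    "rejected": "Rayleigh scattering causes shorter wavelengths to scatter more...",
})
pair = parse_preference_row(example)
print(pair.prompt)
```

Rejecting identical pairs early is a design choice in this sketch: such a row gives a preference objective nothing to separate.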

DPO: optimize the preference directly

Classic RLHF trained a separate reward model and then optimized against it with reinforcement learning — powerful but fiddly (Track 1's objectives lesson). Direct Preference Optimization (DPO) skips the reward model: it derives a loss directly from the (chosen, rejected) pairs that increases the model's relative likelihood of the chosen response over the rejected one. It uses a frozen copy of the starting model — the reference model — as an anchor, so the policy improves preferences without drifting arbitrarily far from where it started.
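
The per-pair loss can be sketched in a few lines. This is an illustration of the standard DPO objective, not the platform's implementation; it assumes each argument is the summed token log-probability of a whole response under the policy or the frozen reference, and the function names are made up for the example.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference.
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    # The loss falls as the chosen log-ratio pulls ahead of the rejected one.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At the start of training the policy equals the reference,
# so the margin is 0 and the loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# After the policy raises the chosen response and lowers the rejected one.
later = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(start, later)
```

Only the difference of log-ratios matters, which is why no separate reward model is needed: the reference-relative log-probabilities play the role of the reward.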

recipe:
  method: dpo            # routes to the AlignmentHandler
  beta: 0.1             # pull toward the reference: lower moves further toward chosen, higher stays closer
  data: preference       # rows of {prompt, chosen, rejected}

The beta knob controls the trade-off between moving toward the preferred responses and staying close to the reference model. In the standard DPO formulation, beta scales the penalty for leaving the reference: too low and the model can overfit the preferences and degrade; too high and it barely moves.
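
To make the direction of beta concrete, here is a toy calculation. In the standard DPO derivation, the policy that maximizes reward under a beta-weighted KL penalty is proportional to the reference probability times exp(reward / beta), so a smaller beta lets the policy move further from the reference. The two-response setup and the reward values below are invented for illustration.

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """Closed-form optimum of KL-regularized reward maximization:
    pi*(y) is proportional to pi_ref(y) * exp(reward(y) / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same outcomes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


ref_probs = [0.5, 0.5]  # reference is indifferent between the two responses
rewards = [1.0, 0.0]    # toy rewards: the first response is the preferred one

for beta in (10.0, 1.0, 0.1):
    policy = optimal_policy(ref_probs, rewards, beta)
    drift = kl_divergence(policy, ref_probs)
    print(f"beta={beta}: P(chosen)={policy[0]:.4f}, KL from reference={drift:.4f}")
```

Running it shows the probability of the preferred response and the drift from the reference both growing as beta shrinks.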

ORPO: preference in a single stage

The usual pipeline is two stages: SFT first, then DPO. ORPO (Odds Ratio Preference Optimization) folds them into one — it adds a preference term to the SFT loss, so a single training run both teaches the task and aligns to preferences, without a separate reference model or a separate SFT stage. Simpler pipeline, one pass over the data.
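
A rough sketch of the combined objective for one pair, following the usual statement of ORPO: an SFT term on the chosen response plus a weighted odds-ratio term. It assumes per-token log-probabilities from the single model being trained, uses length-normalized likelihoods, and the weight name `lam` and its default are illustrative rather than taken from the platform.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def log_odds(avg_logprob: float) -> float:
    """log(p / (1 - p)) with p = exp(avg_logprob); requires avg_logprob < 0."""
    return avg_logprob - math.log(-math.expm1(avg_logprob))


def orpo_loss(chosen_token_logprobs, rejected_token_logprobs, lam=0.1):
    """SFT loss on the chosen response plus lam times the odds-ratio penalty."""
    avg_chosen = sum(chosen_token_logprobs) / len(chosen_token_logprobs)
    avg_rejected = sum(rejected_token_logprobs) / len(rejected_token_logprobs)
    # SFT term: average negative log-likelihood of the chosen response.
    sft_term = -avg_chosen
    # Preference term: -log(sigmoid(log odds ratio)), with no reference model.
    log_odds_ratio = log_odds(avg_chosen) - log_odds(avg_rejected)
    preference_term = softplus(-log_odds_ratio)
    return sft_term + lam * preference_term


chosen = [-0.2, -0.3]
rejected = [-1.0, -1.5]
print(orpo_loss(chosen, rejected))
```

Note that both terms come from the same forward passes of one model, which is what lets ORPO drop the frozen reference that DPO needs.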

On the platform

Both methods route through the AlignmentHandler (Track 3's dispatcher) and are scored with an alignment metric such as preference margin. Preflight blocks a DPO run if the preference-training dependency is missing, so you get a typed blocker before training starts rather than a mid-run crash.
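
The idea of a typed blocker can be sketched as follows. This is a hypothetical illustration only: the `PreflightBlocker` class, the `check_preference_dependency` function, and the blocker code are invented here and are not the platform's API. The module name is passed in because the lesson does not say which dependency is required.

```python
import importlib.util
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PreflightBlocker:
    """A structured reason why a run must not start (hypothetical shape)."""
    code: str
    message: str


def check_preference_dependency(module_name: str) -> Optional[PreflightBlocker]:
    """Return a blocker if the top-level module is not installed, else None."""
    # find_spec looks the module up without importing it.
    if importlib.util.find_spec(module_name) is None:
        return PreflightBlocker(
            code="missing_dependency",
            message=f"DPO requires '{module_name}', which is not installed.",
        )
    return None


blocker = check_preference_dependency("some_missing_preference_lib")
if blocker is not None:
    print(f"[{blocker.code}] {blocker.message}")
```

Returning a value instead of raising lets the caller collect every blocker and report them together before any training starts.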

When to reach for preference tuning

Key idea

Preference tuning optimizes chosen over rejected, not a single right answer. DPO does this directly (no reward model) against a frozen reference, with beta controlling how far it moves; ORPO folds preference into the SFT loss for a one-stage run. Reach for them when 'better' is a judgment, not a label.

Key terms

preference data
(prompt, chosen, rejected) triples: two responses with one marked better.
DPO
Direct Preference Optimization — derives a loss from preference pairs directly, no separate reward model.
reference model
A frozen copy of the starting model DPO anchors to, so the policy doesn't drift arbitrarily.
beta
DPO's knob for trading off movement toward chosen responses against staying near the reference; in the standard formulation, a higher beta keeps the policy closer to the reference.
ORPO
Odds Ratio Preference Optimization — adds a preference term to the SFT loss for single-stage alignment, no reference model.
AlignmentHandler
The BrewSLM task handler that runs preference objectives and scores alignment metrics.
