Preference tuning: DPO and ORPO
After this lesson you can explain what preference data looks like, the intuition behind DPO's chosen-over-rejected objective, how ORPO combines SFT and preference in one stage, and when preference tuning helps over plain SFT.
SFT optimizes for one right answer per prompt. But many qualities you want — helpfulness, tone, safety, conciseness — aren't a single correct string; they're a preference between candidate responses. Preference tuning optimizes that directly.
Preference data: chosen vs rejected
Instead of (prompt → completion), preference data is (prompt → chosen, rejected): two responses to the same prompt, with a human (or a judge) saying which is better.
{ "prompt": "Explain why the sky is blue to a five-year-old.",
"chosen": "The sky looks blue because sunlight is made of colors, and the air...",
"rejected": "Rayleigh scattering causes shorter wavelengths to scatter more..." }
Neither response is "wrong" — but for the audience, one is clearly better. SFT can't express that; preference tuning is built for it.
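A quick sanity check on preference rows before training can catch rows that carry no usable signal. The sketch below is a minimal illustration, assuming rows are stored as JSON Lines with the three fields shown above; the function names are hypothetical and not part of any platform API.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_preference_row(row: dict) -> list[str]:
    """Return a list of problems with one preference row (empty if it looks usable)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = row.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # Two identical responses give the objective nothing to prefer.
    if not problems and row["chosen"].strip() == row["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


def load_preference_rows(jsonl_text: str) -> list[dict]:
    """Parse JSON Lines text and keep only the rows that pass validation."""
    rows = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        if not validate_preference_row(row):
            rows.append(row)
    return rows
```

Dropping invalid rows silently is a simplification; a real pipeline would likely report how many rows were rejected and why.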
DPO: optimize the preference directly
Classic RLHF trained a separate reward model and then optimized against it with reinforcement learning — powerful but fiddly (Track 1's objectives lesson). Direct Preference Optimization (DPO) skips the reward model: it derives a loss directly from the (chosen, rejected) pairs that increases the model's relative likelihood of the chosen response over the rejected one. It uses a frozen copy of the starting model — the reference model — as an anchor, so the policy improves preferences without drifting arbitrarily far from where it started.
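For reference, the DPO objective as it is usually written. The notation is introduced here and is not from this lesson: x is the prompt, y_w the chosen response, y_l the rejected response, pi_theta the policy being trained, pi_ref the frozen reference model, and sigma the logistic sigmoid.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \left[
          \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
          - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
        \right]
      \right)
    \right]
```

The loss falls as the policy raises the chosen response's likelihood relative to the reference more than it raises the rejected response's.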
recipe:
method: dpo # routes to the AlignmentHandler
beta: 0.1 # strength of the anchor to the reference; higher stays closer to it
data: preference # rows of {prompt, chosen, rejected}
The beta knob sets how strongly DPO is held to the reference model. A low beta lets the policy move far toward the preferred responses, so it can overfit the preferences and degrade; a high beta keeps it so close to the reference that it barely moves.
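To make the role of beta concrete, here is a minimal sketch of the standard DPO loss for a single preference pair. It assumes you already have the summed log-probability of each response under the policy and under the reference; the function and argument names are illustrative, not a platform API.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair of sequence log-probs."""
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # beta scales the gap between the two log-ratios.
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable softplus form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# Same shift away from the reference, evaluated at two beta values.
shifted = dict(policy_chosen_logp=-8.0, policy_rejected_logp=-14.0,
               ref_chosen_logp=-10.0, ref_rejected_logp=-12.0)
loss_low_beta = dpo_loss(beta=0.1, **shifted)
loss_high_beta = dpo_loss(beta=0.5, **shifted)
```

In this formulation a larger beta makes the same shift from the reference count for more, so the loss is satisfied after a smaller move. When the policy equals the reference, the loss is log 2 regardless of beta.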
ORPO: preference in a single stage
The usual pipeline is two stages: SFT first, then DPO. ORPO (Odds Ratio Preference Optimization) folds them into one — it adds a preference term to the SFT loss, so a single training run both teaches the task and aligns to preferences, without a separate reference model or a separate SFT stage. Simpler pipeline, one pass over the data.
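The single-stage idea can be sketched as one loss with two terms. This follows the usual ORPO formulation under stated assumptions: each response is summarized by its average per-token log-probability under the model being trained, and the weight `lam` on the preference term is a hypothetical name and default chosen for illustration.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """SFT loss on the chosen response plus a weighted odds-ratio preference term."""
    # Ordinary SFT term: negative log-likelihood of the chosen response.
    sft_term = -chosen_avg_logp
    # Preference term: -log(sigmoid(log odds ratio of chosen over rejected)).
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    preference_term = softplus(-log_odds_ratio)
    return sft_term + lam * preference_term
```

Both terms use only the model being trained, which is why no frozen reference model appears anywhere in the computation.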
On the platform
Both DPO and ORPO route through the AlignmentHandler (Track 3's dispatcher) and are scored with an alignment metric such as preference margin. Preflight blocks a DPO run if the preference-training dependency is missing: a typed blocker, not a mid-run crash.
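The idea of a typed blocker can be illustrated with a generic sketch. This is not the platform's actual preflight code: the class, the function, and the module name passed in are all hypothetical, and the check works only for top-level module names.

```python
import importlib.util
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MissingDependencyBlocker:
    """A structured reason a run cannot start (hypothetical type)."""
    method: str
    module: str

    def message(self) -> str:
        return f"{self.method} run blocked: module '{self.module}' is not installed"


def preflight_dependency(method: str, module: str) -> Optional[MissingDependencyBlocker]:
    """Return a blocker if the top-level module cannot be found, else None."""
    # find_spec looks the module up without importing it.
    if importlib.util.find_spec(module) is None:
        return MissingDependencyBlocker(method=method, module=module)
    return None
```

Returning a value instead of raising lets the caller collect every blocker up front and report them together before any training starts.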
When to reach for preference tuning
- Use SFT when there's a correct answer (classification, extraction, format).
- Add DPO/ORPO when "better" is a judgment: tone, helpfulness, how the model refuses, verbosity, style.
- Prefer ORPO for a simpler one-stage pipeline; DPO when you already have a solid SFT checkpoint to align.
Key idea
Preference tuning optimizes chosen over rejected, not a single right answer. DPO does this directly (no reward model) against a frozen reference, with beta controlling how far it moves; ORPO folds preference into the SFT loss for a one-stage run. Reach for them when 'better' is a judgment, not a label.
Key terms
- preference data: (prompt, chosen, rejected) triples, that is, two responses to one prompt with one marked better.
- DPO: Direct Preference Optimization. Derives a loss from preference pairs directly, with no separate reward model.
- reference model: A frozen copy of the starting model that DPO anchors to, so the policy doesn't drift arbitrarily.
- beta: DPO's knob for how strongly the policy is held near the reference. Lower values let it move further toward chosen responses; higher values keep it closer to the reference.
- ORPO: Odds Ratio Preference Optimization. Adds a preference term to the SFT loss for single-stage alignment, with no reference model.
- AlignmentHandler: The BrewSLM task handler that runs preference objectives and scores alignment metrics.