What does preference data consist of?

(prompt, chosen, rejected) — two responses with one preferred

How does DPO differ from classic RLHF?

It optimizes preferences directly, without training a separate reward model

What does ORPO fold together?

SFT and preference optimization into a single stage (no separate reference model)

When should you prefer SFT over DPO/ORPO?

When there's a single correct answer (classification, extraction, format)

Track 4 · Advanced · Lesson 5

Preference tuning: DPO and ORPO

After this lesson you can explain what preference data looks like, the intuition behind DPO's chosen-over-rejected objective, how ORPO combines SFT and preference in one stage, and when preference tuning helps over plain SFT.

Level: advanced
Read time: ~11 min
Prerequisites: Knowledge distillation III: did it work? Quality retained

SFT optimizes for one right answer per prompt. But many qualities you want — helpfulness, tone, safety, conciseness — aren't a single correct string; they're a preference between candidate responses. Preference tuning optimizes that directly.

Preference data: chosen vs rejected

Instead of (prompt → completion), preference data is (prompt → chosen, rejected): two responses to the same prompt, with a human (or a judge) saying which is better.

{ "prompt":   "Explain why the sky is blue to a five-year-old.",
  "chosen":   "The sky looks blue because sunlight is made of colors, and the air...",
  "rejected": "Rayleigh scattering causes shorter wavelengths to scatter more..." }

Neither response is "wrong" — but for the audience, one is clearly better. SFT can't express that; preference tuning is built for it.
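A minimal sketch of how rows in this shape could be checked before training, assuming the data arrives as JSONL with the three fields shown above. The helper names `validate_preference_row` and `load_preference_rows` are hypothetical, not part of any platform API.

```python
import json

REQUIRED_KEYS = ("prompt", "chosen", "rejected")


def validate_preference_row(row):
    """Return a list of problems found in one preference row (empty if OK)."""
    problems = []
    for key in REQUIRED_KEYS:
        value = row.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {key}")
    # A pair carries no preference signal if both responses are the same text.
    if not problems and row["chosen"].strip() == row["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


def load_preference_rows(lines):
    """Parse JSONL lines and keep only the rows that pass validation."""
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        row = json.loads(line)
        if not validate_preference_row(row):
            rows.append(row)
    return rows
```

Dropping identical pairs is a design choice in this sketch: such rows give the objective nothing to separate, so they only add noise.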

DPO: optimize the preference directly

Classic RLHF trained a separate reward model and then optimized against it with reinforcement learning — powerful but fiddly (Track 1's objectives lesson). Direct Preference Optimization (DPO) skips the reward model: it derives a loss directly from the (chosen, rejected) pairs that increases the model's relative likelihood of the chosen response over the rejected one. It uses a frozen copy of the starting model — the reference model — as an anchor, so the policy improves preferences without drifting arbitrarily far from where it started.

recipe:
  method: dpo            # routes to the AlignmentHandler
  beta: 0.1             # strength of the anchor to the reference: higher stays closer, lower moves further
  data: preference       # rows of {prompt, chosen, rejected}

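The beta knob sets the trade-off between fitting the preference pairs and staying close to the reference model. In the standard DPO formulation, beta scales the penalty for deviating from the reference: set it too low and the policy can drift far from the reference, overfit the preferences, and degrade; set it too high and it barely moves.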
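For intuition, the standard DPO loss for a single pair can be written in a few lines of plain Python. This is a sketch, not the platform's implementation: it assumes you already have the summed token log-probabilities of each response under the policy and under the frozen reference, and the function names are illustrative.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from sequence log-probs."""
    # How much more the policy favors each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # beta scales the gap between the two log-ratios.
    logits = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(logits)) is the same as softplus(-logits).
    return softplus(-logits)
```

When the policy equals the reference, both log-ratios are zero and the loss is log 2. It falls as the policy raises the chosen response relative to the rejected one, measured against the reference. For a fixed positive gap, a larger beta gives a lower loss, so the optimizer has less reason to keep moving away from the reference.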

ORPO: preference in a single stage

The usual pipeline is two stages: SFT first, then DPO. ORPO (Odds Ratio Preference Optimization) folds them into one — it adds a preference term to the SFT loss, so a single training run both teaches the task and aligns to preferences, without a separate reference model or a separate SFT stage. Simpler pipeline, one pass over the data.
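The combined objective can be sketched the same way. This assumes the usual ORPO formulation: the SFT term is the negative average token log-probability of the chosen response, and the preference term is a log-sigmoid of the log odds ratio between chosen and rejected. The weight `lam` and the function names are illustrative choices, not platform settings.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def log_odds(avg_logp):
    """log(p / (1 - p)) where p = exp(avg_logp) is a length-normalized likelihood.

    avg_logp must be strictly negative so that p < 1.
    """
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(chosen_avg_logp, rejected_avg_logp, lam=0.1):
    """ORPO loss for one pair from average per-token log-probs under the policy."""
    # SFT term: negative log-likelihood of the chosen response.
    sft_term = -chosen_avg_logp
    # Preference term: -log(sigmoid(log odds ratio)), chosen vs rejected.
    log_odds_ratio = log_odds(chosen_avg_logp) - log_odds(rejected_avg_logp)
    pref_term = softplus(-log_odds_ratio)
    return sft_term + lam * pref_term
```

Only the policy's own log-probabilities appear here, which is why no reference model is needed.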

On the platform

Both DPO and ORPO route through the AlignmentHandler (Track 3's dispatcher) and are scored with an alignment metric such as preference margin. Preflight blocks a DPO run if the preference-training dependency is missing, so you get a typed blocker up front rather than a mid-run crash.
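The lesson does not define preference margin precisely. One common reading, assumed here, is the DPO-style implicit reward gap between chosen and rejected, averaged over an evaluation set, reported alongside the fraction of pairs where the gap is positive. The function below is a hypothetical sketch of that reading, not the AlignmentHandler's actual metric.

```python
def preference_margin(pairs, beta=0.1):
    """Return (mean margin, win rate) over evaluation pairs.

    Each pair is (policy_chosen, policy_rejected, ref_chosen, ref_rejected),
    all summed token log-probabilities.
    """
    margins = [
        beta * ((pc - rc) - (pr - rr))  # implicit reward gap: chosen minus rejected
        for pc, pr, rc, rr in pairs
    ]
    if not margins:
        raise ValueError("need at least one pair")
    mean_margin = sum(margins) / len(margins)
    # Share of pairs where the policy ranks chosen above rejected.
    win_rate = sum(m > 0 for m in margins) / len(margins)
    return mean_margin, win_rate
```

A mean margin that rises during training while the win rate stays flat suggests the model is widening gaps on pairs it already ranks correctly rather than fixing the ones it gets wrong.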

When to reach for preference tuning

Use SFT when there's a correct answer (classification, extraction, format).
Add DPO/ORPO when "better" is a judgment: tone, helpfulness, refusal behavior, verbosity, style.
Prefer ORPO for a simpler one-stage pipeline; DPO when you already have a solid SFT checkpoint to align.

Key idea

Preference tuning optimizes chosen over rejected, not a single right answer. DPO does this directly (no reward model) against a frozen reference, with beta controlling how far it moves; ORPO folds preference into the SFT loss for a one-stage run. Reach for them when 'better' is a judgment, not a label.

Key terms

preference data: (prompt, chosen, rejected) triples: two responses with one marked better.
DPO: Direct Preference Optimization — derives a loss from preference pairs directly, no separate reward model.
reference model: A frozen copy of the starting model DPO anchors to, so the policy doesn't drift arbitrarily.
beta: DPO's knob for trading off fitting the chosen responses against staying near the reference; higher values keep the policy closer to the reference.
ORPO: Odds Ratio Preference Optimization — adds a preference term to the SFT loss for single-stage alignment, no reference model.
AlignmentHandler: The BrewSLM task handler that runs preference objectives and scores alignment metrics.

Check yourself
