Choosing the training objective: SFT, DPO, ORPO, RLHF
After this lesson you can name the main training objectives, explain what each one optimizes, and pick the right one for a goal — including when to chain SFT with preference optimization.
SFT is one training objective: imitate a target output. That is the wrong thing to optimize when "the right answer" is comparative ("response A is better than B") rather than a single fixed string. This lesson maps the objectives so you choose deliberately. (Prompting and RAG from Track 0 are not on this map because they don't train the model at all.)
SFT: imitate the target
SFT minimizes cross-entropy on the completion tokens — it pushes the model toward reproducing the demonstrated output. Use it when each input has a clear target: a label, an extraction, a correct answer, a desired format. It is the workhorse and almost always the first objective you run.
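To make "cross-entropy on the completion tokens" concrete, here is a minimal sketch in plain Python. The function name `sft_loss`, the mask layout, and the probabilities are illustrative assumptions, not any particular library's API; real trainers compute the same quantity on tensors.

```python
import math

def sft_loss(token_log_probs, completion_mask):
    """Mean negative log-likelihood over completion tokens only.

    token_log_probs: log-probability the model assigns to each target token.
    completion_mask: 1 for completion tokens, 0 for prompt tokens (ignored).
    """
    kept = [-lp for lp, m in zip(token_log_probs, completion_mask) if m]
    if not kept:
        raise ValueError("no completion tokens to train on")
    return sum(kept) / len(kept)

# Hypothetical values: two prompt tokens followed by two completion tokens.
log_probs = [math.log(0.9), math.log(0.8), math.log(0.5), math.log(0.25)]
mask = [0, 0, 1, 1]

loss = sft_loss(log_probs, mask)
print(f"SFT loss: {loss:.4f}")
```

Only the last two tokens contribute, so the loss falls as the model assigns higher probability to the demonstrated completion, regardless of how it scores the prompt.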
Preference optimization: learn what's better
For open-ended outputs ("write a helpful reply"), there's no single correct string, but humans can say which of two responses is better. Preference pairs capture that: a prompt with a chosen and a rejected response. DPO (Direct Preference Optimization) trains the model directly on those pairs to make chosen responses more likely than rejected ones, measured relative to a frozen reference model (usually the SFT checkpoint) and without a separate reward model. It's typically applied after SFT to refine quality, style, or safety where "better" is a judgment, not a fact.
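A preference pair and the DPO loss for it can be sketched as follows. This is a simplified illustration of the standard DPO formulation, assuming you already have the summed log-probability of each response under the model being trained (the policy) and under a frozen reference model. The names `dpo_loss` and `pair`, the `beta` value, and all numbers are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """-log(sigmoid(beta * margin)) for one preference pair."""
    chosen_gain = policy_chosen_lp - ref_chosen_lp        # how much the policy raised "chosen"
    rejected_gain = policy_rejected_lp - ref_rejected_lp  # how much it raised "rejected"
    margin = beta * (chosen_gain - rejected_gain)
    return softplus(-margin)

# A preference pair is just a prompt with two ranked responses (made-up example).
pair = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model memorizes training data and fails to generalize.",
    "rejected": "Overfitting is bad.",
}

# Hypothetical sequence log-probabilities for the pair above.
loss = dpo_loss(policy_chosen_lp=-12.0, policy_rejected_lp=-15.0,
                ref_chosen_lp=-13.0, ref_rejected_lp=-14.0)
print(f"DPO loss: {loss:.4f}")
```

When the policy and the reference agree, the margin is zero and the loss is log 2. The loss drops below that only as the policy favors the chosen response more than the reference does.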
ORPO folds preference learning into the SFT step itself by adding a preference penalty to the SFT loss, so there is one stage instead of SFT-then-DPO and no separate reference model. This can be simpler when you have preference data up front.
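The single-stage idea can be sketched as an SFT term plus a weighted odds-ratio term. This follows the commonly described ORPO formulation under simplifying assumptions: each response is summarized by its average per-token log-probability, and the weight `lam` and the input numbers are hypothetical.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def log_odds(avg_log_prob):
    """log(p / (1 - p)) where p = exp(avg_log_prob); requires avg_log_prob < 0."""
    return avg_log_prob - math.log1p(-math.exp(avg_log_prob))

def orpo_loss(chosen_avg_lp, rejected_avg_lp, lam=0.1):
    """SFT loss on the chosen response plus a weighted odds-ratio penalty."""
    sft_term = -chosen_avg_lp  # imitate the chosen response
    log_odds_ratio = log_odds(chosen_avg_lp) - log_odds(rejected_avg_lp)
    preference_term = softplus(-log_odds_ratio)  # -log(sigmoid(log_odds_ratio))
    return sft_term + lam * preference_term

# Hypothetical average per-token probabilities: 0.6 for chosen, 0.3 for rejected.
loss = orpo_loss(math.log(0.6), math.log(0.3))
print(f"ORPO loss: {loss:.4f}")
```

Unlike the DPO setup, nothing here refers to a second model: both terms are computed from the model being trained.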
RLHF: the heavyweight ancestor
RLHF (Reinforcement Learning from Human Feedback) trains a separate reward model from preferences, then optimizes the language model against it with reinforcement learning (typically PPO). It is powerful but complex and finicky to stabilize. For most teams, DPO achieves similar ends with far less machinery, which is why it has largely displaced RLHF for fine-tuning small models.
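The two RLHF stages can be sketched at the level of their objectives. This is an illustration under common assumptions, not a working RL loop: the reward model is trained with a pairwise loss, and the RL stage maximizes the reward minus a penalty for drifting from a reference model. The penalty, the function names, and `beta` are assumptions added for illustration and are not taken from this lesson.

```python
import math

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def reward_model_loss(reward_chosen, reward_rejected):
    """Stage 1: pairwise loss, -log(sigmoid(r_chosen - r_rejected))."""
    return softplus(-(reward_chosen - reward_rejected))

def penalized_reward(reward, policy_lp, ref_lp, beta=0.1):
    """Stage 2: the quantity RL maximizes for one sampled response.

    The reward model's score is reduced when the policy assigns the response
    a much higher log-probability than the reference model does.
    """
    return reward - beta * (policy_lp - ref_lp)

# Hypothetical scalar scores from a reward model for one preference pair.
rm_loss = reward_model_loss(reward_chosen=1.5, reward_rejected=0.2)

# Hypothetical sampled response: reward 1.5, policy has drifted from the reference.
shaped = penalized_reward(reward=1.5, policy_lp=-10.0, ref_lp=-12.0)
print(f"reward-model loss: {rm_loss:.4f}, penalized reward: {shaped:.4f}")
```

The extra machinery is visible even in this toy form: a second model to train, sampled responses to score, and a penalty coefficient to tune.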
Continued pretraining: build base ability
If the problem is that the base model doesn't speak your domain's language at all, continued pretraining — more raw next-token training on a large domain corpus — comes before SFT. It's expensive and rarely the first move; most projects start from an existing base and skip straight to SFT.
Decision guide
- One right answer per input → SFT.
- "A is better than B" judgments to refine an already-decent model → DPO after SFT (or ORPO, which does both in one stage).
- A whole new domain's language, with budget → continued pretraining, then SFT.
- Reach for RLHF only if you specifically need a reward model.
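The guide above can be read as a small decision function. This is a deliberately simplified sketch: the function name, its flags, and the stage labels are hypothetical, and it leaves out ORPO, which would replace the SFT and DPO stages with a single one.

```python
def choose_objective(has_single_target, has_preference_pairs,
                     needs_domain_language=False, needs_reward_model=False):
    """Return an ordered list of training stages for a project."""
    stages = []
    if needs_domain_language:
        # Base model does not speak the domain's language: build that first.
        stages.append("continued pretraining")
    if has_single_target or has_preference_pairs or needs_domain_language:
        # SFT is almost always the first fine-tuning objective.
        stages.append("SFT")
    if has_preference_pairs:
        # Comparative judgments refine the SFT model.
        stages.append("RLHF" if needs_reward_model else "DPO")
    return stages

print(choose_objective(has_single_target=True, has_preference_pairs=False))
print(choose_objective(has_single_target=True, has_preference_pairs=True))
print(choose_objective(True, False, needs_domain_language=True))
```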
The overwhelming majority of practical projects are SFT, occasionally followed by DPO. This Academy focuses on SFT and returns to alignment (DPO/ORPO) as an advanced topic in Track 4. Next, we dissect a single SFT example.
Key terms
- Training objective: What the loss optimizes, e.g. imitate a target (SFT) or prefer one response over another (DPO).
- DPO: Direct Preference Optimization: train on (chosen, rejected) pairs to prefer better responses, no reward model.
- ORPO: An objective that combines SFT and preference learning in a single stage.
- RLHF: Reinforcement Learning from Human Feedback: train a reward model, then optimize with RL; powerful but complex.
- Preference pair: A prompt with a chosen and a rejected response, used by DPO/ORPO/RLHF.
- Continued pretraining: Further raw next-token training on a domain corpus to build base ability, before SFT.