Track 1 · SFT fundamentals · Lesson 2

Choosing the training objective: SFT, DPO, ORPO, RLHF

After this lesson you can name the main training objectives, explain what each one optimizes, and pick the right one for a goal — including when to chain SFT with preference optimization.

Level: beginner · Read time: ~10 min · Prerequisites: What is Supervised Fine-Tuning?

SFT is one training objective: imitate a target output. That is the wrong thing to optimize when "the right answer" is comparative ("response A is better than B") rather than a single fixed string. This lesson maps the objectives so you can choose deliberately. (Prompting and RAG from Track 0 are not on this map, because they don't train the model at all.)

SFT: imitate the target

SFT minimizes cross-entropy on the completion tokens — it pushes the model toward reproducing the demonstrated output. Use it when each input has a clear target: a label, an extraction, a correct answer, a desired format. It is the workhorse and almost always the first objective you run.
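To make "cross-entropy on the completion tokens" concrete, here is a minimal sketch in plain Python. It assumes you already have the log-probability the model assigned to each target token; the function name `sft_loss`, the mask layout, and the example numbers are illustrative assumptions, not any particular library's API.

```python
import math

def sft_loss(token_logprobs, is_completion):
    """Mean negative log-likelihood over completion tokens only.

    token_logprobs: log-probability the model gave each target token.
    is_completion:  True for completion tokens, False for prompt tokens.
    """
    # Prompt tokens are masked out: they contribute nothing to the loss.
    losses = [-lp for lp, keep in zip(token_logprobs, is_completion) if keep]
    if not losses:
        raise ValueError("no completion tokens to train on")
    return sum(losses) / len(losses)

# Hypothetical 5-token sequence: 2 prompt tokens, then 3 completion tokens.
logprobs = [math.log(0.9), math.log(0.8),
            math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True, True]

loss = sft_loss(logprobs, mask)
print(round(loss, 4))
```

The loss falls as the model assigns higher probability to the demonstrated completion, which is what "imitate the target" means in practice.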

Preference optimization: learn what's better

For open-ended outputs ("write a helpful reply"), there is no single correct string, but humans can say which of two responses is better. A preference pair captures that judgment: a prompt with a chosen and a rejected response. DPO (Direct Preference Optimization) trains the model directly on those pairs, with no separate reward model, to make chosen responses more likely than rejected ones. It is typically applied after SFT to refine quality, style, or safety where "better" is a judgment, not a fact.
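A rough sketch of the DPO loss for a single preference pair follows. The standard DPO formulation also measures each response against a frozen reference model, usually the SFT checkpoint, which this lesson does not cover in detail. The function names, the example pair, and the value of `beta` are assumptions chosen for illustration.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities.

    policy_*: log-probs from the model being trained.
    ref_*:    log-probs from the frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen - ref_chosen
    rejected_gain = policy_rejected - ref_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)) == softplus(-margin)
    return softplus(-margin)

# A hypothetical preference pair: one prompt, a chosen and a rejected response.
pair = {
    "prompt": "Explain overfitting in one sentence.",
    "chosen": "Overfitting is when a model memorizes training data and generalizes poorly.",
    "rejected": "Overfitting is bad.",
}

# Before training the policy equals the reference, so the margin is zero.
print(round(dpo_loss(-10.0, -10.0, -10.0, -10.0), 4))
# After the policy shifts probability toward the chosen response, loss drops.
print(round(dpo_loss(-8.0, -12.0, -10.0, -10.0), 4))
```

Only the difference between the two responses matters, which is why DPO fits "A is better than B" data rather than single-target data.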

ORPO (Odds Ratio Preference Optimization) folds preference learning into the SFT step itself, giving one training stage instead of SFT followed by DPO. That can be simpler when you have preference data up front.
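As a sketch of what "folding preference learning into the SFT step" can look like, the function below adds an odds-ratio penalty to the ordinary SFT loss on the chosen response. It follows the commonly described ORPO formulation with length-averaged log-probabilities; the names and the weight `lam` are assumptions for illustration, and no reference model is involved.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def log_odds(avg_logprob):
    # odds(p) = p / (1 - p), with p = exp(avg_logprob); requires avg_logprob < 0.
    return avg_logprob - math.log1p(-math.exp(avg_logprob))

def orpo_loss(chosen_avg_logprob, rejected_avg_logprob, lam=0.1):
    """ORPO-style loss for one pair from per-token average log-probabilities."""
    sft_term = -chosen_avg_logprob  # ordinary SFT loss on the chosen response
    log_odds_ratio = log_odds(chosen_avg_logprob) - log_odds(rejected_avg_logprob)
    preference_term = softplus(-log_odds_ratio)  # -log(sigmoid(log odds ratio))
    return sft_term + lam * preference_term

# Equal probabilities: the preference term is log(2), added on top of the SFT term.
print(round(orpo_loss(math.log(0.5), math.log(0.5)), 4))
# Chosen response more likely than rejected: both terms shrink.
print(round(orpo_loss(math.log(0.7), math.log(0.3)), 4))
```

The single loss both imitates the chosen response and pushes the rejected one down, so there is no separate second stage.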

RLHF: the heavyweight ancestor

RLHF (Reinforcement Learning from Human Feedback) trains a separate reward model from preference data, then optimizes the language model against that reward model with reinforcement learning, typically PPO. It is powerful but complex and hard to stabilize. For most teams, DPO reaches similar ends with far less machinery, which is why it has largely displaced RLHF for fine-tuning small models.
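The two stages can be sketched with two small functions. This assumes the common setup: a pairwise (Bradley-Terry style) loss for the reward model, and an RL-stage reward that subtracts a penalty for drifting away from a reference model. The lesson itself does not specify these details, and the names and `beta` are illustrative. The PPO update itself is omitted; that is where most of the complexity lives.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def reward_model_loss(reward_chosen, reward_rejected):
    """Stage 1: train the reward model to score chosen above rejected.

    Equals -log(sigmoid(reward_chosen - reward_rejected)).
    """
    return softplus(-(reward_chosen - reward_rejected))

def shaped_reward(reward, policy_logprob, ref_logprob, beta=0.1):
    """Stage 2: the signal the RL step maximizes (assumed KL-penalty form).

    The penalty grows as the policy assigns a response more probability
    than the reference model did, discouraging drift.
    """
    return reward - beta * (policy_logprob - ref_logprob)

# The reward model is penalized less when it ranks the pair correctly.
print(reward_model_loss(3.0, 0.0) < reward_model_loss(0.0, 3.0))
# A response scored 2.0 by the reward model, with some drift from the reference.
print(round(shaped_reward(2.0, -1.0, -1.5), 2))
```

Compared with the DPO sketch, this needs an extra trained model plus an RL loop, which is the "machinery" the paragraph above refers to.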

Continued pretraining: build base ability

If the problem is that the base model doesn't speak your domain's language at all, continued pretraining — more raw next-token training on a large domain corpus — comes before SFT. It's expensive and rarely the first move; most projects start from an existing base and skip straight to SFT.

Decision guide

One right answer per input → SFT.
"A is better than B" judgments to refine an already-decent model → DPO (or ORPO) after SFT.
A whole new domain's language, with budget → continued pretraining, then SFT.
Reach for RLHF only if you specifically need a reward model.
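The guide above can be written as a small helper that returns the training stages in order. This is only a sketch of the rules as stated here; the function name and flags are hypothetical, and real projects weigh more factors than four booleans.

```python
def choose_objective(preference_pairs=False, new_domain_language=False,
                     pretraining_budget=False, need_reward_model=False):
    """Return the ordered list of training stages suggested by the guide."""
    stages = []
    # A whole new domain's language, with budget: continued pretraining first.
    if new_domain_language and pretraining_budget:
        stages.append("continued pretraining")
    # SFT is the workhorse and almost always the first objective you run.
    stages.append("SFT")
    # Comparative judgments refine an already-decent model after SFT.
    if preference_pairs:
        stages.append("RLHF" if need_reward_model else "DPO (or ORPO)")
    return stages

print(choose_objective())
print(choose_objective(preference_pairs=True))
print(choose_objective(new_domain_language=True, pretraining_budget=True))
```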

Most practical projects use SFT alone, occasionally followed by DPO. This Academy focuses on SFT and returns to alignment (DPO/ORPO) as an advanced topic in Track 4. Next, we dissect a single SFT example.

Key terms

Training objective
What the loss optimizes — e.g. imitate a target (SFT) or prefer one response over another (DPO).
DPO
Direct Preference Optimization: train on (chosen, rejected) pairs to prefer better responses, no reward model.
ORPO
Odds Ratio Preference Optimization: an objective that combines SFT and preference learning in a single stage.
RLHF
Reinforcement Learning from Human Feedback: train a reward model, then optimize with RL; powerful but complex.
Preference pair
A prompt with a chosen and a rejected response, used directly by DPO and ORPO, and by RLHF to train the reward model.
Continued pretraining
Further raw next-token training on a domain corpus to build base ability, before SFT.
