What are the three distinct techniques the term 'reasoning training' covers?

Chain-of-thought SFT (train on prompt → trace → answer with the loss on both), process supervision with a step-level reward model (PRM scores each step), and outcome-supervised RL (ORM rewards only the final verified answer) — typically run with GRPO at SLM scale. They get conflated under one label but optimise different things.

What does GRPO replace from PPO, and why does that matter for SLMs?

GRPO drops PPO's value (critic) network. The advantage is computed group-relative: sample several completions for the same prompt, then normalise their rewards by the group mean and standard deviation. There is no second network to train and no critic to keep stable, so training needs roughly half the memory and is computationally simpler. That tractability is what makes RL on SLMs practical.

Track 4 · Advanced · Lesson 15

Reasoning training: CoT SFT, process supervision, GRPO

After this lesson you can distinguish the three things people mean by "reasoning training" (CoT SFT, process supervision via PRM, outcome supervision via ORM-style RL), build a CoT SFT dataset that trains on the trace plus the answer, choose between PRM and ORM for your data budget, recognise why GRPO is the practical RL recipe for SLM-scale reasoning (no value network, group-relative advantage), and avoid the most common honest-metrics failure — calling a model that learned to emit <think> tags a "reasoning" model without verifier-grounded eval.

Level: advanced · Read time: ~12 min · Prerequisites: Preference tuning: DPO & ORPO

"Reasoning training" arrived as a single banner after o1, DeepSeek-R1, and the wave of distilled reasoning SLMs that followed. The label covers at least three different techniques that get cited interchangeably — chain-of-thought SFT, process-supervised RL with a step-level reward model, and outcome-supervised RL with a verifier — and they don't optimise the same thing, don't need the same data, and don't fail in the same ways. For an SLM team, the difference between picking the right one and picking the wrong one is roughly the difference between a model that does multi-step work and a model that has learned to look like it does.

The three techniques the label covers

Chain-of-thought SFT — supervised fine-tuning where each example is prompt → reasoning trace → final answer, and the loss covers both the trace and the answer. The simplest, the most data-efficient, and the right starting point unless you have very strong reasons to skip it.
Process supervision (PRM) — train a process reward model that scores each step of a reasoning trace, then use it as a reward signal to fine-tune the model with RL. Higher fidelity than outcome reward; far more expensive to collect data for (step-level human labels).
Outcome supervision (ORM) + RL — only the final answer is rewarded (by a verifier: a math executor, a code runner, a unit test, a string-match). The model is trained with RL to produce traces that lead to verifier-passing answers. Cheaper data, harder optimisation, the recipe behind R1-style models.

The three are layered, not exclusive: CoT SFT is usually the warm-start, and one of PRM/ORM follows. Skipping the SFT warm-start with cold RL is possible but quality is typically worse on small models.

Chain-of-thought SFT, in detail

The data shape is the lesson's most concrete artefact:

# one example
example = {
    "prompt": "Mira has 3 apples. She buys 4 more, eats 1, then gives half to her brother. How many does she have?",
    "trace": "Start: 3 apples. Buy 4 → 3 + 4 = 7. Eat 1 → 7 - 1 = 6. Give half away → 6 / 2 = 3.",
    "answer": "3",
}

# render with a separator the tokenizer doesn't merge
text = (
    f"<|user|>{example['prompt']}<|end|>\n"
    f"<|assistant|><think>{example['trace']}</think>{example['answer']}<|end|>"
)

# loss mask: train on the trace AND the answer (NOT the prompt; see Lesson 1.3)
# Lesson 2.5's collator with a "<|assistant|>" split works here unchanged.

Three things this pattern gets right that hand-rolled versions usually get wrong:

The loss is on both the trace and the answer. Masking only the answer teaches the model to skip the trace; masking only the trace teaches it to hallucinate answers that don't follow.
The trace and the answer are wrapped with the same chat-template tokens (Lesson 1.4) the model already speaks. Inventing new role markers that the tokenizer splits into multiple pieces silently degrades training.
The dataset is verifier-clean: every trace ends in the correct answer. Noisy traces (right answer, wrong arithmetic) teach the model to mimic the noise. Filter aggressively before training.
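The loss-mask rule can be sketched without any training library. This is a minimal illustration, not the collator from Lesson 2.5: `toy_tokenize` and `build_labels` are hypothetical helpers, the whitespace "tokenizer" stands in for a real one, and labels are kept as token strings instead of ids so the mask is easy to read. The only real-world convention it leans on is that PyTorch's cross-entropy loss ignores the label -100 by default.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default

def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: split on whitespace."""
    return text.split()

def build_labels(tokens, assistant_marker="<|assistant|>"):
    """Mask every token up to and including the assistant marker.

    Everything after the marker (trace, answer, end token) keeps its own
    token as the label, so the loss covers both the trace and the answer.
    """
    split = tokens.index(assistant_marker) + 1
    return [IGNORE_INDEX] * split + tokens[split:]

rendered = (
    "<|user|> What is 3 + 4 ? <|end|> "
    "<|assistant|> <think> 3 + 4 = 7 </think> 7 <|end|>"
)
tokens = toy_tokenize(rendered)
labels = build_labels(tokens)
```

With a real tokenizer the same split would be done on token ids, and the role markers would need to be single tokens for `index` to find them, which is the point of the second item above.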

Process supervision vs outcome supervision

Two ways to score a reasoning trace once the model can produce one:

Outcome reward (ORM) — run a verifier on the final answer. Math problem? Execute the arithmetic. Code problem? Run the unit tests. Multiple-choice? String-match the letter. Reward = 1 if verifier passes, 0 otherwise (or graded). Cheap because the data is just (prompt, gold answer); the trace doesn't need labels.
Process reward (PRM) — train a separate model on step-level human labels: "is this step correct given the steps before it?". Use the PRM to score every step of a candidate trace at training time. Higher fidelity (rewards correctness of reasoning, not just the answer) but the data is expensive — collecting step-level labels at scale was the most labour-intensive part of the early reasoning-model work.

The practical reality at SLM scale: ORM is what most reproductions of the R1 recipe use, because outcome verifiers are cheap to write for math and code (the two domains where almost all public reasoning benchmarks live) and the data scales with no human labellers. PRM gives better local signal but the data cost dominates.
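To make "outcome verifiers are cheap to write" concrete, here is a minimal sketch of a final-answer verifier for the <think>…</think> format used earlier. The function names are hypothetical, the gold answer is assumed to be a string, and the numeric-then-string comparison is one simple choice among many; a production verifier for math would normalise answers more carefully.

```python
import re

def extract_answer(completion):
    """Return the text after the closing </think> tag, or None if the tag is missing."""
    match = re.search(r"</think>(.*)", completion, flags=re.DOTALL)
    if match is None:
        return None
    return match.group(1).replace("<|end|>", "").strip()

def outcome_reward(completion, gold):
    """Return 1.0 if the final answer matches gold, else 0.0. The trace is never inspected."""
    answer = extract_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Numeric comparison so that "3" and "3.0" count as the same answer.
        return 1.0 if abs(float(answer) - float(gold)) < 1e-9 else 0.0
    except ValueError:
        # Non-numeric answers (e.g. a multiple-choice letter) fall back to string match.
        return 1.0 if answer == gold.strip() else 0.0

good = outcome_reward("<think>6 / 2 = 3</think>3<|end|>", "3")
bad = outcome_reward("<think>6 / 2 = 4</think>4<|end|>", "3")
```

Note what this reward cannot see: a trace with wrong arithmetic that happens to land on the right answer still scores 1.0. That blind spot is the fidelity gap a PRM is meant to close.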

GRPO: why it beats PPO for SLM-scale RL

Proximal Policy Optimisation (PPO) is the canonical RL fine-tuning algorithm — it needs a policy network (the model you're training) and a value network (a critic that estimates expected reward). For SLM-scale reasoning, PPO has two problems: training a competent critic on noisy reasoning rewards is itself hard, and the second network roughly doubles memory.

Group Relative Policy Optimisation (GRPO) — the algorithm behind DeepSeek-R1's public recipe — solves both by dropping the critic entirely. The advantage is computed group-relative:

For each prompt, sample G completions from the current policy (e.g. G = 8).
Score each completion with the verifier or PRM.
Compute the advantage of each sample as (reward - group_mean) / group_std. No value network needed.
Apply a clipped policy-gradient update with a KL penalty to the reference (frozen) model, much as PPO does from this point.
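A worked example of step 3, as a small sketch. `group_advantages` is a hypothetical helper; using the population standard deviation and a small epsilon in the denominator are assumptions, not details confirmed by the lesson.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Normalise one prompt's rewards by the group mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the G samples (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# G = 8 samples, two of which pass the verifier.
mixed = group_advantages([1, 0, 0, 0, 1, 0, 0, 0])
# mean = 0.25, std ≈ 0.433: passing samples get about +1.73, failing ones about -0.58

# Every sample fails: all advantages are zero, so this prompt contributes no gradient.
all_fail = group_advantages([0] * 8)
```

The second case is worth watching in practice: prompts the policy always solves or never solves produce no learning signal under this normalisation, so a batch dominated by them spends rollouts without moving the model.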

The wins for an SLM team: no critic to train (roughly half the memory, since the second network and its optimiser state disappear), no critic instability to debug, and group-relative normalisation that handles reward scaling automatically, which matters because verifier rewards are usually 0/1 and would otherwise need careful baseline subtraction. The cost: G× the rollouts per step, which makes inference throughput the new bottleneck. vLLM-style rollouts are basically mandatory.

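# sketch of one GRPO step, after CoT SFT warm-start
# policy, ref_policy, verifier, grpo_loss, optimiser and batch are placeholders, not a real library API
from statistics import mean, pstdev

G = 8                                              # group size
for prompt, gold in batch:
    completions = policy.sample(prompt, n=G, temperature=0.7)
    rewards = [verifier(c, gold) for c in completions]   # 0 or 1
    mu, sigma = mean(rewards), pstdev(rewards)
    # if every sample scores the same, sigma is 0 and all advantages are 0 (no signal)
    advantages = [(r - mu) / (sigma + 1e-6) for r in rewards]
    loss = grpo_loss(policy, ref_policy, prompt, completions, advantages, kl_beta=0.04)
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()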
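The `grpo_loss` call in the sketch is left abstract. One way to picture what it computes for a single completion is below, in plain Python on scalar log-probabilities. This is a simplification under stated assumptions: real implementations work per token on tensors, `clip_eps=0.2` is a common PPO default and not a value given in this lesson, and the KL term uses a non-negative estimator of the form r - log r - 1, which is one choice among several.

```python
import math

def grpo_sample_loss(logp_new, logp_old, logp_ref, advantage,
                     clip_eps=0.2, kl_beta=0.04):
    """Loss for one completion, treated as a single action (sequence-level log-probs)."""
    # Importance ratio between the current policy and the policy that sampled the rollout.
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1 - clip_eps, min(1 + clip_eps, ratio))
    # PPO-style pessimistic objective: take the smaller of the two surrogate terms.
    surrogate = min(ratio * advantage, clipped * advantage)
    # KL estimate to the frozen reference model, with r = pi_ref / pi_new.
    log_r = logp_ref - logp_new
    kl = math.exp(log_r) - log_r - 1
    # Negate because optimisers minimise.
    return -(surrogate - kl_beta * kl)

on_policy = grpo_sample_loss(-5.0, -5.0, -5.0, advantage=1.0)  # ratio 1, KL 0
clipped_up = grpo_sample_loss(-4.0, -5.0, -4.0, advantage=1.0)  # ratio e, clipped to 1.2
```

The clipping stops a single high-advantage sample from pulling the policy far from the one that generated the rollouts, and the KL term keeps it near the SFT warm-start.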

The math-and-code bias of public reasoning benchmarks

Read any reasoning model card and the numbers are on GSM8K, MATH, HumanEval, MBPP, sometimes AIME. There's a reason: those are the domains with cheap automatic verifiers. The verifier writes itself for math (execute the arithmetic) and for code (run the tests); writing a verifier for "is this a good plan?" or "is this a correct legal analysis?" is the actual hard problem nobody has solved at scale. So reasoning models are trained where the verifiers are, and they're reported on benchmarks that ride the same rails.

What this means for an SLM project:

If your task is math or code or anything with a clear binary verifier, reasoning training is a real lever — the R1-style recipe transfers.
If your task is open-ended reasoning (planning, summarisation, multi-step writing, legal analysis), the public benchmarks tell you almost nothing about what to expect on your task, and the recipes don't obviously apply because there's no verifier to plug in.
GSM8K and MATH contamination is now widely documented (Lesson 2.17); the canonical-vs-paraphrased gap is the contamination tell. Treat the headline numbers with the same caution as any other public benchmark.
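The canonical-vs-paraphrased gap is simple to compute once both sets have been run through the same verifier. A minimal sketch with made-up per-problem outcomes; the helper names are hypothetical and no threshold for "too large a gap" is implied.

```python
def pass_rate(results):
    """Fraction of problems whose output passed the verifier."""
    return sum(results) / len(results)

def contamination_gap(canonical_results, paraphrased_results):
    """Pass rate on the original wording minus pass rate on reworded versions."""
    return pass_rate(canonical_results) - pass_rate(paraphrased_results)

canonical = [True, True, True, True, False]      # made-up outcomes, original wording
paraphrased = [True, False, True, False, False]  # same problems, reworded
gap = contamination_gap(canonical, paraphrased)  # 0.8 - 0.4
```

A model that reasons should score about the same on both wordings; a large positive gap suggests the canonical problems were memorised.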

Honest beat — small-model "reasoning" is often format-following

The most common way a small-model team declares "reasoning" success: train on R1-distilled traces, generate outputs that emit <think> tags and verbose multi-step prose, observe that the outputs "look like reasoning," and ship. The model learned the format of a reasoning trace. Whether the trace is correct end-to-end (whether the arithmetic in step 2 is actually right) is a separate question that only verifier-grounded eval can answer. Reporting trace-emission rate as a reasoning metric is the new vanity number. Always run the math executor / code runner / step-level checker before quoting a reasoning gain.

The minimum honest evaluation

If you train a reasoning SLM, the report has three numbers, in this order:

Verifier-pass rate on a held-out test set the model didn't see during training. Math executor, unit tests, string-match — whatever the task admits. This is the only number that measures the thing "reasoning training" is supposed to improve.
Substrate drift on MMLU / HellaSwag / your task gold set (Lesson 2.17). RL on narrow reasoning data is a textbook recipe for catastrophic forgetting; if the substrate dropped 5 points, your "reasoning" gain is partly other-skill loss.
Trace-format rate, reported separately from verifier-pass rate. The two together let a reader see whether you have format-following only (high trace rate, low verifier pass) or genuine reasoning (both high). One number hides the failure mode.
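The three numbers can be produced by one small function. This is a sketch under assumptions: `reasoning_report` and `string_match_verifier` are hypothetical names, the substrate scores are passed in as already-computed accuracies, and the outputs and scores in the example are made up for illustration.

```python
import re

THINK_PATTERN = re.compile(r"<think>.+?</think>", flags=re.DOTALL)

def reasoning_report(outputs, golds, verifier, substrate_before, substrate_after):
    """Return the three numbers in reporting order."""
    n = len(outputs)
    passes = sum(1 for out, gold in zip(outputs, golds) if verifier(out, gold))
    formatted = sum(1 for out in outputs if THINK_PATTERN.search(out))
    return {
        "verifier_pass_rate": passes / n,
        "substrate_drift": substrate_after - substrate_before,
        "trace_format_rate": formatted / n,
    }

def string_match_verifier(output, gold):
    """Hypothetical verifier: compare the text after </think> with the gold answer."""
    return output.split("</think>")[-1].strip() == gold

outputs = [
    "<think>3 + 4 = 7</think>7",
    "<think>3 + 4 = 8</think>8",
    "<think>2 * 5 = 10</think>10",
    "<think>9 - 3 = 5</think>5",
]
golds = ["7", "7", "10", "6"]
report = reasoning_report(outputs, golds, string_match_verifier,
                          substrate_before=0.62, substrate_after=0.57)
```

In this made-up run every output has a well-formed trace, yet only half pass the verifier and the substrate score fell by five points: the format-following pattern the separate numbers are meant to expose.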

Key idea

"Reasoning training" is three techniques, not one: CoT SFT (warm-start, train on trace + answer), process or outcome supervision (PRM vs ORM, depending on data budget), and at SLM scale GRPO as the practical RL recipe (no critic, group-relative advantage). Pick by data shape and verifier availability, not by what sounds rigorous. And separate verifier-pass rate from trace-format rate in the report — without that separation, "reasoning" on small models is mostly the model learning to emit a <think> tag.

That closes the v3 expansion of Track 4 (Advanced). From distillation and preference tuning through quantization, multi-task, serving, observability, the feedback loop, tool-use, structured pruning, speculative decoding, and now reasoning training, the track covers the post-SFT toolkit end-to-end. Claim your certificate next.

Key terms

Reasoning training: An umbrella term covering chain-of-thought SFT, process-supervised RL, and outcome-supervised RL — techniques aimed at producing multi-step traces that a verifier can score, not just final answers.
Chain-of-thought SFT: Supervised fine-tuning where each example is prompt → reasoning trace → final answer, with the loss covering both the trace and the answer. The standard warm-start before any RL-based reasoning step.
Process supervision: Scoring each step of a reasoning trace with a separately trained reward model (the PRM). Higher fidelity than outcome reward; data is expensive because the labels are step-level human judgments.
Outcome supervision: Scoring only the final answer with a verifier (math executor, code runner, string-match). Cheap data, harder optimisation; the recipe behind R1-style training.
PRM (Process Reward Model): A reward model trained on step-level labels. Scores each step of a candidate trace at RL training time.
ORM (Outcome Reward Model): A reward signal derived from a final-answer verifier. Reward = 1 if the verifier passes, 0 otherwise. Often a deterministic verifier rather than a model.
GRPO (Group Relative Policy Optimisation): RL fine-tuning algorithm used by DeepSeek-R1 and similar recipes. Drops PPO's value network; computes advantage group-relative across G samples per prompt. The practical RL recipe at SLM scale.
Verifier-grounded eval: Evaluation that runs a deterministic checker (math executor, code runner, step checker) on the model's output. The only way to distinguish genuine reasoning from format-following.
Trace-format rate: Fraction of outputs that emit the expected reasoning-trace structure (<think> tags, multi-step prose). Must be reported separately from verifier-pass rate to expose format-following.
