Table of Contents
- TL;DR
- What problem are we actually solving?
- Chain of thought, and why training beats prompting
- The DeepSeek-R1 moment
- GRPO: PPO with the critic deleted
- Four ways to build a reasoning model
- When not to reach for a reasoning model
- Where this leaves us
- Sources and further reading
For most of the last few years, the headline story in language models was scale: bigger models, more data, more pre-training compute. Then a different kind of model started topping the hard benchmarks: competition mathematics, contest coding, graduate-level science. These models were not obviously bigger. They were slower. They would sit there and produce thousands of tokens of working before answering a question that a normal model would answer in one line.
These are reasoning models, and they represent a genuine shift in how we get capability out of a fixed set of weights. This post is a practitioner’s tour of what they are, why they work, and the specific reinforcement learning recipe, Group Relative Policy Optimization (GRPO), that made them reproducible outside a handful of frontier labs.
TL;DR
- A reasoning model is an LLM trained to write an explicit chain of thought before its final answer. It spends more compute at inference to get harder questions right.
- The core insight is a second scaling axis. Pre-training scales the weights; test-time compute scales the amount of thinking per query. Reasoning models are how you train a model to use that second axis well.
- The surprising empirical result from DeepSeek-R1-Zero: useful reasoning can emerge from pure reinforcement learning against automatically checkable rewards, with no supervised reasoning examples at all.
- GRPO is the RL algorithm behind this. It is PPO with the value network deleted. Instead of learning a critic, it samples a group of answers per prompt and grades each one against the group’s own average reward.
- There are four ways to build these models (inference-time scaling, pure RL, SFT+RL, and distillation), and they trade off cost, control, and ceiling very differently.
- Reasoning is not free or universal. It burns tokens, adds latency, and does nothing for tasks that were never multi-step in the first place.
Scope and caveats. This is an explainer, not a paper. The plots below are schematic: they illustrate the shape of published results rather than reproduce exact numbers. Where a claim comes from a specific paper I link it, and the Sources section lists the primary material.
What problem are we actually solving?
Standard LLMs are trained to predict the next token. That objective makes them extraordinary at tasks where the answer is essentially a retrieval or interpolation over things they have seen: summarising, rewriting, answering factual questions, drafting code that resembles code they were trained on.
Where they historically fell down is anything that needs a chain of dependent steps, where an early mistake quietly poisons everything after it. Multi-digit arithmetic, competition geometry, a proof, a tricky debugging session. The model has to hold intermediate state, commit to a step, and build on it, and a single-pass forward computation gives it a fixed, shallow budget to do all of that in.
There is a clean intuition here. A transformer doing a single forward pass has a fixed amount of computation per token. Hard problems do not have a fixed difficulty. So the question becomes: how do we let the model spend more computation on harder problems? The answer reasoning models settled on is almost embarrassingly simple: let it write more tokens, and use those tokens as a scratchpad.
This is the idea of test-time compute: quality you buy at inference rather than at training time. A long chain of thought is one way to spend it (think longer on one answer). Sampling many answers and taking a majority vote is another (think wider across many answers). Reasoning models mostly chase the first, because a model that has learned to think longer well is more useful than one you have to sample 64 times.
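To make the "think wider" option concrete, here is a minimal sketch of majority voting over sampled answers. It assumes the final answers have already been extracted as strings; the function name `majority_vote` is illustrative rather than taken from any library.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common answer and the fraction of samples that agree with it."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Five sampled final answers to the same question (made-up values).
samples = ["14", "48", "14", "14", "16"]
answer, agreement = majority_vote(samples)
print(answer, agreement)
```

The agreement fraction doubles as a rough confidence signal, but every extra sample is a full generation, which is the cost the paragraph above refers to.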
Chain of thought, and why training beats prompting
The seed of all this is an old prompting observation: if you ask a model to “think step by step”, it gets multi-step problems right more often. Writing out intermediate steps breaks the problem into smaller pieces and gives later tokens relevant context to condition on. That is chain-of-thought prompting.
Prompting gets you part of the way, but it is fragile. The model was never trained to reason; you are coaxing a behaviour out of it that it learned only incidentally. It will happily produce confident, fluent, wrong reasoning. The reasoning-model bet is to stop coaxing and start training the behaviour directly: to make “produce a long, correct chain of thought” the thing the model is explicitly optimised for.
That raises the obvious question: optimised against what? You cannot easily write down a supervised label for “good reasoning”. But for a large class of problems you can cheaply check the final answer. A maths problem has a known solution. Code either passes the unit tests or it does not. That checkability is the hook that lets reinforcement learning in.
The DeepSeek-R1 moment
The model family that made this concrete, and reproducible, was DeepSeek-R1. Its training recipe is worth understanding because it cleanly separates the ideas.
The team first built DeepSeek-R1-Zero: a base model trained with reinforcement learning only, no supervised fine-tuning step, using two simple automatically-checkable rewards.
- Accuracy reward. Is the final answer correct (matched against a known solution, or via compiler/test execution)?
- Format reward. Did the model put its reasoning inside <think>...</think> tags and its answer where it belongs?
That is the entire signal. No human preference labels, no reasoning demonstrations. And the striking result is that coherent reasoning emerged anyway. Over training, the model spontaneously started writing longer and longer chains of thought, re-checking its own steps, and backtracking when a line of attack failed: the now-famous “aha moment” where it pauses mid-solution and reconsiders. Nobody designed that behaviour. It was the cheapest way to earn more reward.
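The two rewards are simple enough to sketch in a few lines. This is an illustration under assumptions, not DeepSeek's implementation: it assumes the completion is one `<think>...</think>` block followed by the answer, and that correctness can be checked by exact string match against a reference.

```python
import re

# Assumed layout: "<think>reasoning</think>final answer"
THINK_PATTERN = re.compile(r"^<think>(.*?)</think>(.*)$", re.DOTALL)


def format_reward(completion):
    """1.0 if the completion is a think block followed by a non-empty answer."""
    match = THINK_PATTERN.match(completion.strip())
    return 1.0 if match and match.group(2).strip() else 0.0


def accuracy_reward(completion, reference_answer):
    """1.0 if the text after </think> equals the reference answer exactly."""
    match = THINK_PATTERN.match(completion.strip())
    if not match:
        return 0.0
    return 1.0 if match.group(2).strip() == reference_answer.strip() else 0.0


good = "<think>2 * 6 = 12, then 2 + 12 = 14</think>14"
bad = "<think>2 + 2 = 4, then 4 * 6 = 24</think>24"
print(format_reward(good), accuracy_reward(good, "14"))
print(format_reward(bad), accuracy_reward(bad, "14"))
```

Exact string match is the crudest possible checker; real maths rewards usually normalise the answer first (for example, treating "14" and "14.0" as equal).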
R1-Zero had a wart: its raw reasoning was effective but ugly (mixed languages, poor readability) because nothing in the reward cared about presentation. So the full DeepSeek-R1 added structure around the RL core: a small amount of high-quality “cold-start” reasoning data to fine-tune the base model first, then RL, then a round of rejection-sampled supervised fine-tuning on the best outputs, then a final RL pass covering helpfulness and harmlessness as well as reasoning.
The takeaway most people drew from this is the R1-Zero result, not the multi-stage pipeline built around it. R1-Zero is the existence proof: reasoning is reinforceable. You do not need a giant corpus of worked solutions. You need a way to check answers and an RL algorithm that can climb that signal. Which brings us to GRPO.
GRPO: PPO with the critic deleted
To see what GRPO simplifies, you have to remember how the standard RL-for-LLMs recipe works. The dominant method, inherited from RLHF, is Proximal Policy Optimization (PPO). PPO is an actor-critic method: alongside the policy (the LLM you are training) it trains a second network, the critic or value model, whose job is to estimate how good a given state is so you can compute an advantage, meaning how much better an action was than expected.
That critic is expensive. For an LLM it is typically another model of comparable size to the policy, which you have to hold in memory and train alongside it. On a 70B-parameter policy, the value network roughly doubles your RL memory footprint. (If you want the actor-critic intuition in full, I wrote a separate post on it.)
GRPO’s question is: do we actually need a learned critic? Its answer is no, at least not if you are willing to sample more than one answer per prompt. Instead of asking a network “how good is this state?”, GRPO asks the much simpler empirical question “how good is this answer compared to the other answers the model just gave to the same question?”
The mechanism
For each prompt \(q\), GRPO does the following:
- Sample a group of \(G\) completions \(\{o_1, \dots, o_G\}\) from the current policy (typically \(G\) is 8–16).
- Score each one with the reward function, giving rewards \(\mathbf{r} = \{r_1, \dots, r_G\}\).
- Turn those rewards into advantages by normalising against the group itself:
\[\hat{A}_i = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}\]
That single line is the whole trick. The group’s mean reward is the baseline. There is no value network; the other samples in the group play the role the critic used to play. An answer that beats its group average gets a positive advantage and is reinforced; one below average gets a negative advantage and is suppressed. Dividing by the standard deviation just scales the update so easy prompts (where everything scores high) and hard prompts (where everything scores low) contribute comparably.
A worked example. Take the prompt “Calculate \(2 + 2 \times 6\)” and sample a group of \(G = 8\) answers. Suppose 4 of them get the right answer (14) and 4 get it wrong, so the rewards are four 1s and four 0s. Then:
\[\text{mean}(\mathbf{r}) = 0.5, \qquad \text{std}(\mathbf{r}) \approx 0.53\]
and every answer’s advantage falls out immediately (the 0.53 is the sample standard deviation, which divides by \(G - 1\); the population version would be exactly 0.5 and give advantages of \(\pm 1\)):
\[\hat{A}_{\text{correct}} = \frac{1 - 0.5}{0.53} \approx +0.94, \qquad \hat{A}_{\text{wrong}} = \frac{0 - 0.5}{0.53} \approx -0.94\]
That is the entire signal GRPO sends back: push up the four answers that beat the group, push down the four that lagged it, by an equal and opposite amount. Notice what did not happen: nothing estimated the absolute “value” of the prompt, and nothing needed a second network. The four wrong answers became negative examples purely because their peers did better on the same question.
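The worked example can be reproduced in a few lines. This sketch assumes the sample standard deviation (dividing by \(G - 1\)), which is what yields the 0.53 above; the helper name `group_advantages` is illustrative.

```python
import statistics


def group_advantages(rewards):
    """Normalise each reward against the mean and sample std of its own group."""
    mean = statistics.mean(rewards)
    std = statistics.stdev(rewards)  # sample standard deviation (divides by G - 1)
    if std == 0:
        # Every answer scored the same: no answer is better than its peers.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


rewards = [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]  # four correct, four wrong
advantages = group_advantages(rewards)
print([round(a, 2) for a in advantages])
```

The explicit zero-spread branch is a choice made for this sketch; implementations more commonly add a small constant to the denominator instead.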
A second piece of intuition matters here: the advantage is computed per answer, but applied per token. Every token in a winning completion is nudged up by the same \(\hat{A}_i\), and every token in a losing one is nudged down. GRPO has no idea which step in a long chain of thought was the clever one; it just makes all of a good answer’s tokens a bit more likely and all of a bad answer’s tokens a bit less likely. Over thousands of groups, the steps that consistently show up in winning answers get reinforced and the dead ends wash out. This is crude credit assignment, and it is part of why GRPO needs many samples and many steps to work, but it is also why it is so simple and robust.
Why normalise by the standard deviation? It turns a raw reward into a z-score within the group, so the update size depends on how exceptional an answer was relative to its peers, not on the absolute reward scale. A side effect is that a prompt where every sample succeeds (or every sample fails) has near-zero spread and contributes almost nothing, which is sensible, because a question the model already always gets right (or always gets wrong) has little to teach it this step. Some recent variants (Dr. GRPO) drop this division to avoid a subtle difficulty bias; see the TRL docs for the details.
The objective
With advantages in hand, GRPO updates the policy with a clipped objective that is otherwise pure PPO. Writing \(\pi_\theta\) for the current policy and \(\pi_{\theta_\text{old}}\) for the policy that generated the samples, the per-token probability ratio is
\[r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})}\]
and the loss is
\[\mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{\sum_{i} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \left[ \min\left( r_{i,t}\,\hat{A}_{i}, \; \text{clip}(r_{i,t}, 1-\epsilon, 1+\epsilon)\,\hat{A}_{i} \right) - \beta\, \mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] \right]\]
The ratio \(r_{i,t}\) is just “how much more (or less) likely is this token under the new policy than under the one that generated it?” Multiplying it by the advantage gives the basic policy-gradient signal: if the answer was good (\(\hat{A}_i > 0\)), increase the probability of its tokens; if it was bad, decrease them. Everything else in the objective is there to stop that signal from over-correcting. Two pieces do that work:
- The clip. The \(\text{clip}(\cdot, 1-\epsilon, 1+\epsilon)\) stops any single update from moving the policy too far in one step. The outer \(\min\) makes it pessimistic: for a good answer it caps how much you are allowed to raise a token’s probability in one go (no leaping to a ratio of 5 because one sample happened to win), and for a bad answer it limits how hard you push it down. Updates that stay inside the trust region \([1-\epsilon, 1+\epsilon]\) pass through untouched; only the runaway ones get clipped. This is the “proximal” in PPO, and it is what keeps training from collapsing.
- The KL penalty. The \(\beta\, \mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]\) term tethers the policy to a frozen reference model (usually the starting checkpoint), so it does not drift into degenerate, high-reward gibberish, a real failure mode when the only thing you optimise is a reward. The coefficient \(\beta\) is the dial: higher keeps the model close to its original, coherent self; lower lets it adapt faster but risks reward-hacking. Interestingly, a lot of recent open work sets \(\beta = 0\) and finds GRPO trains fine without it; on verifiable-reward tasks the clip alone is often enough to keep things stable.
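The objective is easier to read as code. The sketch below evaluates the loss above on plain Python lists of per-token log-probabilities; it is an illustration, not a training implementation (no gradients, no batching). The per-token KL estimate used here, \(e^{d} - d - 1\) with \(d = \log \pi_{\text{ref}} - \log \pi_\theta\), is an assumption of this sketch, chosen because it is the low-variance estimator commonly paired with GRPO.

```python
import math


def clipped_term(ratio, advantage, epsilon=0.2):
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single token."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_estimate(logp_new, logp_ref):
    """Non-negative per-token KL estimate (an assumption of this sketch)."""
    log_ratio = logp_ref - logp_new
    return math.exp(log_ratio) - log_ratio - 1.0


def grpo_loss(logp_new, logp_old, logp_ref, advantages, epsilon=0.2, beta=0.0):
    """Each logp_* holds one list of per-token log-probs per completion;
    advantages holds one float per completion, shared by all its tokens."""
    total, n_tokens = 0.0, 0
    for new_i, old_i, ref_i, adv in zip(logp_new, logp_old, logp_ref, advantages):
        for lp_new, lp_old, lp_ref in zip(new_i, old_i, ref_i):
            ratio = math.exp(lp_new - lp_old)
            total += clipped_term(ratio, adv, epsilon) - beta * kl_estimate(lp_new, lp_ref)
            n_tokens += 1
    return -total / n_tokens  # normalised by total token count, as in the loss above


print(clipped_term(5.0, 1.0))   # good answer, runaway ratio: capped at 1 + epsilon
print(clipped_term(0.1, -1.0))  # bad answer, already pushed down: capped at 1 - epsilon
print(clipped_term(5.0, -1.0))  # bad answer made more likely: full penalty, no cap
```

The third call shows the pessimism of the outer \(\min\): a move in the wrong direction is never clipped, so the objective always takes the less flattering of the two estimates.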
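Putting the loop together: for each prompt, sample a group of \(G\) completions from the current policy, score them with the reward functions, normalise the rewards within the group to get advantages, and take a clipped policy-gradient step, with the optional KL penalty holding the policy near the reference model.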
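As a toy illustration of the whole loop, the sketch below runs group-relative updates on a deliberately tiny "policy": a softmax over four candidate answers, one of which is correct. Everything here is an assumption made for illustration. There is no language model, each update is a single on-policy step (so the ratio is 1 and the clip never activates), the KL term is omitted, and the population standard deviation plus a small constant is used for normalisation.

```python
import math
import random


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def group_advantages(rewards, eps=1e-4):
    """Group-relative advantages; eps keeps uniform groups from dividing by zero."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def train_toy_grpo(num_answers=4, correct=2, group_size=8, steps=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    logits = [0.0] * num_answers  # the "policy": one logit per candidate answer
    for _ in range(steps):
        probs = softmax(logits)
        # 1. Sample a group of answers from the current policy.
        group = rng.choices(range(num_answers), weights=probs, k=group_size)
        # 2. Score each answer with a verifiable reward.
        rewards = [1.0 if a == correct else 0.0 for a in group]
        # 3. Normalise rewards within the group.
        advantages = group_advantages(rewards)
        # 4. Policy-gradient step: d log pi(a) / d logit_j = 1[j == a] - probs[j].
        grad = [0.0] * num_answers
        for a, adv in zip(group, advantages):
            for j in range(num_answers):
                indicator = 1.0 if j == a else 0.0
                grad[j] += adv * (indicator - probs[j]) / group_size
        logits = [l + lr * g for l, g in zip(logits, grad)]
    return softmax(logits)


final_probs = train_toy_grpo()
print([round(p, 3) for p in final_probs])
```

Groups in which every sample is right (or every sample is wrong) produce zero advantages and therefore no update, so learning slows by itself as the policy converges on the correct answer.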
Why this matters in practice
GRPO is not a better optimiser in some abstract sense; it is a cheaper and simpler one that happens to be a great fit for verifiable-reward reasoning tasks.
| | PPO (classic RLHF) | GRPO |
|---|---|---|
| Baseline for advantage | Learned value network | Mean reward of the sampled group |
| Extra model in memory | Yes, a full critic | No |
| Samples per prompt | One | A group of \(G\) |
| Best fit | Dense or learned rewards | Cheap verifiable rewards |
| Main cost | Critic training + memory | More generation per prompt |
You trade memory (no critic) for generation (you sample a whole group every step). On large models that is usually a good trade, and it is a big part of why GRPO put serious reasoning-model training within reach of teams that are not OpenAI. The TRL library implements it in a few dozen lines of configuration:
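```python
import re

from trl import GRPOConfig, GRPOTrainer

# Illustrative helper (not part of TRL): take the text that follows </think>.
def extract_answer(completion):
    return completion.split("</think>")[-1].strip()

def reward_correct(completions, answer, **kwargs):
    # 1.0 if the extracted final answer matches, else 0.0
    return [float(extract_answer(c) == a) for c, a in zip(completions, answer)]

def reward_format(completions, **kwargs):
    # 1.0 if the reasoning is wrapped in <think>...</think>, else 0.0
    return [float(bool(re.search(r"<think>.*?</think>", c, re.DOTALL))) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B",
    reward_funcs=[reward_correct, reward_format],
    args=GRPOConfig(
        num_generations=8,  # G: the group size
        max_completion_length=4096,
        beta=0.0,  # KL coefficient; 0 disables the reference penalty
    ),
    train_dataset=dataset,  # assumed: your dataset with "prompt" and "answer" columns
)
trainer.train()
```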
The reward functions are where your domain knowledge lives. For maths you check the final answer; for code you run tests; for format you check the tags. The art is designing rewards that are hard to game: a model under RL pressure is ruthlessly good at finding the cheap way to score, so a sloppy reward gets reward-hacked rather than solved.
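For the "run tests" case, a reward can be as small as executing the candidate together with its tests in a child process and checking the exit code. This is a minimal sketch under loud assumptions: the function name `code_reward` is made up, the tests are plain `assert` statements appended to the candidate source, and there is no sandboxing, so it must not be pointed at untrusted model output as written.

```python
import subprocess
import sys


def code_reward(candidate_source, test_source, timeout_s=10.0):
    """Return 1.0 if the candidate passes the tests, else 0.0.

    WARNING: this executes arbitrary code with no isolation; a real setup
    would run it inside a sandbox with resource limits.
    """
    program = candidate_source + "\n\n" + test_source + "\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return 0.0  # infinite loops and very slow solutions score zero
    return 1.0 if result.returncode == 0 else 0.0


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
print(code_reward("def add(a, b):\n    return a + b", tests))
print(code_reward("def add(a, b):\n    return a - b", tests))
```

Even this tiny reward is gameable: a candidate that calls `sys.exit(0)` before the assertions run would score 1.0, which is exactly the kind of cheap win the paragraph above warns about.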
Four ways to build a reasoning model
GRPO is the engine behind the most interesting branch, but it is not the only way to get a reasoning model. It is worth seeing the whole menu, because they trade off very differently.
- Inference-time scaling. Change nothing about the weights; spend more compute at inference. Chain-of-thought prompting, self-consistency (sample many, majority-vote), best-of-N with a verifier. Cheapest to adopt, but you pay the cost on every query and the ceiling is limited by what the base model can already do.
- Pure reinforcement learning. The R1-Zero route. Start from a base model, apply GRPO against verifiable rewards, let reasoning emerge. No reasoning demonstrations needed, just a reward you can compute. Highest “interesting science” payoff and no dependence on an existing reasoning model, but RL is finicky and compute-hungry.
- SFT then RL. The full R1 recipe. Warm-start with a little supervised reasoning data for readability, then do the RL. This is the workhorse for production-quality reasoning models: it combines the stability of SFT with the ceiling of RL, and it mirrors the familiar RLHF pipeline.
- Distillation. Skip the RL entirely: take a strong reasoning model, generate a big pile of its chain-of-thought traces, and fine-tune a smaller model on them. Astonishingly effective and cheap: in the DeepSeek-R1 paper, small models distilled from R1 outperformed the same base models trained directly with large-scale RL. The catch is structural: the student is bounded by its teacher, so distillation spreads frontier capability rather than creating it.
A useful rule of thumb. If you have a strong teacher, distillation is almost always the cost-effective move for a smaller model. Pure RL is what you reach for when you are building the teacher, or when no model out there is already good at your task.
When not to reach for a reasoning model
Reasoning models are not a free upgrade, and treating them as a drop-in replacement for every LLM call is a good way to burn money and latency.
- They are expensive per query. A reasoning model might emit thousands of tokens of hidden thinking before its answer. For a high-volume classification or extraction endpoint, that is a brutal cost multiplier for no benefit.
- They add latency. “Think for 8,000 tokens” is the opposite of what you want behind an interactive, low-latency feature.
- They do nothing for non-reasoning tasks. Summarisation, translation, tone rewriting, retrieval-augmented answers: none of these were bottlenecked on multi-step reasoning, so the extra thinking is wasted.
- The thinking can be wrong with confidence. A long chain of thought reads as authoritative. It is still a sampled artefact and can rationalise its way to a wrong answer. Verifiable domains hide this; open-ended ones do not.
The honest framing is that reasoning is a tool for a class of problems (ones with dependent steps and a checkable notion of correct), not a new default. The most useful production systems route to a reasoning model only when the question warrants it, the same way you would not run a SAT solver to look up a phone number.
Where this leaves us
The conceptual shift worth holding onto is that capability has two scaling axes now, not one. For years the only knob was pre-training: more parameters, more tokens, more compute baked into the weights. Reasoning models added a second knob: teach a model to spend compute at inference, on the specific problem in front of it, and reward it for spending that compute well.
GRPO is the piece that made the second knob practical to turn outside frontier labs. By throwing away the critic and letting a group of samples be its own baseline, it turned “do RL on a giant LLM” from a memory-bound research project into something a small team can run with an open-source library and a verifiable reward. R1-Zero is the result that made it exciting: point that machinery at a checkable task and coherent reasoning falls out, unprompted.
What is still open is how far the checkable-reward trick generalises. Maths and code give you clean rewards almost for free. Most valuable real-world reasoning (strategy, diagnosis, design) does not. The most interesting open question in this space is whether we can manufacture good reward signals for the messy domains, or whether reasoning models stay sharpest exactly where the answer key is easy to write.
Sources and further reading
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning — the primary paper for R1 and R1-Zero, the emergent-reasoning result, and the multi-stage pipeline.
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning — where GRPO is introduced, with the full derivation against PPO.
- Understanding Reasoning LLMs — Sebastian Raschka’s excellent breakdown of the four approaches and the budget-conscious variants.
- Hugging Face LLM Course, Chapter 12 — a hands-on walkthrough of reasoning models and training GRPO with TRL, including the exact objective and advantage formulas.
- TRL GRPOTrainer documentation: the reference implementation, plus later refinements (Dr. GRPO, DAPO) that fix length and difficulty biases in the original loss.
- Chain-of-Thought Prompting Elicits Reasoning in LLMs: the original chain-of-thought prompting paper, where the story starts.
- Self-Consistency Improves Chain of Thought Reasoning (Wang et al.) and Training Verifiers to Solve Math Word Problems (Cobbe et al.) — the majority-vote and best-of-N-with-verifier results behind the first figure.
- Scaling LLM Test-Time Compute Optimally (Snell et al.) — a careful study of when extra inference compute beats a bigger model.
