Reasoning Models: Teaching LLMs to Think Before They Answer

Nish · June 27, 2026

⏱️ 17 min read

Table of Contents

For most of the last few years, the headline story in language models was scale: bigger models, more data, more pre-training compute. Then a different kind of model started topping the hard benchmarks: competition mathematics, contest coding, graduate-level science. These models were not obviously bigger. They were slower. They would sit there and produce thousands of tokens of working before answering a question that a normal model would answer in one line.

These are reasoning models, and they represent a genuine shift in how we get capability out of a fixed set of weights. This post is a practitioner’s tour of what they are, why they work, and the specific reinforcement learning recipe, Group Relative Policy Optimization (GRPO), that made them reproducible outside a handful of frontier labs.

TL;DR

  • A reasoning model is an LLM trained to write an explicit chain of thought before its final answer. It spends more compute at inference to get harder questions right.
  • The core insight is a second scaling axis. Pre-training scales the weights; test-time compute scales the amount of thinking per query. Reasoning models are how you train a model to use that second axis well.
  • The surprising empirical result from DeepSeek-R1-Zero: useful reasoning can emerge from pure reinforcement learning against automatically checkable rewards, with no supervised reasoning examples at all.
  • GRPO is the RL algorithm behind this. It is PPO with the value network deleted. Instead of learning a critic, it samples a group of answers per prompt and grades each one against the group’s own average reward.
  • There are four ways to build these models (inference-time scaling, pure RL, SFT+RL, and distillation), and they trade off cost, control, and ceiling very differently.
  • Reasoning is not free or universal. It burns tokens, adds latency, and does nothing for tasks that were never multi-step in the first place.

Scope and caveats. This is an explainer, not a paper. The plots below are schematic: they illustrate the shape of published results rather than reproduce exact numbers. Where a claim comes from a specific paper I link it, and the Sources section lists the primary material.

What problem are we actually solving?

Standard LLMs are trained to predict the next token. That objective makes them extraordinary at tasks where the answer is essentially a retrieval or interpolation over things they have seen: summarising, rewriting, answering factual questions, drafting code that resembles code they were trained on.

Where they historically fell down is anything that needs a chain of dependent steps, where an early mistake quietly poisons everything after it. Multi-digit arithmetic, competition geometry, a proof, a tricky debugging session. The model has to hold intermediate state, commit to a step, and build on it, and a single-pass forward computation gives it a fixed, shallow budget to do all of that in.

There is a clean intuition here. A transformer doing a single forward pass has a fixed amount of computation per token. Hard problems do not have a fixed difficulty. So the question becomes: how do we let the model spend more computation on harder problems? The answer reasoning models settled on is almost embarrassingly simple: let it write more tokens, and use those tokens as a scratchpad.

Accuracy rising as more samples are drawn per problem, for self-consistency and best-of-N, versus a flat single-sample baseline.
Schematic, not measured data. Three ways to answer the same question with the same frozen model: one greedy sample (the flat baseline), a majority vote over many samples (self-consistency), and picking the best of many with a trained verifier. Spending more compute at inference lifts accuracy even though the weights never change. This is the second scaling axis that reasoning models are built to exploit. Curve shapes follow the test-time-compute literature; see Snell et al. (2024).

This is the idea of test-time compute: quality you buy at inference rather than at training time. A long chain of thought is one way to spend it (think longer on one answer). Sampling many answers and taking a majority vote is another (think wider across many answers). Reasoning models mostly chase the first, because a model that has learned to think longer well is more useful than one you have to sample 64 times.

Chain of thought, and why training beats prompting

The seed of all this is an old prompting observation: if you ask a model to “think step by step”, it gets multi-step problems right more often. Writing out intermediate steps breaks the problem into smaller pieces and gives later tokens relevant context to condition on. That is chain-of-thought prompting.

Prompting gets you part of the way, but it is fragile. The model was never trained to reason; you are coaxing a behaviour out of it that it learned only incidentally. It will happily produce confident, fluent, wrong reasoning. The reasoning-model bet is to stop coaxing and start training the behaviour directly: to make “produce a long, correct chain of thought” the thing the model is explicitly optimised for.

That raises the obvious question: optimised against what? You cannot easily write down a supervised label for “good reasoning”. But for a large class of problems you can cheaply check the final answer. A maths problem has a known solution. Code either passes the unit tests or it does not. That checkability is the hook that lets reinforcement learning in.

The DeepSeek-R1 moment

The model family that made this concrete, and reproducible, was DeepSeek-R1. Its training recipe is worth understanding because it cleanly separates the ideas.

The team first built DeepSeek-R1-Zero: a base model trained with reinforcement learning only, no supervised fine-tuning step, using two simple automatically-checkable rewards.

  • Accuracy reward. Is the final answer correct (matched against a known solution, or via compiler/test execution)?
  • Format reward. Did the model put its reasoning inside <think>...</think> tags and its answer where it belongs?

That is the entire signal. No human preference labels, no reasoning demonstrations. And the striking result is that coherent reasoning emerged anyway. Over training, the model spontaneously started writing longer and longer chains of thought, re-checking its own steps, and backtracking when a line of attack failed: the now-famous “aha moment” where it pauses mid-solution and reconsiders. Nobody designed that behaviour. It was the cheapest way to earn more reward.

Two rising curves over RL training steps: average response length in tokens and accuracy on held-out maths, both increasing together.
Schematic, reproducing the qualitative shape from the DeepSeek-R1 report. As RL proceeds, the model's responses get longer and more accurate together. The extra length is self-discovered reasoning, not an instruction.

R1-Zero had a wart: its raw reasoning was effective but ugly (mixed languages, poor readability) because nothing in the reward cared about presentation. So the full DeepSeek-R1 added structure around the RL core: a small amount of high-quality “cold-start” reasoning data to fine-tune the base model first, then RL, then a round of rejection-sampled supervised fine-tuning on the best outputs, then a final RL pass covering helpfulness and harmlessness as well as reasoning.

flowchart LR A["Base model"] --> B["Cold-start SFT<br/>small CoT set"] B --> C["Reasoning RL<br/>GRPO + verifiable rewards"] C --> D["Rejection-sample<br/>best outputs & SFT"] D --> E["Final RL<br/>reasoning + alignment"] E --> F["DeepSeek-R1"] A -.->|"RL only, no SFT"| Z["DeepSeek-R1-Zero"] class C,E focus; class F terminal; class Z guardrail;
R1-Zero is the pure-RL branch that proved reasoning can emerge from rewards alone. R1 wraps that core in SFT stages for readability and general alignment.

The takeaway most people drew from this is the lower branch, not the upper one. R1-Zero is the existence proof: reasoning is reinforceable. You do not need a giant corpus of worked solutions. You need a way to check answers and an RL algorithm that can climb that signal. Which brings us to GRPO.

GRPO: PPO with the critic deleted

To see what GRPO simplifies, you have to remember how the standard RL-for-LLMs recipe works. The dominant method, inherited from RLHF, is Proximal Policy Optimization (PPO). PPO is an actor-critic method: alongside the policy (the LLM you are training) it trains a second network, the critic or value model, whose job is to estimate how good a given state is so you can compute an advantage, meaning how much better an action was than expected.

That critic is expensive. For an LLM it is typically another model of comparable size to the policy, which you have to hold in memory and train alongside it. On a 70B-parameter policy, the value network roughly doubles your RL memory footprint. (If you want the actor-critic intuition in full, I wrote a separate post on it.)

GRPO’s question is: do we actually need a learned critic? Its answer is no, at least not if you are willing to sample more than one answer per prompt. Instead of asking a network “how good is this state?”, GRPO asks the much simpler empirical question “how good is this answer compared to the other answers the model just gave to the same question?”

The mechanism

For each prompt \(q\), GRPO does the following:

  1. Sample a group of \(G\) completions \(\{o_1, \dots, o_G\}\) from the current policy (typically \(G\) is 8–16).
  2. Score each one with the reward function, giving rewards \(\mathbf{r} = \{r_1, \dots, r_G\}\).
  3. Turn those rewards into advantages by normalising against the group itself:
\[\hat{A}_{i} = \frac{r_i - \text{mean}(\mathbf{r})}{\text{std}(\mathbf{r})}\]

That single line is the whole trick. The group’s mean reward is the baseline. There is no value network; the other samples in the group play the role the critic used to play. An answer that beats its group average gets a positive advantage and is reinforced; one below average gets a negative advantage and is suppressed. Dividing by the standard deviation just scales the update so easy prompts (where everything scores high) and hard prompts (where everything scores low) contribute comparably.

Bar chart of eight sampled completions for one prompt, with green positive-advantage bars for correct answers above the group mean and red negative-advantage bars for incorrect ones below it.
GRPO grades each of the G sampled answers against its own group's mean reward. The group replaces PPO's learned value network as the baseline: correct answers float up, incorrect ones get pushed down.

A worked example. Take the prompt “Calculate \(2 + 2 \times 6\)” and sample a group of \(G = 8\) answers. Suppose 4 of them get the right answer (14) and 4 get it wrong, so the rewards are four 1s and four 0s. Then:

\[\text{mean}(\mathbf{r}) = 0.5, \qquad \text{std}(\mathbf{r}) \approx 0.53\]

and every answer’s advantage falls out immediately:

\[\hat{A}_{\text{correct}} = \frac{1 - 0.5}{0.53} \approx +0.94, \qquad \hat{A}_{\text{wrong}} = \frac{0 - 0.5}{0.53} \approx -0.94\]

That is the entire signal GRPO sends back: push up the four answers that beat the group, push down the four that lagged it, by an equal and opposite amount. Notice what did not happen: nothing estimated the absolute “value” of the prompt, and nothing needed a second network. The four wrong answers became negative examples purely because their peers did better on the same question.

A second piece of intuition matters here: the advantage is computed per answer, but applied per token. Every token in a winning completion is nudged up by the same \(\hat{A}_i\), and every token in a losing one is nudged down. GRPO has no idea which step in a long chain of thought was the clever one; it just makes all of a good answer’s tokens a bit more likely and all of a bad answer’s tokens a bit less likely. Over thousands of groups, the steps that consistently show up in winning answers get reinforced and the dead ends wash out. This is crude credit assignment, and it is part of why GRPO needs many samples and many steps to work, but it is also why it is so simple and robust.

Why normalise by the standard deviation? It turns a raw reward into a z-score within the group, so the update size depends on how exceptional an answer was relative to its peers, not on the absolute reward scale. A side effect is that a prompt where every sample succeeds (or every sample fails) has near-zero spread and contributes almost nothing, which is sensible, because a question the model already always gets right (or always gets wrong) has little to teach it this step. Some recent variants (Dr. GRPO) drop this division to avoid a subtle difficulty bias; see the TRL docs for the details.

The objective

With advantages in hand, GRPO updates the policy with a clipped objective that is otherwise pure PPO. Writing \(\pi_\theta\) for the current policy and \(\pi_{\theta_\text{old}}\) for the policy that generated the samples, the per-token probability ratio is

\[r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})}\]

and the loss is

\[\mathcal{L}_{\text{GRPO}}(\theta) = - \frac{1}{\sum_{i} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \left[ \min\left( r_{i,t}\,\hat{A}_{i}, \; \text{clip}(r_{i,t}, 1-\epsilon, 1+\epsilon)\,\hat{A}_{i} \right) - \beta\, \mathbb{D}_{\text{KL}}\!\left[\pi_\theta \,\|\, \pi_{\text{ref}}\right] \right]\]

The ratio \(r_{i,t}\) is just “how much more (or less) likely is this token under the new policy than under the one that generated it?” Multiplying it by the advantage gives the basic policy-gradient signal: if the answer was good (\(\hat{A}_i > 0\)), increase the probability of its tokens; if it was bad, decrease them. Everything else in the objective is there to stop that signal from over-correcting. Two pieces do that work:

  • The clip. The \(\text{clip}(\cdot, 1-\epsilon, 1+\epsilon)\) stops any single update from moving the policy too far in one step. The outer \(\min\) makes it pessimistic: for a good answer it caps how much you are allowed to raise a token’s probability in one go (no leaping to a ratio of 5 because one sample happened to win), and for a bad answer it limits how hard you push it down. Updates that stay inside the trust region \([1-\epsilon, 1+\epsilon]\) pass through untouched; only the runaway ones get clipped. This is the “proximal” in PPO, and it is what keeps training from collapsing.
  • The KL penalty. The \(\beta\, \mathbb{D}_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}]\) term tethers the policy to a frozen reference model (usually the starting checkpoint), so it does not drift into degenerate, high-reward gibberish, a real failure mode when the only thing you optimise is a reward. The coefficient \(\beta\) is the dial: higher keeps the model close to its original, coherent self; lower lets it adapt faster but risks reward-hacking. Interestingly, a lot of recent open work sets \(\beta = 0\) and finds GRPO trains fine without it; on verifiable-reward tasks the clip alone is often enough to keep things stable.

Putting the loop together:

Circular diagram of the GRPO training loop: the policy samples a group of G completions, each is scored by the reward function, the rewards are normalised within the group into advantages, and a clipped policy-gradient update with a KL term feeds back into the policy. The centre notes there is no critic; the group is its own baseline.
The GRPO loop, run every training step. The group of samples supplies its own baseline, so there is no separate value network to train.

Why this matters in practice

GRPO is not a better optimiser in some abstract sense; it is a cheaper and simpler one that happens to be a great fit for verifiable-reward reasoning tasks.

  PPO (classic RLHF) GRPO
Baseline for advantage Learned value network Mean reward of the sampled group
Extra model in memory Yes, a full critic No
Samples per prompt One A group of \(G\)
Best fit Dense or learned rewards Cheap verifiable rewards
Main cost Critic training + memory More generation per prompt

You trade memory (no critic) for generation (you sample a whole group every step). On large models that is usually a good trade, and it is a big part of why GRPO put serious reasoning-model training within reach of teams that are not OpenAI. The TRL library implements it in a few dozen lines of configuration:

from trl import GRPOConfig, GRPOTrainer

def reward_correct(completions, answer, **kwargs):
    # 1.0 if the extracted final answer matches, else 0.0
    return [float(extract_answer(c) == a) for c, a in zip(completions, answer)]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-3B",
    reward_funcs=[reward_correct, reward_format],
    args=GRPOConfig(
        num_generations=8,        # G: the group size
        max_completion_length=4096,
        beta=0.0,                 # KL coefficient; 0 disables the reference penalty
    ),
    train_dataset=dataset,
)
trainer.train()

The reward functions are where your domain knowledge lives. For maths you check the final answer; for code you run tests; for format you check the tags. The art is designing rewards that are hard to game: a model under RL pressure is ruthlessly good at finding the cheap way to score, so a sloppy reward gets reward-hacked rather than solved.

Four ways to build a reasoning model

GRPO is the engine behind the most interesting branch, but it is not the only way to get a reasoning model. It is worth seeing the whole menu, because they trade off very differently.

flowchart TB A["Want a reasoning model"] --> B{"Train the weights?"} B -->|No| C["Inference-time scaling<br/>CoT prompts, majority vote"] B -->|Yes| D{"Have a strong<br/>teacher already?"} D -->|Yes| E["Distillation<br/>SFT on teacher traces"] D -->|No| F{"Have reasoning<br/>demonstrations?"} F -->|No| G["Pure RL<br/>R1-Zero style"] F -->|Yes| H["SFT + RL<br/>full R1 recipe"] class C,E,G,H focus; class B,D,F decision;
The four routes to a reasoning model, ordered roughly by how much training they require.
  1. Inference-time scaling. Change nothing about the weights; spend more compute at inference. Chain-of-thought prompting, self-consistency (sample many, majority-vote), best-of-N with a verifier. Cheapest to adopt, but you pay the cost on every query and the ceiling is limited by what the base model can already do.

  2. Pure reinforcement learning. The R1-Zero route. Start from a base model, apply GRPO against verifiable rewards, let reasoning emerge. No reasoning demonstrations needed, just a reward you can compute. Highest “interesting science” payoff and no dependence on an existing reasoning model, but RL is finicky and compute-hungry.

  3. SFT then RL. The full R1 recipe. Warm-start with a little supervised reasoning data for readability, then do the RL. This is the workhorse for production-quality reasoning models: it combines the stability of SFT with the ceiling of RL, and it mirrors the familiar RLHF pipeline.

  4. Distillation. Skip the RL entirely: take a strong reasoning model, generate a big pile of its chain-of-thought traces, and fine-tune a smaller model on them. Astonishingly effective and cheap: distilled small models routinely beat much larger models trained with RL from scratch. The catch is structural: you cannot exceed the teacher, so distillation spreads frontier capability rather than creating it.

A useful rule of thumb. If you have a strong teacher, distillation is almost always the cost-effective move for a smaller model. Pure RL is what you reach for when you are building the teacher, or when no model out there is already good at your task.

When not to reach for a reasoning model

Reasoning models are not a free upgrade, and treating them as a drop-in replacement for every LLM call is a good way to burn money and latency.

  • They are expensive per query. A reasoning model might emit thousands of tokens of hidden thinking before its answer. For a high-volume classification or extraction endpoint, that is a brutal cost multiplier for no benefit.
  • They add latency. “Think for 8,000 tokens” is the opposite of what you want behind an interactive, low-latency feature.
  • They do nothing for non-reasoning tasks. Summarisation, translation, tone rewriting, retrieval-augmented answers: none of these were bottlenecked on multi-step reasoning, so the extra thinking is wasted.
  • The thinking can be wrong with confidence. A long chain of thought reads as authoritative. It is still a sampled artefact and can rationalise its way to a wrong answer. Verifiable domains hide this; open-ended ones do not.

The honest framing is that reasoning is a tool for a class of problems (ones with dependent steps and a checkable notion of correct), not a new default. The most useful production systems route to a reasoning model only when the question warrants it, the same way you would not run a SAT solver to look up a phone number.

Where this leaves us

The conceptual shift worth holding onto is that capability has two scaling axes now, not one. For years the only knob was pre-training: more parameters, more tokens, more compute baked into the weights. Reasoning models added a second knob: teach a model to spend compute at inference, on the specific problem in front of it, and reward it for spending that compute well.

GRPO is the piece that made the second knob practical to turn outside frontier labs. By throwing away the critic and letting a group of samples be its own baseline, it turned “do RL on a giant LLM” from a memory-bound research project into something a small team can run with an open-source library and a verifiable reward. R1-Zero is the result that made it exciting: point that machinery at a checkable task and coherent reasoning falls out, unprompted.

What is still open is how far the checkable-reward trick generalises. Maths and code give you clean rewards almost for free. Most valuable real-world reasoning (strategy, diagnosis, design) does not. The most interesting open question in this space is whether we can manufacture good reward signals for the messy domains, or whether reasoning models stay sharpest exactly where the answer key is easy to write.

Sources and further reading

Citation Information

If you find this content useful & plan on using it, please consider citing it using the following format:

@misc{nish-blog,
  title = {Reasoning Models: Teaching LLMs to Think Before They Answer},
  author = {Nish},
  howpublished = {\url{https://www.nishbhana.com/Reasoning-Models/}},
  note = {[Online; accessed]},
  year = {2026}
}

x.com, Facebook