Glossary Entry

Group Relative Policy Optimization

A reinforcement learning algorithm that drops PPO's value network and instead estimates advantages by comparing a group of sampled answers to the same prompt against their own average reward.

RL Training LLMs

Also called: GRPO

Seed source: DeepSeekMath

Group Relative Policy Optimization (GRPO) is the reinforcement learning method introduced in DeepSeekMath and later used to train the DeepSeek-R1 reasoning models. It is a variant of Proximal Policy Optimization (PPO) that removes the separately trained critic (value) network. Because the critic is typically about the same size as the policy model, dropping it substantially reduces the memory and compute needed for RL on large models.

In place of a learned baseline, GRPO samples a group of completions for each prompt, scores them with a reward function, and normalises each reward by subtracting the group’s mean and dividing by its standard deviation. The result serves as the advantage: completions that beat their group average are reinforced and those below it are suppressed. A PPO-style clipped objective limits how far each update moves from the policy that generated the samples, and an optional KL penalty keeps the trained policy close to a reference policy.
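
The group-relative advantage and the clipped term can be illustrated with a minimal Python sketch. It is not the DeepSeekMath implementation: the function names, the example rewards, the small `eps` added to the standard deviation, the choice of population standard deviation, and the clip range of 0.2 are all assumptions made for illustration. For brevity the clipped term is computed once per completion from sequence-level log-probabilities, whereas practical implementations usually apply it per token.

```python
import math
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise each reward against its own group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std (assumption; implementations differ)
    # eps (assumption) avoids division by zero when all rewards are equal
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO-style clipped surrogate for one completion (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)  # new policy prob / sampling policy prob
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical rewards for four completions of one prompt (1.0 = correct answer).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical log-probabilities of the first completion under each policy.
term = clipped_term(logp_new=-3.5, logp_old=-4.0, advantage=advantages[0])
```

In this sketch the two correct completions receive positive advantages and the two incorrect ones negative advantages of equal size. If every completion in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal to the update.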