Group Relative Policy Optimization
GRPO is a simpler, more efficient alternative to PPO for fine-tuning language models. Instead of training a separate value network, it estimates the baseline from group statistics: the rewards of several responses sampled for the same prompt.
Introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024) and central to training DeepSeek-R1.
1 The Problem with PPO
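Recall that PPO training requires an advantage estimate: how much better was this action than what we'd expect on average? The standard approach trains a separate value (critic) network that predicts expected return at each token. This creates two problems: the critic is an additional model that must be held in memory and trained alongside the policy, and training can become unstable when the critic's estimates are poor.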
2 The Core Idea: Group-Relative Advantage
GRPO's key insight: for a given prompt \(x\), sample a group of \(G\) responses \(\{y_1, y_2, \ldots, y_G\}\) from the current policy. Score each with a reward function. Use the group mean and standard deviation as the baseline instead of a learned value function.
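Concretely, response \(i\) with reward \(r_i\) receives the advantage \(\hat{A}_i = (r_i - \mu) / \sigma\), where \(\mu\) and \(\sigma\) are the mean and standard deviation of the group's rewards. This is exactly z-score normalization. It tells us: compared to the other responses the policy generated for this same prompt, was this one better or worse? No critic network is needed, just arithmetic on the group's reward scores.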
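As a minimal sketch of the group-relative advantage, the function below z-scores a list of rewards for one prompt. The function name, the `eps` guard against a zero standard deviation, and the choice of the population standard deviation are assumptions made for illustration; implementations differ on these details.

```python
from statistics import fmean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward against its own group's mean and standard deviation."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std (an assumption; some code uses the sample std)
    # eps keeps the division finite when every response earns the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four responses to one prompt, scored by a binary correctness check
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# approximately [1.0, -1.0, -1.0, 1.0]
```

When all rewards in a group are identical, every advantage is zero and that prompt contributes no gradient.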
Why Group-Relative Makes Sense
- Natural baseline: "How does response i compare to what I'd normally produce?" is exactly what we want to measure.
- Prompt-conditional: The baseline adapts per prompt. An easy prompt has a high group mean; a hard one has a low group mean. The policy gets credit for doing well given the difficulty.
- Works well with binary rewards: For math problems (correct = 1, wrong = 0), group statistics provide a meaningful signal even without dense reward shaping.
- No critic training: Eliminates the entire critic optimization loop, which removes one of the models PPO must hold in memory and train, and simplifies hyperparameter tuning.
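The prompt-conditional behavior can be made concrete for binary rewards. As a worked case, assume the population standard deviation is used (some implementations use the sample standard deviation instead), suppose \(k\) of the \(G\) responses are correct, and let \(p = k/G\). Then:

\[
\mu = p, \qquad \sigma = \sqrt{p(1-p)}
\]

\[
\hat{A}_\text{correct} = \frac{1 - p}{\sqrt{p(1-p)}} = \sqrt{\frac{1-p}{p}}, \qquad
\hat{A}_\text{wrong} = \frac{0 - p}{\sqrt{p(1-p)}} = -\sqrt{\frac{p}{1-p}}
\]

On a hard prompt with \(p = 0.25\), a correct response gets \(\sqrt{3} \approx 1.73\) and a wrong one about \(-0.58\). On an easy prompt with \(p = 0.75\), a correct response gets only about \(0.58\) while a wrong one gets about \(-1.73\). When \(p = 0\) or \(p = 1\), \(\sigma = 0\) and the group carries no learning signal.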
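3 Worked Example: GRPO Step by Step
One full GRPO training step runs as follows: sample a group of \(G\) responses for a prompt, score each with the reward function, compute the group's mean and standard deviation, convert each reward into a group-relative advantage, and update the policy with the clipped objective.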
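To make one training step concrete, here is a toy sketch in Python. It is not a language-model implementation: it assumes a policy that is just a softmax over four candidate "responses", a hypothetical reward that marks response 2 as correct, and a strictly on-policy update (so the probability ratio is 1 and clipping and the KL penalty are omitted). All names are illustrative.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grpo_step(logits, reward_fn, group_size, lr, rng):
    """One on-policy update of a toy categorical policy over candidate responses."""
    probs = softmax(logits)
    # 1. Sample a group of G responses (as indices) for the same prompt
    group = rng.choices(range(len(logits)), weights=probs, k=group_size)
    # 2. Score each response
    rewards = [reward_fn(y) for y in group]
    # 3. Group statistics
    mu = sum(rewards) / group_size
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / group_size)
    if sigma == 0.0:
        return logits  # identical rewards: every advantage is zero, so no update
    # 4. Group-relative advantages
    advantages = [(r - mu) / sigma for r in rewards]
    # 5. Gradient ascent; d log pi(y) / d logit_j = 1[j == y] - probs[j]
    grad = [0.0] * len(logits)
    for y, adv in zip(group, advantages):
        for j in range(len(logits)):
            grad[j] += adv * ((1.0 if j == y else 0.0) - probs[j]) / group_size
    return [z + lr * g for z, g in zip(logits, grad)]

def reward(y):
    return 1.0 if y == 2 else 0.0  # hypothetical rule: only response 2 is correct

rng = random.Random(0)
logits = [0.0, 0.0, 0.0, 0.0]  # uniform policy at the start
for _ in range(200):
    logits = grpo_step(logits, reward, group_size=8, lr=0.5, rng=rng)
final_probs = softmax(logits)  # probability mass shifts toward response 2
```

Because the advantages in a group sum to zero, each update raises the logit of the rewarded response and lowers the logits of the sampled unrewarded ones, without any learned value estimate.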
4 The GRPO Objective
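For each prompt \(x\), sample \(G\) responses from the old policy \(\pi_{\theta_\text{old}}\). Score each with a reward function \(r(\cdot)\). The GRPO objective, which training maximizes (the loss is its negative), is:
\[
\mathcal{J}_\text{GRPO}(\theta) = \frac{1}{G} \sum_{i=1}^{G} \min\!\left( \rho_i \hat{A}_i,\; \operatorname{clip}(\rho_i, 1{-}\varepsilon, 1{+}\varepsilon)\, \hat{A}_i \right) - \beta\,\mathrm{KL}[\pi_\theta \| \pi_\text{ref}]
\]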
Where each term does the following:
\(\rho_i = \pi_\theta(y_i \mid x) / \pi_{\theta_\text{old}}(y_i \mid x)\)
The probability ratio: how much more or less likely is the current policy to produce response \(y_i\) compared to when it was sampled? This is the PPO importance-sampling ratio.
\(\hat{A}_i = (r_i - \mu) / \sigma\)
The group-relative advantage: a z-score comparing response \(i\)'s reward \(r_i\) to the group mean \(\mu\), scaled by the group standard deviation \(\sigma\). Positive means above average; negative means below average for this prompt.
\(\operatorname{clip}(\rho_i, 1{-}\varepsilon, 1{+}\varepsilon)\)
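The PPO clip: prevents the policy from taking steps that are too large. Once the ratio moves more than \(\varepsilon\) (typically about 0.2) away from 1 in the direction the advantage favors, the clipped term contributes no gradient, so a single update cannot push the policy much further. This keeps training stable.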
\(\beta\,\mathrm{KL}[\pi_\theta \| \pi_\text{ref}]\)
The KL penalty: prevents the policy from diverging too far from the reference (SFT) model. Without this, the model can reward-hack into degenerate behavior. \(\beta\) controls the strength.
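The four terms above can be combined in a few lines. The sketch below follows the sequence-level ratio as defined here, taking one log-probability per response; many implementations apply the ratio per token instead. The function name, the default values for `clip_eps` and `beta`, and the assumption that a KL estimate is passed in already computed are all illustrative choices.

```python
import math

def grpo_objective(logp_new, logp_old, advantages, kl, clip_eps=0.2, beta=0.04):
    """Clipped surrogate averaged over the group, minus the KL penalty (to be maximized)."""
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # rho_i
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Taking the min makes the surrogate a pessimistic bound on the unclipped term
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms) - beta * kl

# Two responses: the policy has moved toward the good one and away from the bad one
value = grpo_objective(
    logp_new=[-1.5, -2.5],
    logp_old=[-2.0, -2.0],
    advantages=[1.0, -1.0],
    kl=0.0,
)
# Both ratios fall outside [0.8, 1.2], so both terms are clipped: (1.2 - 0.8) / 2 = 0.2
```

In this example the unclipped ratios are roughly 1.65 and 0.61, so without the clip the objective would keep rewarding further movement in the same direction.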
What counts as a reward function in GRPO?
GRPO rewards can be rule-based rather than neural. For mathematical reasoning tasks (as in DeepSeek-R1 training), the reward is simply \(r(y) = 1\) if the final answer is correct and \(r(y) = 0\) otherwise.
Additional format rewards (did the model use a chain-of-thought format? did it produce well-structured reasoning?) can be added. The binary nature of these rewards is exactly why group statistics work well — they create a meaningful signal even without a neural reward model.
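A rule-based reward of this kind might look like the sketch below. The answer convention (a final `\boxed{...}`), the `<think>...</think>` reasoning tags, the exact-string comparison, and the `format_weight` value are assumptions chosen for illustration, not a description of any particular training pipeline.

```python
import re

def accuracy_reward(response: str, reference_answer: str) -> float:
    """Return 1.0 if the last boxed answer matches the reference exactly, else 0.0."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == reference_answer.strip() else 0.0

def format_reward(response: str) -> float:
    """Return 1.0 if the response opens with a <think>...</think> block, then an answer."""
    pattern = r"\s*<think>.+?</think>.+"
    return 1.0 if re.fullmatch(pattern, response, flags=re.DOTALL) else 0.0

def total_reward(response: str, reference_answer: str, format_weight: float = 0.1) -> float:
    """Correctness dominates; the format term is a small, hypothetical bonus."""
    return accuracy_reward(response, reference_answer) + format_weight * format_reward(response)

sample = "<think>2 + 2 = 4</think> The answer is \\boxed{4}."
score = total_reward(sample, "4")  # correct answer and expected format
```

Exact string matching is the simplest possible check; a real verifier would usually normalize answers (for example, treating `0.5` and `1/2` as equal) before comparing.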
GRPO vs PPO — Side by Side
| Property | PPO | GRPO |
|---|---|---|
| Advantage estimation | Learned value (critic) network | Group-relative z-score — no training |
| Models in memory | 4 (policy, reference, critic, reward model) | 3 (policy, reference, reward model); 2 when the reward is rule-based |
| Reward type | Dense or sparse, neural RM typical | Works with sparse binary rewards |
| Stability | Can be unstable; critic must be trained carefully | More stable; no critic instability |
| Samples per prompt | 1–2 (typical) | G = 4–16 (higher memory per batch) |
| Best for | General RLHF with dense rewards | Math, code, reasoning with verifiable rewards |
| Used in | InstructGPT, Claude, most RLHF pipelines | DeepSeekMath, DeepSeek-R1, QwQ |