DeepSeekMath · 2024

Group Relative Policy Optimization

GRPO is a simpler, more memory-efficient alternative to PPO for fine-tuning language models. Instead of training a separate value network, it estimates the baseline from the rewards of a group of responses sampled for the same prompt.

Introduced in DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (2024) and central to training DeepSeek-R1.

1 The Problem with PPO

Recall that PPO training requires an advantage estimate: how much better was this action than what we'd expect on average? The standard approach trains a separate value (critic) network that predicts expected return at each token. This creates two problems: extra memory for the critic, and a critic that is hard to train stably.

PPO Architecture
🧠 Policy (actor) LM + 🧮 Value (critic) LM (trained in parallel) → ⚖️ Reward Model → 📈 PPO Gradient Update
Requires 4 models in memory simultaneously: actor, actor reference (SFT), critic, and reward model. Memory-intensive and complex to tune.

GRPO Architecture
🧠 Policy LM → sample G outputs for each prompt → ⚖️ Reward Model → 📊 Group Statistics (μ, σ) → 📈 GRPO Gradient Update
Requires only 3 models: policy, policy reference, and reward model. No critic to train. The group baseline is computed directly from the sampled rewards.

PPO's value network problem: The critic must predict token-level value across sequences of varying length, which is difficult to train stably. In practice, this adds significant complexity and memory cost without always providing better advantage estimates than a simple baseline.

2 The Core Idea: Group-Relative Advantage

GRPO's key insight: for a given prompt \(x\), sample a group of \(G\) responses \(\{y_1, y_2, \ldots, y_G\}\) from the current policy. Score each with a reward function. Use the group mean and standard deviation as the baseline instead of a learned value function.

\( \hat{A}_i = \dfrac{r_i - \mu_{\text{group}}}{\sigma_{\text{group}}} \quad\text{where}\quad \mu = \frac{1}{G}\sum_{j=1}^G r_j,\quad \sigma = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_j - \mu)^2} \)

This is exactly z-score normalization. It tells us: compared to the other responses the policy generated for this same prompt, was this one better or worse? No critic network needed — just arithmetic on the group's reward scores.
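The formula above can be sketched in a few lines of plain Python. This is a minimal illustration, not the DeepSeekMath implementation: the function name `group_advantages` and the small `eps` safeguard are assumptions added here, the latter to avoid dividing by zero when every reward in the group is identical.

```python
import math

def group_advantages(rewards, eps=1e-8):
    """Return z-scored advantages for one prompt's group of rewards."""
    g = len(rewards)
    mu = sum(rewards) / g
    # Population standard deviation (divide by G), matching the formula above.
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / g)
    # eps is an added safeguard (not part of the formula) for groups where
    # every reward is identical and sigma is zero.
    return [(r - mu) / (sigma + eps) for r in rewards]

rewards = [1.0, 0.0, 0.0, 1.0]  # two correct, two wrong
advantages = group_advantages(rewards)
print(advantages)
```

With two correct and two wrong responses, the mean and standard deviation are both 0.5, so the advantages come out at approximately +1 and -1.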

Why Group-Relative Makes Sense

  • Natural baseline: "How does response i compare to what I'd normally produce?" is exactly what we want to measure.
  • Prompt-conditional: The baseline adapts per prompt. An easy prompt has a high group mean; a hard one has a low group mean. The policy gets credit for doing well given the difficulty.
  • Works well with binary rewards: For math problems (correct = 1, wrong = 0), group statistics provide a meaningful signal even without dense reward shaping.
  • No critic training: Eliminates the entire critic optimization loop, removing one of PPO's four models from memory and simplifying hyperparameter tuning.
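
As a worked case of the binary-reward point above, suppose a fraction \(p\) of the \(G\) responses is correct, with \(0 < p < 1\). Using the population standard deviation from the advantage formula, the statistics reduce to:

\[ \mu = p, \qquad \sigma = \sqrt{p(1-p)}, \qquad \hat{A}_{\text{correct}} = \frac{1 - p}{\sqrt{p(1-p)}} = \sqrt{\frac{1-p}{p}}, \qquad \hat{A}_{\text{wrong}} = \frac{0 - p}{\sqrt{p(1-p)}} = -\sqrt{\frac{p}{1-p}} \]

On a hard prompt where only 1 of 8 responses is correct (\(p = 0.125\)), the correct response gets \(\hat{A} = \sqrt{7} \approx 2.65\) and each wrong one gets \(\hat{A} = -1/\sqrt{7} \approx -0.38\). When \(p = 0\) or \(p = 1\), every \(r_i - \mu\) is zero and \(\sigma = 0\), so that group carries no learning signal.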

3 Interactive Demo: GRPO Step by Step

Work through one full GRPO training step for the prompt below, using a group of G = 6 sampled responses.

"A train leaves Chicago at 60 mph. Another leaves New York at 80 mph. They start 900 miles apart and travel toward each other. How many hours until they meet?"

The step proceeds in stages: sample the group, score each response, compute the group statistics (mean reward μ, standard deviation σ, max reward), then convert each reward into a group-relative advantage.
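
For the train prompt above, one step can be traced by hand. The sketch below assumes six hypothetical sampled answers (they are invented for illustration, not model outputs) and a binary reward with a small numeric tolerance, then computes the group statistics and advantages.

```python
import statistics

TRUE_HOURS = 900 / (60 + 80)  # closing speed is 140 mph, so about 6.43 hours

# Hypothetical final answers (in hours) parsed from G = 6 sampled responses.
sampled_answers = [6.43, 11.25, 6.43, 45.0, 6.43, 6.43]

# Binary reward: 1.0 when the answer is within a small tolerance of the truth.
rewards = [1.0 if abs(a - TRUE_HOURS) < 0.01 else 0.0 for a in sampled_answers]

mu = statistics.mean(rewards)       # group mean
sigma = statistics.pstdev(rewards)  # population std dev (divides by G)
advantages = [(r - mu) / sigma for r in rewards]

print(f"G = {len(rewards)}, mean = {mu:.3f}, std = {sigma:.3f}, max = {max(rewards)}")
for answer, reward, adv in zip(sampled_answers, rewards, advantages):
    print(f"answer = {answer:>6.2f} h  reward = {reward:.0f}  advantage = {adv:+.3f}")
```

Four of the six answers are correct, so the mean is about 0.667 and the standard deviation about 0.471. Each correct response receives an advantage of about +0.707 and each wrong one about -1.414: the policy is pushed gently toward the correct answers and more strongly away from the two mistakes.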

4 The GRPO Objective

For each prompt \(x\), sample \(G\) responses from the old policy \(\pi_{\theta_\text{old}}\). Score each with a reward function \(r(\cdot)\). The GRPO loss is:

\[ \mathcal{L}_\text{GRPO}(\theta) = -\frac{1}{G} \sum_{i=1}^G \min\!\Bigl( \rho_i \,\hat{A}_i,\; \operatorname{clip}(\rho_i, 1{-}\varepsilon, 1{+}\varepsilon)\,\hat{A}_i \Bigr) + \beta\,\mathrm{KL}\!\left[\pi_\theta \;\|\; \pi_\text{ref}\right] \]

Where each term does the following:

\(\rho_i = \pi_\theta(y_i \mid x) / \pi_{\theta_\text{old}}(y_i \mid x)\)

The probability ratio: how much more or less likely is the current policy to produce response \(y_i\) compared to when it was sampled? This is the PPO importance-sampling ratio.

\(\hat{A}_i = (r_i - \mu) / \sigma\)

The group-relative advantage: a z-score comparing response \(i\)'s reward to the group mean. Positive means above average; negative means below average for this prompt.

\(\operatorname{clip}(\rho_i, 1{-}\varepsilon, 1{+}\varepsilon)\)

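The PPO clip: prevents the policy from taking steps that are too large. If the ratio moves more than \(\varepsilon\) (typically about 0.2) away from 1 in the direction the advantage favors, the clipped term wins the \(\min\) and that response stops contributing gradient. This keeps training stable.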

\(\beta\,\mathrm{KL}[\pi_\theta \| \pi_\text{ref}]\)

The KL penalty: prevents the policy from diverging too far from the reference (SFT) model. Without this, the model can reward-hack into degenerate behavior. \(\beta\) controls the strength.
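
The four terms above can be combined into a small numeric sketch of the loss value for one prompt. Several things here are assumptions rather than anything stated above: the function name `grpo_loss`, the use of whole-sequence log-probabilities as inputs, the default `beta`, and the particular non-negative KL estimator. Plain Python only evaluates the loss; a real training loop would compute the same expression in an autodiff framework to obtain gradients.

```python
import math

def grpo_loss(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Sequence-level GRPO loss for one prompt's group of G responses.

    Each argument is a list with one entry per response: the summed
    log-probability of the response under the current, old (sampling), and
    reference policies, plus its group-relative advantage.
    eps and beta are illustrative defaults.
    """
    g = len(advantages)
    surrogate = 0.0
    kl = 0.0
    for lp_new, lp_old, lp_ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        rho = math.exp(lp_new - lp_old)                 # importance ratio
        rho_clipped = min(max(rho, 1 - eps), 1 + eps)   # clip to [1-eps, 1+eps]
        surrogate += min(rho * adv, rho_clipped * adv)  # pessimistic bound
        # Assumed per-sample KL estimator: ref/new - log(ref/new) - 1 (>= 0).
        log_ratio = lp_ref - lp_new
        kl += math.exp(log_ratio) - log_ratio - 1
    return -surrogate / g + beta * kl / g

# The ratio exp(0.5) is about 1.65, so a positive advantage is capped at 1.2.
clipped = grpo_loss([-9.5], [-10.0], [-9.5], [1.0], beta=0.0)
# With a negative advantage the min keeps the unclipped (more pessimistic) term.
unclipped = grpo_loss([-9.5], [-10.0], [-9.5], [-1.0], beta=0.0)
print(clipped, unclipped)
```

The two calls show the asymmetry of the clip: gains from raising the probability of a good response are capped, while the penalty for raising the probability of a bad response is not.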

GRPO vs standard REINFORCE: GRPO is essentially REINFORCE with a group-normalized baseline (mean and standard deviation), PPO clipping, and a KL penalty. The clip is key: it bounds how far each update can move the policy from the one that generated the samples, which makes the optimization more stable than vanilla policy gradient.

What counts as a reward function in GRPO?

GRPO rewards can be rule-based rather than neural. For mathematical reasoning tasks (such as those used to train DeepSeek-R1), the reward can be as simple as:

\( r(y, y^*) = \begin{cases} 1.0 & \text{if final answer matches ground truth } y^* \\ 0.0 & \text{otherwise} \end{cases} \)

Additional format rewards (did the model use a chain-of-thought format? did it produce well-structured reasoning?) can be added. The binary nature of these rewards is exactly why group statistics work well — they create a meaningful signal even without a neural reward model.
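
A rule-based reward of this kind might look like the sketch below. The output convention (reasoning inside `<think>` tags followed by a final `Answer:` line), the function names, the numeric tolerance, and the 0.1 format weight are all hypothetical choices made for illustration.

```python
import re

# Hypothetical output convention: reasoning inside <think>...</think>,
# followed by a final line of the form "Answer: <number>".
ANSWER_RE = re.compile(r"Answer:\s*(-?\d+(?:\.\d+)?)\s*$")
THINK_RE = re.compile(r"<think>.+?</think>", re.DOTALL)

def accuracy_reward(response, ground_truth, tol=1e-2):
    """1.0 if the parsed final answer matches the ground truth, else 0.0."""
    match = ANSWER_RE.search(response.strip())
    if match is None:
        return 0.0  # no parseable final answer
    return 1.0 if abs(float(match.group(1)) - ground_truth) <= tol else 0.0

def format_reward(response, weight=0.1):
    """Small bonus (assumed weight) when reasoning is wrapped in think tags."""
    return weight if THINK_RE.search(response) else 0.0

def total_reward(response, ground_truth):
    return accuracy_reward(response, ground_truth) + format_reward(response)

truth = 900 / (60 + 80)
good = "<think>Closing speed is 140 mph; 900 / 140 = 6.43</think>\nAnswer: 6.43"
bad = "Answer: 11.25"
print(total_reward(good, truth), total_reward(bad, truth))
```

Because both checks are deterministic string rules, no reward model needs to be loaded or queried during training for this kind of task.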

GRPO vs PPO — Side by Side

Property | PPO | GRPO
Advantage estimation | Learned value (critic) network | Group-relative z-score, no training
Models in memory | 4 (actor, reference, critic, RM) | 3 (policy, reference, RM)
Reward type | Dense or sparse, neural RM typical | Works with sparse binary rewards
Stability | Can be unstable; critic must be trained carefully | More stable; no critic instability
Samples per prompt | 1–2 (typical) | G = 4–16 (higher memory per batch)
Best for | General RLHF with dense rewards | Math, code, reasoning with verifiable rewards
Used in | InstructGPT, Claude, most RLHF pipelines | DeepSeekMath, DeepSeek-R1, QwQ

When to choose GRPO: If your reward signal is verifiable and rule-based (math answers, code tests, structured output), GRPO removes the need to train and maintain a critic for advantage estimation, and the rule-based reward can replace a neural reward model. It's simpler to implement, keeps fewer models in memory, and has produced strong results on reasoning benchmarks (DeepSeekMath, DeepSeek-R1).