Reinforcement Learning from Human Feedback
How do we teach a language model to be helpful, harmless, and honest? RLHF is the three-phase pipeline that turned raw language models into assistants like ChatGPT.
The Problem: Raw LMs Are Not Assistants
A pretrained language model is trained to predict the next token given everything before it. It's excellent at completing text — but it doesn't know that you want a helpful answer to your question. It might complete your question with another question, or generate plausible-sounding nonsense.
Base Model (unhelpful)
RLHF Model (helpful)
sorted() function or the .sort() method:
nums = [3, 1, 4, 1, 5]
sorted_nums = sorted(nums) # new list
nums.sort() # in-place
For reverse order, add reverse=True.
1 Supervised Fine-Tuning (SFT)
The first step is to teach the model the format of a helpful conversation. Human contractors write ideal responses to a diverse set of prompts. The model is fine-tuned on these (prompt, ideal response) pairs using standard cross-entropy loss.
Sample SFT Training Data
Each example pairs a user prompt with a human-written demonstration of the desired behavior.
| Prompt | Human-Written Ideal Response |
|---|---|
| Summarize the French Revolution. | The French Revolution (1789–1799) was a period of radical political change that overthrew the monarchy, established a republic, and ended with Napoleon's rise. Key causes included financial crisis, inequality, and Enlightenment ideas. |
| What's the difference between RAM and ROM? | RAM (Random Access Memory) is temporary, fast storage that loses data when power is off. ROM (Read-Only Memory) is permanent storage that retains data without power, used for firmware. Your running programs live in RAM; your computer's boot instructions live in ROM. |
| Write a haiku about debugging. | Semicolon missed— the program crashes at dawn. Coffee grows cold. Wait. |
| How do vaccines work? | Vaccines introduce a harmless piece of a pathogen (a protein, dead virus, or mRNA instructions) to train your immune system. Your body builds antibodies, so if you encounter the real pathogen later, your immune system can respond faster and prevent serious illness. |
After SFT, the model is called the SFT model or reference policy. A frozen copy also serves as a regularizer during the final RL phase: the KL penalty measures how far the RL policy has drifted from it, which keeps training from going off the rails.
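The cross-entropy fine-tuning step in this section can be sketched in a few lines. This is a minimal illustration using only the standard library, not the pipeline any particular lab uses. The names `log_softmax`, `sft_loss`, and `response_mask` are hypothetical, and restricting the loss to response tokens (so the prompt is not itself a training target) is a common convention assumed here rather than something the text above specifies.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vector of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def sft_loss(step_logits, target_ids, response_mask):
    """Mean next-token cross-entropy over positions where response_mask is 1.

    step_logits[t] holds the model's logits for the token at position t,
    target_ids[t] is the token that actually appears there, and
    response_mask[t] is 1 for response tokens and 0 for prompt tokens
    (the masking is an assumption of this sketch).
    """
    total, count = 0.0, 0
    for logits, target, keep in zip(step_logits, target_ids, response_mask):
        if keep:
            total -= log_softmax(logits)[target]
            count += 1
    return total / count if count else 0.0


# Toy example: vocabulary of 3 tokens, 3 positions, the first is a prompt token.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
toy_targets = [0, 1, 2]
toy_mask = [0, 1, 1]
toy_loss = sft_loss(toy_logits, toy_targets, toy_mask)
```

In the toy example the second position is predicted confidently and the third is a uniform guess, so the average loss sits between a small value and log 3.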
2 Collect Human Preferences
Rather than writing ideal responses, it's easier and faster for humans to compare two responses and pick the better one. Even if you can't define "good", you can usually tell which of two answers is better. This pairwise comparison data trains the reward model.
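One way to store such comparisons is as (prompt, chosen, rejected) records. The sketch below is illustrative only: `PreferencePair` and `record_comparison` are hypothetical names, and the two example responses are invented for the demonstration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the labeler preferred
    rejected: str  # response the labeler passed over


def record_comparison(prompt, response_a, response_b, labeler_picks_a):
    """Turn one labeler decision into a (chosen, rejected) training record."""
    if labeler_picks_a:
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    return PreferencePair(prompt, chosen=response_b, rejected=response_a)


# Invented example responses; only the labeler's pick matters here.
pair = record_comparison(
    prompt="How do I become a better programmer?",
    response_a="Practice regularly, read other people's code, and ask for review.",
    response_b="Just code more.",
    labeler_picks_a=True,
)
```

Storing the winner and loser explicitly, rather than an A/B flag, means the training code never has to remember which side was shown first.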
2 Train the Reward Model
The reward model (RM) is a neural network that takes a (prompt, response) pair and outputs a scalar score indicating how good the response is. It's trained on the pairwise preference data using the Bradley-Terry model:
\[ P(y_A \succ y_B \mid x) = \sigma\big(r_\theta(x, y_A) - r_\theta(x, y_B)\big) \]
where \(r_\theta(x, y)\) is the learned reward, \(\sigma\) is the sigmoid function, and \(y_A \succ y_B\) means "A is preferred over B". The loss is the binary cross-entropy between this predicted probability and the human label.
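The Bradley-Terry objective is easy to check numerically. The sketch below assumes the usual form, in which the probability that A beats B is the sigmoid of the reward difference. It takes the two scalar rewards as plain floats, with no neural network involved, and the function names are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def preference_probability(reward_a, reward_b):
    """Bradley-Terry probability that A is preferred over B."""
    return sigmoid(reward_a - reward_b)


def pairwise_loss(reward_chosen, reward_rejected):
    """Binary cross-entropy with label 'chosen wins': -log sigmoid(d).

    Computed as softplus(-d) so that large reward gaps do not hit log(0).
    """
    d = reward_chosen - reward_rejected
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


p_equal = preference_probability(1.0, 1.0)  # equal rewards: a coin flip
loss_good = pairwise_loss(2.0, 0.0)         # RM agrees with the labeler
loss_bad = pairwise_loss(0.0, 2.0)          # RM disagrees with the labeler
```

Only the difference between the two rewards enters the loss, so the RM's scores are meaningful relative to each other rather than on an absolute scale.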
Reward Model Scoring
[Interactive demo: the RM assigns a score to each of 5 responses to the prompt "How do I become a better programmer?"]
3 RL Optimization with PPO
Now we use the reward model as a signal to improve the language model policy using Proximal Policy Optimization (PPO). The SFT model is the starting point; PPO nudges it toward responses the RM scores highly, while a KL penalty keeps it from straying too far.
Training loop: the policy generates a response → the reward model scores the response → PPO updates the policy weights.
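The PPO objective clips the policy update to prevent large steps that could destabilize training. A KL divergence penalty against the SFT reference policy is subtracted from the objective to discourage reward hacking:
\[ J(\theta) = \mathbb{E}_t\Big[\min\big(\rho_t \hat{A}_t,\ \operatorname{clip}(\rho_t,\, 1-\varepsilon,\, 1+\varepsilon)\,\hat{A}_t\big)\Big] - \beta\, D_\text{KL}\big(\pi_\theta \,\|\, \pi_\text{SFT}\big) \]
where \(\rho_t = \pi_\theta(y_t \mid x, y_{<t}) / \pi_\text{old}(y_t \mid x, y_{<t})\) is the probability ratio, \(\hat{A}_t\) is the advantage estimate (from a learned value function), \(\varepsilon \approx 0.2\) clips the ratio, \(\pi_\text{SFT}\) is the frozen SFT reference policy, and \(\beta\) controls the KL penalty strength.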
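The clipping and the KL penalty can be sketched numerically on per-token log-probabilities. This is a simplified illustration under several assumptions: the function names are hypothetical, `beta=0.1` is an arbitrary default, the advantages are supplied directly rather than produced by a value network, and the KL term is approximated by the average log-ratio between the current policy and the reference on the sampled tokens, which is only a rough sample estimate.

```python
import math


def clipped_term(ratio, advantage, eps=0.2):
    """min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) for one token."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.1):
    """Average clipped surrogate minus beta times a sampled KL estimate.

    All arguments are per-token lists of equal length. logp_ref comes from
    the frozen SFT reference policy; beta=0.1 is an arbitrary choice here.
    """
    n = len(advantages)
    surrogate = 0.0
    kl_estimate = 0.0
    for new, old, ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        ratio = math.exp(new - old)              # probability ratio rho_t
        surrogate += clipped_term(ratio, adv, eps)
        kl_estimate += new - ref                 # crude per-token KL sample
    return surrogate / n - beta * kl_estimate / n


# A ratio of 1.5 with positive advantage is capped at 1 + eps = 1.2.
capped = clipped_term(1.5, 1.0)
# With negative advantage the unclipped (more pessimistic) term is kept.
uncapped = clipped_term(1.5, -1.0)
```

The `min` makes the objective pessimistic: the policy gets no extra credit for pushing the ratio beyond the clip range in the direction the advantage favors, but it is still fully penalized when a move outside the range makes things worse.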
Before vs. After RLHF
The same prompt sent to a base model, an SFT model, and a fully RLHF-trained model produces very different results.
[Comparison panels: Base LM, SFT Model, RLHF Model]
RLHF Limitations to Know
- Reward hacking: The model learns to game the RM — producing responses that score well but aren't truly helpful.
- RM imperfection: The RM is a proxy; it will disagree with humans in edge cases.
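- Value network complexity: PPO requires training a separate value (critic) network alongside the policy, which adds a second model to train and tune. GRPO avoids this by estimating advantages from a group of sampled responses instead of a learned critic.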
- Human labeler bias: Preferences reflect whoever wrote them — not a universal standard of quality.
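As a rough illustration of the GRPO pointer in the list above: instead of a learned critic, GRPO scores several responses to the same prompt and uses each response's standardized reward as its advantage. The sketch below assumes population standard deviation and a small `eps` for numerical safety. Both are choices made for this example, and real implementations may differ.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of responses to the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption here
    return [(r - mean) / (std + eps) for r in rewards]


# Four sampled responses to one prompt, scored by the reward model.
group_advantages = group_relative_advantages([1.0, 2.0, 3.0, 6.0])
```

Responses that beat the group average get a positive advantage and the rest get a negative one, so no separate value network is needed to supply a baseline.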