Training AI Systems

Reinforcement Learning from Human Feedback

How do we teach a language model to be helpful, harmless, and honest? RLHF is the three-phase pipeline that turned raw language models into assistants like ChatGPT.

Pipeline: Base pretrained LM (predicts next tokens) → Phase 1: SFT (learn from demonstrations) → Phase 2: Reward Model (learn from preferences) → Phase 3: PPO (RL optimization)

The Problem: Raw LMs Are Not Assistants

A pretrained language model is trained to predict the next token given everything before it. It's excellent at completing text — but it doesn't know that you want a helpful answer to your question. It might complete your question with another question, or generate plausible-sounding nonsense.

Base Model (unhelpful)

User prompt
How do I sort a list in Python?
Completion
How do I sort a list in JavaScript? How do I sort a list in Java? Here is a comparison of sorting algorithms across languages...

RLHF Model (helpful)

User prompt
How do I sort a list in Python?
Response
Use the built-in sorted() function or the .sort() method:

    nums = [3, 1, 4, 1, 5]
    sorted_nums = sorted(nums)  # new list
    nums.sort()                 # in-place

For reverse order, add reverse=True.

Why is specifying "good" so hard? We can't easily write a mathematical formula that captures what makes a response helpful, harmless, and honest across all possible inputs. Human judgment is much richer than any hand-crafted reward function.

1 Supervised Fine-Tuning (SFT)

The first step is to teach the model the format of a helpful conversation. Human contractors write ideal responses to a diverse set of prompts. The model is fine-tuned on these (prompt, ideal response) pairs using standard cross-entropy loss.
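As a rough sketch of what "standard cross-entropy loss" means here: the loss is the average negative log-probability the model assigns to each token of the human-written response. The function names, the toy logits, and the convention of skipping prompt tokens are illustrative assumptions, not details taken from any particular training codebase.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one position's vocabulary logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def sft_loss(logits, targets, is_response):
    """Mean cross-entropy over response tokens only.

    logits[t]      -- model scores over the vocabulary at position t
    targets[t]     -- id of the correct next token at position t
    is_response[t] -- True if that token belongs to the demonstration
                      (prompt tokens are skipped; an assumed convention)
    """
    losses = [
        -log_softmax(row)[target]
        for row, target, keep in zip(logits, targets, is_response)
        if keep
    ]
    return sum(losses) / len(losses)

# Toy vocabulary of 3 tokens and 3 positions; the first position is prompt.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.0]]
toy_targets = [0, 1, 2]
toy_mask = [False, True, True]
loss = sft_loss(toy_logits, toy_targets, toy_mask)
```

The second position is confidently correct and contributes little loss; the third is a uniform guess and contributes log(3). Fine-tuning lowers this average across all demonstrations.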

Sample SFT Training Data

Each example pairs a user prompt with a human-written demonstration of the desired behavior.

Prompt: Summarize the French Revolution.
Human-written ideal response: The French Revolution (1789–1799) was a period of radical political change that overthrew the monarchy, established a republic, and ended with Napoleon's rise. Key causes included financial crisis, inequality, and Enlightenment ideas.

Prompt: What's the difference between RAM and ROM?
Human-written ideal response: RAM (Random Access Memory) is temporary, fast storage that loses data when power is off. ROM (Read-Only Memory) is permanent storage that retains data without power, used for firmware. Your running programs live in RAM; your computer's boot instructions live in ROM.

Prompt: Write a haiku about debugging.
Human-written ideal response:
    Semicolon missed,
    the program crashes at dawn.
    Coffee grows cold. Wait.

Prompt: How do vaccines work?
Human-written ideal response: Vaccines introduce a harmless piece of a pathogen (a protein, dead virus, or mRNA instructions) to train your immune system. Your body builds antibodies, so if you encounter the real pathogen later, your immune system can respond faster and prevent serious illness.

SFT limitation: The model now knows the format of a good response, but it's still hard to collect enough demonstrations to cover all possible inputs and preferences. This is where human preferences become essential.

After SFT, the model is called the SFT model or reference policy. It also serves as a regularizer during the final RL phase: a KL penalty keeps the RL-trained policy from drifting too far from it.

2 Collect Human Preferences

Rather than writing ideal responses, it's easier and faster for humans to compare two responses and pick the better one. Even if you can't define "good", you can usually tell which of two answers is better. This pairwise comparison data trains the reward model.
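One way to picture the resulting dataset is as a list of (prompt, chosen, rejected) records. The sketch below is illustrative only: the PreferencePair class and the record_choice helper are hypothetical names, and real labeling pipelines usually store more (labeler id, ties, confidence).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    chosen: str    # response the labeler preferred
    rejected: str  # response the labeler did not prefer

def record_choice(prompt, response_a, response_b, choice):
    """Turn a labeler's 'A' or 'B' decision into a (chosen, rejected) record."""
    if choice == "A":
        return PreferencePair(prompt, response_a, response_b)
    if choice == "B":
        return PreferencePair(prompt, response_b, response_a)
    raise ValueError("choice must be 'A' or 'B'")

pair = record_choice(
    "How do I sort a list in Python?",
    "Use sorted(nums) for a new list or nums.sort() to sort in place.",
    "How do I sort a list in JavaScript?",
    choice="A",
)
```

Storing the pair as chosen/rejected rather than A/B removes the display order from the training data, so the reward model only ever sees which response won.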

Your turn: In the interactive version of this page, you act as a human labeler. For each of 3 prompts you choose between Response A and Response B, and your choices are used to "train" the reward model in the next section.

2 Train the Reward Model

The reward model (RM) is a neural network that takes a (prompt, response) pair and outputs a scalar score — how good is this response? It's trained on the pairwise preference data using the Bradley-Terry model:

\( P(y_A \succ y_B \mid x) \;=\; \sigma\!\bigl(r_\theta(x, y_A) - r_\theta(x, y_B)\bigr) \)

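Where \(r_\theta(x, y)\) is the learned reward, \(\sigma\) is the sigmoid function, and \(y_A \succ y_B\) means "A is preferred over B". The loss is the binary cross-entropy between the predicted preference probability and the human label; for a pair where A was preferred, that is \(-\log \sigma\!\bigl(r_\theta(x, y_A) - r_\theta(x, y_B)\bigr)\).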
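The Bradley-Terry probability and its loss fit in a few lines. This is a minimal sketch for a single pair with made-up reward values; in practice the two rewards come from the reward network and the loss is averaged over a batch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def preference_probability(reward_a, reward_b):
    """Bradley-Terry probability that A is preferred over B."""
    return sigmoid(reward_a - reward_b)

def preference_loss(reward_a, reward_b, a_preferred):
    """Binary cross-entropy against the human label for one pair."""
    p_a = preference_probability(reward_a, reward_b)
    return -math.log(p_a if a_preferred else 1.0 - p_a)

p = preference_probability(2.0, 0.5)                      # about 0.818
agree = preference_loss(2.0, 0.5, a_preferred=True)       # about 0.201
disagree = preference_loss(2.0, 0.5, a_preferred=False)   # about 1.701
```

Only the difference between the two rewards matters, so the RM's scores are meaningful relative to each other rather than on an absolute scale.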

Reward Model Scoring

Below are 5 responses to the prompt "How do I become a better programmer?" The RM assigns each a scalar score (the scores are computed in the interactive version of this page).

  • Practice consistently by building real projects. Read production codebases. Learn to debug methodically.
  • Write a lot of code. Try different projects. Watch YouTube tutorials when stuck.
  • Programming is hard. It takes years. You should start with Python. Or maybe JavaScript. Both are good choices for beginners who want to learn web development or data science.
  • just google everything lol. copy paste from stackoverflow. you don't need to understand everything
  • I recommend buying courses from Udemy and getting certifications. The job market values credentials highly so invest in expensive bootcamps first.

Key insight: The reward model is now a proxy for human preference. We can call it millions of times during RL training, something we can't do with actual humans. But it's imperfect: the policy can learn to "game" the RM (reward hacking), which is why the KL penalty in PPO is important.

3 RL Optimization with PPO

Now we use the reward model as a signal to improve the language model policy using Proximal Policy Optimization (PPO). The SFT model is the starting point; PPO nudges it toward responses the RM scores highly, while a KL penalty keeps it from straying too far.

Training loop: 🧠 LM Policy (generates response) → ⚖️ Reward Model (scores response) → 📈 PPO Update (update policy weights) → repeat
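The generate, score, update cycle can be imitated on a toy problem. The sketch below is not PPO: it assumes a "policy" that is just a softmax over three canned responses, uses made-up reward model scores, and replaces PPO's sampled, clipped update with exact gradient ascent on expected reward minus a KL penalty. All names and constants are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def expected_reward(p, rewards):
    return sum(pi * ri for pi, ri in zip(p, rewards))

def train(rewards, beta=0.1, lr=1.0, steps=50):
    """Gradient ascent on E[reward] - beta * KL(policy || reference)."""
    logits = [0.0] * len(rewards)   # policy starts equal to the reference
    reference = softmax(logits)     # stand-in for the frozen SFT policy
    history = []
    for _ in range(steps):
        p = softmax(logits)
        history.append((expected_reward(p, rewards), kl_divergence(p, reference)))
        # KL-adjusted score of each candidate response
        f = [r - beta * math.log(pi / qi) for r, pi, qi in zip(rewards, p, reference)]
        baseline = sum(pi * fi for pi, fi in zip(p, f))
        # Exact gradient of the objective with respect to each logit
        logits = [z + lr * pi * (fi - baseline) for z, pi, fi in zip(logits, p, f)]
    return softmax(logits), history

rm_scores = [0.1, 0.5, 0.9]         # made-up reward model scores
final_policy, history = train(rm_scores)
```

Each entry of history holds the expected reward and the KL from the reference before that step. Raising beta keeps the final policy closer to the reference at the cost of a lower expected reward.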

The PPO objective clips the policy update to prevent large steps that could destabilize training. A KL divergence penalty toward the SFT reference policy is subtracted from the objective to limit reward hacking:

\( \mathcal{L}_\text{PPO} = \mathbb{E}\!\left[\min\!\left(\rho_t \hat{A}_t,\;\operatorname{clip}(\rho_t, 1{-}\varepsilon, 1{+}\varepsilon)\,\hat{A}_t\right)\right] - \beta\,\mathrm{KL}\!\left[\pi_\theta \;\|\; \pi_\text{SFT}\right] \)

Where \(\rho_t = \pi_\theta(y_t \mid x, y_{<t}) / \pi_\text{old}(y_t \mid x, y_{<t})\) is the per-token probability ratio between the current policy and the policy that sampled the response, \(\hat{A}_t\) is the advantage estimate (from a learned value function), \(\varepsilon \approx 0.2\) clips the ratio, and \(\beta\) controls the KL penalty strength.
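To see what the clipping does, here is a minimal sketch of the objective for a handful of tokens. It assumes the ratios, advantages, and a KL estimate are already computed elsewhere; the function names and the value beta=0.05 are illustrative, not taken from any specific implementation.

```python
def clipped_surrogate(ratio, advantage, epsilon=0.2):
    """Per-token PPO term: min(rho * A, clip(rho, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

def ppo_objective(ratios, advantages, kl_estimate, beta=0.05, epsilon=0.2):
    """Mean clipped surrogate minus the KL penalty (a quantity to maximize)."""
    terms = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms) - beta * kl_estimate

# Good token, ratio already above 1 + eps: the gain is capped at 1.2 * A.
capped = clipped_surrogate(ratio=1.5, advantage=1.0)        # 1.2
# Bad token, ratio already below 1 - eps: no extra credit for pushing it lower.
floored = clipped_surrogate(ratio=0.5, advantage=-1.0)      # -0.8
# Bad token whose probability went up: unclipped, the full penalty applies.
penalized = clipped_surrogate(ratio=1.5, advantage=-1.0)    # -1.5

objective = ppo_objective([1.5, 0.5, 1.5], [1.0, -1.0, -1.0], kl_estimate=0.1)
```

Taking the minimum makes the bound one-sided: the policy gains nothing from moving the ratio beyond the clip range in the favorable direction, but it is still fully penalized for moves that make things worse.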

Interactive PPO Training Loop

In the interactive version of this page, each of 12 simulated PPO steps shows the sampled response, its reward model score, the KL from the SFT policy, and the resulting policy update, so you can watch the reward trend as the policy improves.

Before vs. After RLHF

The same prompt sent to a base model, an SFT model, and a fully RLHF-trained model produces very different results.

Prompt: "My code isn't working. How do I fix it?"

🤖 Base LM

My code isn't working either. What language is your code? Here are some common languages: Python, JavaScript, Java, C++, Ruby, Go, Rust, Swift, Kotlin, TypeScript...
Completes the text pattern — doesn't help.

📚 SFT Model

I'd be happy to help! To debug your code: check error messages carefully, add print statements to trace execution, search the error online, and isolate the problem by testing smaller pieces.
Follows the format well, but generic advice.

✨ RLHF Model

Happy to help debug! Could you share:
1. The code snippet
2. The error message (if any)
3. What you expected vs. what happened
That'll let me give you specific, actionable advice rather than generic tips.
Asks for details — genuinely useful interaction.
Key takeaway: RLHF doesn't just make the model "sound" better: it shifts which responses the model produces toward the ones humans prefer. The reward model encodes human judgment, and PPO optimizes for it while the KL penalty keeps the policy close to the SFT model.

RLHF Limitations to Know

  • Reward hacking: The model learns to game the RM — producing responses that score well but aren't truly helpful.
  • RM imperfection: The RM is a proxy; it will disagree with humans in edge cases.
  • Value network complexity: PPO requires training a separate value (critic) network alongside the policy, which adds significant memory and compute overhead. GRPO avoids this by dropping the value network.
  • Human labeler bias: Preferences reflect whoever wrote them — not a universal standard of quality.
What's next? The GRPO demo shows how DeepSeek simplified RLHF by replacing PPO's value network with group-relative advantage estimation.
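As a preview, group-relative advantage estimation can be sketched as follows: sample several responses to the same prompt, score each with the reward model, and normalize each score against the group's mean and standard deviation. The rewards below are made up, and the use of the population standard deviation with a small epsilon is an assumption; implementations differ on those details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own group's mean and spread."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

group_rewards = [0.2, 0.5, 0.9, 0.4]   # RM scores for 4 responses to one prompt
advantages = group_relative_advantages(group_rewards)
```

Responses that beat their group's average get a positive advantage and the rest a negative one, so no learned value function is needed to supply a baseline.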