Capstone 08 — Full RLHF Pipeline (Reward Model + PPO)

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 3–4 weeks

Real-world parallel: the alignment pipeline behind ChatGPT, Claude (RLHF / RLAIF / Constitutional AI), Gemini, Llama-3-Instruct. The capstone for alignment / post-training roles at frontier labs. Complements Capstone-04 (DPO) by going through the full PPO path InstructGPT used.

Goals

Reproduce the InstructGPT / Llama-2-Chat post-training recipe end-to-end on a 7–8B base model:

SFT on instruction-following data (reuse Capstone-04's pipeline).
Reward Model (RM) training: Bradley-Terry pairwise loss on preference data.
PPO with KL penalty: optimize the SFT model against the RM, with KL anchor to SFT.
Comparison vs DPO (Capstone-04): which produced better win-rate, at what compute cost?
Bonus: Constitutional AI / RLAIF: replace the human preference labels with model-generated critiques (Anthropic's CAI recipe).
Eval suite: win-rate vs SFT, MMLU (alignment tax), Anthropic HH-RLHF eval, reward-hacking detection.

Architecture

   ┌──────────────────────────────────────────────────────────────┐
   │ Stage 1: SFT (reused from Capstone-04)                       │
   │  Llama-3-8B base → SFT model π_sft                           │
   └─────────────────────┬────────────────────────────────────────┘
                         ▼
   ┌──────────────────────────────────────────────────────────────┐
   │ Stage 2: Reward Model                                        │
   │  - Init from π_sft (or smaller if compute-bound)             │
   │  - Replace LM head with scalar value head                    │
   │  - Bradley-Terry loss:                                       │
   │      L = -log σ(r(x, y_chosen) - r(x, y_rejected))           │
   │  - Train on Anthropic HH-RLHF or your own preferences        │
   │  - Output: reward model r_φ                                  │
   └─────────────────────┬────────────────────────────────────────┘
                         ▼
   ┌──────────────────────────────────────────────────────────────┐
   │ Stage 3: PPO Training                                        │
   │                                                              │
   │  For each step:                                              │
   │   1. Sample prompts → generate y from π_θ (current policy)   │
   │   2. Score y with r_φ → scalar reward                        │
   │   3. Compute KL(π_θ || π_sft) per token (KL penalty)         │
   │   4. Total reward: r_φ(x,y) - β·KL(π_θ || π_sft)             │
   │   5. PPO update with GAE advantages, clipped ratio           │
   │                                                              │
   │  Components:                                                 │
   │   - π_θ:    policy (LoRA on π_sft, trainable)                │
   │   - π_ref:  frozen reference (= π_sft, for KL anchor)        │
   │   - V_ψ:    value head (trainable, on top of π_θ)            │
   │   - r_φ:    frozen reward model                              │
   └─────────────────────┬────────────────────────────────────────┘
                         ▼
   ┌──────────────────────────────────────────────────────────────┐
   │ Eval: π_ppo vs π_sft vs π_dpo (Capstone-04)                  │
   │  - GPT-4-judged win-rate                                     │
   │  - Reward score on held-out prompts (overfitting check)      │
   │  - MMLU 5-shot (alignment tax)                               │
   │  - Reward-hacking detection (length explosion, sycophancy)   │
   └──────────────────────────────────────────────────────────────┘
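
Steps 3 and 4 of Stage 3 can be sketched in a few lines. This is a minimal illustration, not a trainer: the function name, the `beta` default, and the choice to add the scalar RM score on the final token (a common convention, not something the diagram specifies) are all assumptions. The per-token KL term uses the log-probability difference of the sampled token as a single-sample estimate.

```python
def shaped_rewards(logp_policy, logp_ref, rm_score, beta=0.05):
    """Per-token rewards for one sampled response.

    logp_policy / logp_ref: log-probs of the sampled tokens under pi_theta
    and the frozen pi_ref. beta=0.05 is an illustrative value only.
    """
    if not logp_policy or len(logp_policy) != len(logp_ref):
        raise ValueError("need two non-empty log-prob lists of equal length")
    # Single-sample KL estimate per token: log pi_theta(a_t) - log pi_ref(a_t).
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    # Assumption: the sequence-level RM score is credited to the last token.
    rewards[-1] += rm_score
    return rewards


example = shaped_rewards(
    logp_policy=[-1.0, -0.5, -2.0],
    logp_ref=[-1.2, -0.5, -1.5],
    rm_score=1.0,
    beta=0.1,
)
```

Summing the returned list gives the sequence-level quantity from step 4, the RM score minus `beta` times the summed per-token KL estimates.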

Suggested Stack

Component	Choice
Base	Llama-3-8B (or Qwen2-7B for permissive license)
SFT data	Reuse Capstone-04 SFT data
Preference data	Anthropic HH-RLHF (`Anthropic/hh-rlhf`) or `argilla/distilabel-...`
Framework	`trl` (`RewardTrainer`, `PPOTrainer`); `peft` for LoRA
Quantization	QLoRA NF4 for memory (PPO keeps the policy, reference, and reward model resident, plus a value head)
Tracking	Weights & Biases (PPO needs very detailed logs)
Eval judge	GPT-4-turbo (with position-bias controls)
Compute	4× A100 80GB minimum; 8× preferred

Deliverables Checklist

Reward Model

rm/data.py — preference-pair loader, length filtering
rm/model.py — value-head wrapper around base model
rm/train.py — Bradley-Terry loss training loop
rm/eval.py — accuracy on held-out preferences (target ≥ 70%); calibration plot
rm/MODEL_CARD.md — known biases (length, sycophancy proxies)
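
The Bradley-Terry loss and the held-out accuracy metric for `rm/train.py` and `rm/eval.py` can be sketched on plain floats. This is a stdlib-only illustration; a real training loop would apply the same formula to batched tensors. The function names are hypothetical.

```python
import math


def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), computed without overflow."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch so exp() never sees a
    # large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(score_pairs):
    """Fraction of (chosen, rejected) score pairs ranked correctly."""
    if not score_pairs:
        raise ValueError("score_pairs must be non-empty")
    correct = sum(1 for chosen, rejected in score_pairs if chosen > rejected)
    return correct / len(score_pairs)
```

With equal scores the loss is log 2 (about 0.693), which is the value an untrained reward model should start near.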

PPO

ppo/ppo_trainer.py — full GAE + clipped-ratio PPO with KL penalty
ppo/rollout.py — efficient batched generation for rollouts
ppo/value_head.py — scalar value prediction
ppo/configs/llama3_8b.yaml — every hyperparameter
ppo/diagnostics/ — KL divergence, reward, value loss, policy loss, response length over time
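
The two numerical cores of `ppo/ppo_trainer.py`, GAE and the clipped ratio, can be sketched for a single response as below. This is a scalar illustration only: the function names are hypothetical, `gamma=1.0` and `eps=0.2` are illustrative defaults, and the value after the final token is assumed to be zero because the episode ends there.

```python
import math


def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response.

    rewards[t] and values[t] are per-token; V after the last token is
    taken to be 0 (assumption: the episode terminates there).
    """
    if len(rewards) != len(values):
        raise ValueError("rewards and values must have the same length")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(logp_new, logp_old, advantage, eps=0.2):
    """PPO objective term for one token (to be maximized)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)
```

Taking the minimum means the clip only ever removes incentive: a ratio of 1.5 with a positive advantage is capped at 1.2, while the same ratio with a negative advantage is penalized in full.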

Optional: RLAIF / Constitutional AI

cai/constitution.md — your principles (e.g., helpful, harmless, honest)
cai/critique_revise.py — model self-critiques and revises a response
cai/preference_gen.py — model-generated preferences from critiques

Evaluation

eval/winrate.py — judge eval with random ordering, length-control, multi-judge
eval/reward_hacking.py — detect length blow-up, repetition, formatting tics, refusal explosion
EVAL_REPORT.md — π_sft vs π_dpo vs π_ppo, by metric, with cost table
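
One way to control position bias in `eval/winrate.py` is sketched below. Instead of the random ordering listed above, it queries the judge in both orders and records a tie when the two verdicts disagree, which is a stricter variant. The `judge` callable and its `"first"` / `"second"` return convention are hypothetical stand-ins for a real judge API call; the two toy judges exist only to exercise the logic.

```python
def debiased_verdict(judge, prompt, resp_a, resp_b):
    """Return 'a', 'b', or 'tie' after judging both presentation orders."""
    first_pass = judge(prompt, resp_a, resp_b)   # A shown first
    second_pass = judge(prompt, resp_b, resp_a)  # B shown first
    a_won_first = first_pass == "first"
    a_won_second = second_pass == "second"
    if a_won_first and a_won_second:
        return "a"
    if not a_won_first and not a_won_second:
        return "b"
    return "tie"  # verdict flipped with position, so it is not trusted


def win_rate(verdicts):
    """Win-rate of model A, counting each tie as half a win."""
    if not verdicts:
        raise ValueError("verdicts must be non-empty")
    score = sum(1.0 if v == "a" else 0.5 if v == "tie" else 0.0 for v in verdicts)
    return score / len(verdicts)


def position_biased_judge(prompt, first, second):
    """Toy judge that always prefers whatever it sees first."""
    return "first"


def longer_is_better_judge(prompt, first, second):
    """Toy judge with a pure length bias."""
    return "first" if len(first) >= len(second) else "second"
```

A purely position-biased judge produces only ties here, so its bias cannot inflate the win-rate. The length-biased toy judge still yields consistent verdicts, which is why length control has to be handled separately.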

Production

Merged π_ppo BF16 model
Inference container (vLLM)
WRITEUP.md — what failed (PPO will fail many times); how you diagnosed each

Resume Bullet Pattern

Implemented full RLHF pipeline (SFT → reward model → PPO with KL anchor) on Llama-3-8B; achieved 64% GPT-4-judged win-rate vs SFT baseline with controlled 1.5-point MMLU alignment tax. Compared head-to-head with DPO on identical data, finding PPO +3% win-rate at 6× compute cost. [report + model]

Interview Talking Points

The PPO objective in full: $\max_\theta \mathbb{E}_{x \sim D,\, y \sim \pi_\theta}[r_\phi(x, y)] - \beta \cdot \text{KL}(\pi_\theta \,\|\, \pi_{\text{ref}})$. Per-token implementation details.
GAE (Generalized Advantage Estimation): $\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma \lambda)^l \delta_{t+l}$ where $\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$. Why $\lambda \approx 0.95$ in practice.
PPO clipped objective: $\min(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)$ where $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\text{old}}(a_t|s_t)$ is the probability ratio, not the reward $r_t$ from GAE. Why clipping prevents catastrophic updates.
DPO derivation: closed-form solution to the same KL-constrained objective; how it bypasses the reward model. When PPO still wins (online exploration of preferences).
Reward hacking taxonomy: length explosion (more tokens = more reward), formatting tics (bullet points score high), sycophancy ("Great question!"), refusal escalation. Mitigations: length-normalized reward, RM ensembling, on-policy data collection.
KL coefficient tuning: too low → policy drifts, reward hacks; too high → no learning. Adaptive KL controllers (target-KL).
Reward model quality bottleneck: PPO can only be as good as r_φ. Why preference data quality and RM ensembling matter more than PPO knobs.
Memory architecture of PPO: 4 model copies (policy, ref, value, RM); LoRA + shared frozen base reduces this drastically. How to sequence the forward passes.
Constitutional AI / RLAIF: replacing humans with the model itself for preference labeling — Anthropic's recipe. When it works (broad principles) vs fails (subjective taste).
The RLHF ROI debate (2024–2026): are DPO/IPO/KTO actually as good as PPO at lower complexity? Your benchmark contributes data.
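
The adaptive (target-KL) controller mentioned above can be sketched as a proportional update in the style of Ziegler et al. (2019). The class name, argument names, and the example numbers are hypothetical, and this is not the API of any particular library.

```python
class AdaptiveKLController:
    """Nudges beta so the observed KL drifts toward a target value."""

    def __init__(self, init_beta, target_kl, horizon):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon  # larger horizon means slower adaptation

    def update(self, observed_kl, n_steps):
        # Relative error, clipped so one noisy batch cannot swing beta far.
        error = max(-0.2, min(observed_kl / self.target_kl - 1.0, 0.2))
        self.beta *= 1.0 + error * n_steps / self.horizon
        return self.beta


controller = AdaptiveKLController(init_beta=0.1, target_kl=6.0, horizon=10_000)
beta_after_high_kl = controller.update(observed_kl=12.0, n_steps=256)
```

When the observed KL is above target, `beta` grows and the penalty tightens; below target it shrinks, which lets the policy move further from the reference.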

Getting Started

Reuse Capstone-04 SFT. Don't redo it.
Build the reward model first. Easiest stage; clean signal. Train on 50k HH-RLHF pairs. Target accuracy ≥ 70% on held-out.
Sanity-check the RM: for 10 prompts, generate 5 good and 5 obviously bad completions each; verify the good ones consistently score higher.
Set up PPO at miniature scale first: 1 GPU, 1B model (TinyLlama), 200 prompts. Get the loop working before scaling.
Watch KL divergence like a hawk. If it explodes after a few steps, your KL coefficient is too low or your value function is broken.
Scale to Llama-3-8B with QLoRA. 4× A100 80GB minimum. Total: ~3–5 days of training time.
Run reward-hacking diagnostics every 100 steps. Length plot, RM-train vs RM-eval reward gap (overfitting), refusal rate.
Eval rigorously: position-bias control (random ordering), length control (tell judge to ignore length), multi-judge ensemble (Sonnet + GPT-4).
Compare to your DPO model from Capstone-04. Honest table. If DPO matches PPO at less compute, that's the most interesting result you can publish.
Write up the failures. Every RLHF practitioner has stories of mode collapse, reward hacking, KL explosion. Yours will be valuable.
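
The periodic reward-hacking diagnostics from the steps above (length, train-vs-held-out reward gap, refusal rate) can be sketched as one report function. The thresholds, the refusal marker phrases, and the whitespace token count are illustrative assumptions, not recommended values.

```python
from statistics import mean

# Hypothetical marker phrases; a real check would need a broader list
# or a classifier.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry")


def refusal_rate(responses):
    """Fraction of responses that open with a refusal marker."""
    hits = sum(1 for r in responses if r.strip().lower().startswith(REFUSAL_MARKERS))
    return hits / len(responses)


def hacking_report(baseline_responses, current_responses,
                   train_reward, heldout_reward,
                   max_length_ratio=1.5, max_reward_gap=0.5):
    """Compare current policy samples against SFT-baseline samples."""
    base_len = mean(len(r.split()) for r in baseline_responses)
    cur_len = mean(len(r.split()) for r in current_responses)
    length_ratio = cur_len / base_len
    reward_gap = train_reward - heldout_reward
    return {
        "length_ratio": length_ratio,
        "reward_gap": reward_gap,
        "refusal_rate": refusal_rate(current_responses),
        "length_explosion": length_ratio > max_length_ratio,
        "rm_overfit": reward_gap > max_reward_gap,
    }
```

Logging this dictionary every 100 steps alongside KL and reward makes it easier to see whether a reward increase coincides with longer outputs or more refusals.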

Stretch Goals

DPO / IPO / KTO ablation: implement all three on the same data; one plot showing tradeoffs.
Iterative DPO / Online DPO: round 1 DPO → sample new responses → re-label → round 2 DPO. Closes the gap to PPO.
Process reward models (PRM): step-level rewards for math/code (vs final-answer outcome reward). Foundation for OpenAI o1-style reasoning RL.
GRPO (Group Relative Policy Optimization, DeepSeekMath): no value head, group-baseline normalized rewards. Memory-efficient.
RM ensemble + uncertainty-weighted reward: reduces reward hacking measurably.
Multi-objective reward (helpfulness + harmlessness as separate heads, weighted in PPO).
Constitutional AI end-to-end: zero human preference labels, pure RLAIF. Compare to RLHF.
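
For the GRPO stretch goal, the group-baseline normalization that replaces the value head can be sketched as below. The function name is hypothetical, and using the population standard deviation with a small `eps` is an assumption; implementations differ on both details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for G responses sampled from the same prompt.

    Each response's advantage is its reward standardized against the
    group, so no learned value function is needed.
    """
    if len(rewards) < 2:
        raise ValueError("need at least two responses per prompt")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division finite when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every response in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, so group size and sampling temperature matter.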

What This Capstone Proves About You

You can implement and debug the most complex training pipeline in modern AI. You understand the math (Bradley-Terry, GAE, PPO clip, KL constraint), the engineering (4 model copies, careful memory management), and the empirics (reward hacking, KL explosion, judge bias). You can articulate when DPO/IPO/KTO suffice and when full PPO is worth the complexity.

This is the bar for Alignment Engineer / Post-Training Researcher roles at Anthropic (the inventors of CAI), OpenAI (RLHF originators), DeepMind, Meta (Llama post-training), and any frontier lab building aligned models. Vanishingly few engineers have actually shipped full RLHF, so having it in your portfolio is a rare signal.
