Capstone 04 — Domain Assistant via SFT + DPO
Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 2–3 weeks
Demonstrates the full alignment pipeline: synthetic data generation → SFT → DPO → eval. The skill set behind every "we fine-tuned Llama for X" startup.
Goals
- Pick a domain (medical Q&A, legal summarization, code review, customer support, etc.).
- Generate or curate 5k–20k SFT examples + 2k–5k DPO preference pairs.
- SFT a 7B–8B base model with QLoRA.
- DPO on top of the SFT model with the preference pairs.
- Evaluate win-rate vs the base model via LLM-as-judge, plus retain-task scores (MMLU) to measure the alignment tax.
- Ship the model + eval report + Docker for inference.
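The win-rate and alignment-tax goals above can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: the `judge` callable, its `"first"` / `"second"` / `"tie"` return values, and the function names are all hypothetical. Each pair is judged in both presentation orders so that a judge that simply favours one position nets out to 0.5.

```python
from typing import Callable, Sequence

# Hypothetical judge interface: judge(prompt, first, second) returns
# "first", "second", or "tie". In practice this would wrap an LLM call.
Judge = Callable[[str, str, str], str]


def score_pair(prompt: str, cand: str, base: str, judge: Judge) -> float:
    """Candidate's score in [0, 1], averaged over both presentation orders."""
    # Order 1: candidate is shown first.
    as_first = {"first": 1.0, "second": 0.0, "tie": 0.5}[judge(prompt, cand, base)]
    # Order 2: candidate is shown second.
    as_second = {"first": 0.0, "second": 1.0, "tie": 0.5}[judge(prompt, base, cand)]
    return (as_first + as_second) / 2


def win_rate(prompts: Sequence[str], cand_outputs: Sequence[str],
             base_outputs: Sequence[str], judge: Judge) -> float:
    """Mean order-averaged score of the candidate model over all prompts."""
    if not prompts:
        raise ValueError("need at least one prompt")
    scores = [score_pair(p, c, b, judge)
              for p, c, b in zip(prompts, cand_outputs, base_outputs)]
    return sum(scores) / len(scores)


def alignment_tax(base_score: float, tuned_score: float) -> float:
    """Points lost on a retain task (e.g. MMLU); positive means degradation."""
    return base_score - tuned_score
```

Averaging over both orders doubles the judge cost, which is usually acceptable for an eval set of a few hundred prompts.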
Architecture
┌────────────────────────────────────────────────────────────┐
│ Stage 1: Synthetic Data Generation │
│ - Seed prompts (curated by you, 50-200 examples) │
│ - Generate variations with a strong model (GPT-4 / Claude)│
│ - Self-Instruct loop or domain-specific templates │
│ - Output: sft.jsonl (5k-20k {prompt, completion} pairs) │
└────────────────────┬───────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────┐
│ Stage 2: SFT with QLoRA (Phase-6 lab-02 patterns) │
│ - Llama-3-8B (or Qwen2-7B) base │
│ - QLoRA r=16, alpha=32, all linears │
│ - 2-3 epochs, lr=2e-4, packing, paged AdamW │
│ - Output: model_sft (adapter + merged BF16) │
└────────────────────┬───────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────┐
│ Stage 3: Preference Data Generation │
│ - For each prompt, sample 2-4 completions from model_sft │
│ - Score with judge model OR human preferences │
│ - Build (prompt, chosen, rejected) triples (2k-5k) │
│ - Output: dpo.jsonl │
└────────────────────┬───────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────┐
│ Stage 4: DPO with TRL │
│ - Initialize from model_sft │
│ - β=0.1 (KL strength), lr=5e-7, 1-2 epochs │
│ - Output: model_dpo │
└────────────────────┬───────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────┐
│ Stage 5: Evaluation │
│ - Win-rate: model_dpo vs base, judged by GPT-4 │
│ - Win-rate: model_dpo vs model_sft │
│ - MMLU 5-shot (alignment tax) │
│ - Domain-specific eval (e.g., MedQA for medical) │
│ - Output: EVAL_REPORT.md │
└────────────────────────────────────────────────────────────┘
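Stage 3 of the diagram (sample several completions, score them, keep one chosen/rejected pair) might look like the sketch below. The `score` callable stands in for a judge model or human rating, and `min_margin` is an illustrative filter that drops prompts where the best and worst completions are too close to give a useful preference signal. Both are assumptions, not part of the original spec.

```python
import json
from typing import Callable, Optional, Sequence


def build_pair(prompt: str, completions: Sequence[str],
               score: Callable[[str, str], float],
               min_margin: float = 1.0) -> Optional[dict]:
    """Return a {prompt, chosen, rejected} record, or None if unusable."""
    # Drop duplicate samples while keeping the original order.
    unique = list(dict.fromkeys(completions))
    if len(unique) < 2:
        return None
    # Score each completion once, then sort from worst to best.
    scored = sorted(((score(prompt, c), c) for c in unique), key=lambda t: t[0])
    (low_score, worst), (high_score, best) = scored[0], scored[-1]
    # Skip near-ties: they add noise rather than preference signal.
    if high_score - low_score < min_margin:
        return None
    return {"prompt": prompt, "chosen": best, "rejected": worst}


def write_jsonl(records: Sequence[dict], path: str) -> None:
    """Write one JSON object per line, e.g. to dpo.jsonl."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Keeping only the best and worst sample discards the middle of the ranking; that trades data volume for a cleaner margin between chosen and rejected.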
Suggested Stack
| Component | Choice |
|---|---|
| Base | meta-llama/Meta-Llama-3-8B or Qwen/Qwen2-7B |
| SFT/DPO framework | trl (SFTTrainer, DPOTrainer) |
| PEFT | peft (LoRA, QLoRA) |
| Quantization | bitsandbytes (NF4 + double quant) |
| Synthetic data | OpenAI GPT-4 / Anthropic Claude as a teacher |
| Inference | vLLM (for sampling completions during data gen) |
| Eval judge | GPT-4-turbo or Claude 3.5 Sonnet |
| MMLU eval | lm-evaluation-harness |
| Tracking | Weights & Biases |
| Deploy | Docker + vLLM server |
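A back-of-the-envelope estimate helps explain why the stack pairs NF4 quantization with LoRA adapters. The sketch below is arithmetic only. The layer shapes are the commonly cited Llama-3-8B dimensions, the parameter count is rounded to 8e9, and the block sizes (64 weights per quantization block, 256 blocks per second-level block) follow the QLoRA paper's description; treat all of them as assumptions to check against your actual model and library version. It counts weights only and ignores activations, KV cache, and optimizer state.

```python
# Assumed Llama-3-8B shapes: hidden 4096, MLP 14336, 32 layers,
# 8 KV heads of dim 128 (so K/V projections map 4096 -> 1024).
HIDDEN, MLP, LAYERS, KV_DIM = 4096, 14336, 32, 1024
N_PARAMS = 8.0e9  # rounded parameter count, an assumption

# (in_features, out_features) of each linear layer in one transformer block.
LINEAR_SHAPES = [
    (HIDDEN, HIDDEN),  # q_proj
    (HIDDEN, KV_DIM),  # k_proj
    (HIDDEN, KV_DIM),  # v_proj
    (HIDDEN, HIDDEN),  # o_proj
    (HIDDEN, MLP),     # gate_proj
    (HIDDEN, MLP),     # up_proj
    (MLP, HIDDEN),     # down_proj
]


def lora_param_count(rank: int) -> int:
    """LoRA adds A (rank x in) and B (out x rank) per adapted linear layer."""
    per_block = sum(rank * (i + o) for i, o in LINEAR_SHAPES)
    return per_block * LAYERS


def nf4_bits_per_param(block: int = 64, double_quant: bool = True,
                       dq_block: int = 256) -> float:
    """4 bits per weight plus the amortized cost of per-block constants."""
    if double_quant:
        # 8-bit constant per block, plus one FP32 constant per dq_block blocks.
        overhead = 8 / block + 32 / (block * dq_block)
    else:
        overhead = 32 / block  # one FP32 absmax constant per block
    return 4 + overhead


bf16_gb = N_PARAMS * 2 / 1e9                        # 2 bytes per weight
nf4_gb = N_PARAMS * nf4_bits_per_param() / 8 / 1e9  # bits -> bytes -> GB
adapter_params = lora_param_count(16)               # r=16 on all linears

print(f"BF16 weights: {bf16_gb:.1f} GB, NF4 weights: {nf4_gb:.2f} GB")
print(f"Trainable LoRA params at r=16: {adapter_params / 1e6:.1f}M")
```

Under these assumptions the frozen weights shrink to roughly a quarter of their BF16 size, and the trainable adapters are well under 1% of the model.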
Deliverables Checklist
- data/seed_prompts.json: your curated 50-200 seed examples
- data/gen_sft.py: synthetic SFT generator (with rate-limiting + dedup)
- data/sft.jsonl: final SFT dataset (5k-20k examples)
- data/gen_dpo.py: preference-pair generator
- data/dpo.jsonl: final DPO dataset (2k-5k triples)
- train/sft.py: QLoRA SFT runner
- train/dpo.py: DPO runner
- eval/winrate.py: LLM-as-judge win-rate eval
- eval/mmlu.py: alignment-tax measurement
- eval/domain.py: domain-specific benchmark
- EVAL_REPORT.md: table of base / sft / dpo on (win-rate, MMLU, domain-bench)
- MODEL_CARD.md: domain, intended use, limitations, training data composition, alignment tax
- Dockerfile + serve.sh: vLLM-based inference container
- WRITEUP.md: what worked, what didn't, judge-model bias observations
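The dedup step expected inside data/gen_sft.py could start from something like the following. This is a sketch under stated assumptions: it uses exact Jaccard similarity on word trigrams instead of MinHash, which is simpler to read but quadratic in the number of kept examples. At 5k-20k examples you would likely swap the inner comparison for a MinHash/LSH index. The threshold and length limits are illustrative values, not tuned ones.

```python
import re
from typing import List, Set, Tuple

Shingle = Tuple[str, ...]


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def shingles(norm_text: str, n: int = 3) -> Set[Shingle]:
    """Set of word n-grams; short texts become a single shingle."""
    words = norm_text.split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[Shingle], b: Set[Shingle]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def dedup(texts: List[str], threshold: float = 0.7,
          min_chars: int = 10, max_chars: int = 2000) -> List[str]:
    """Length filter, then exact-match dedup, then near-duplicate dedup."""
    kept: List[str] = []
    kept_shingles: List[Set[Shingle]] = []
    seen_exact: Set[str] = set()
    for text in texts:
        norm = normalize(text)
        if not (min_chars <= len(norm) <= max_chars):
            continue  # length filter
        if norm in seen_exact:
            continue  # exact duplicate after normalization
        sh = shingles(norm)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of something already kept
        seen_exact.add(norm)
        kept.append(text)
        kept_shingles.append(sh)
    return kept
```

The first occurrence of each cluster wins, so shuffling the input before dedup avoids systematically keeping the teacher's earliest phrasing.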
Resume Bullet Pattern
Aligned Llama-3-8B to [domain] via QLoRA SFT (12k synthetic examples) + DPO (3k preference pairs); achieved 71% win rate vs base on GPT-4-judged eval with only 1.8-point MMLU degradation (alignment tax). Shipped as vLLM Docker container. [model + report]
Interview Talking Points
- SFT vs DPO vs PPO: derivation of DPO's closed-form loss; why it sidesteps the separate reward model and on-policy RL loop that PPO-based RLHF requires.
- The DPO loss: −log σ(β · [log(π_θ(y_w|x) / π_ref(y_w|x)) − log(π_θ(y_l|x) / π_ref(y_l|x))]). Be ready to whiteboard.
- Synthetic data quality: dedup, diversity (n-gram coverage), avoiding teacher's stylistic tics.
- Judge-model bias: position bias (judges prefer the first response), length bias (judges prefer longer), self-preference (GPT-4 prefers GPT-4-style). Mitigations: random ordering, length normalization, multi-judge ensemble.
- Alignment tax: why MMLU drops after SFT/DPO; mitigations (replay buffer, mixing in pretraining data).
- β in DPO: high β keeps the policy close to the reference (less reward, less distortion); low β maximizes the preference signal at the risk of over-optimization and mode collapse.
- Why QLoRA for both stages: memory; modular adapters; can A/B test merges.
- What you'd do at 100k preference pairs: switch to PPO, or use a learned reward model + DPO/IPO.
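For whiteboard practice, the DPO loss from the talking points can be written out for a single preference pair in plain Python. This is a sketch for intuition only: the inputs are assumed to be summed token log-probabilities log π(y|x) that you have already computed, and the function names are illustrative. A real training run would rely on a library such as TRL and not on this code.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen: float, policy_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """-log sigmoid(beta * margin) for one (prompt, chosen, rejected) triple.

    Each argument is a sequence log-probability log pi(y|x).
    """
    # Log-ratio of policy to reference for each completion.
    chosen_logratio = policy_chosen - ref_chosen
    rejected_logratio = policy_rejected - ref_rejected
    margin = chosen_logratio - rejected_logratio
    # -log sigmoid(z) == log(1 + exp(-z)) == softplus(-z)
    return softplus(-beta * margin)


# Policy identical to the reference: margin is 0, loss is log 2.
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Policy has shifted probability toward the chosen completion: loss drops.
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))
```

At initialization the policy equals the reference, so every pair starts at log 2 ≈ 0.693. A training loss that stays there indicates the policy is not moving away from the reference.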
Getting Started
- Pick the domain carefully. You need to be able to evaluate it. "Better at customer support" is hard to judge; "passes more medical-fact questions" is concrete.
- Curate seed prompts — 50–200, diverse, covering the range of intents.
- Run Phase-6 lab-02 first to confirm your QLoRA pipeline works on a small sample.
- Generate SFT data with the teacher model. Implement: rate limiting, JSON-structured outputs, exact-match dedup, near-dup MinHash dedup, length filter.
- Train SFT. Validate qualitatively on 20 held-out prompts before scaling.
- Generate DPO pairs: sample 4 completions from the SFT model per prompt; have judge rank them; keep best-and-worst as (chosen, rejected).
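- Train DPO. β=0.1 and lr=5e-7 are common starting points (5e-7 comes from full fine-tuning recipes; LoRA adapters often need a higher learning rate, so treat it as a value to sweep). Sweep β ∈ {0.01, 0.1, 0.5} if budget allows.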
- Eval. Win-rate vs base + win-rate vs SFT-only + MMLU + domain bench. The win-rate vs SFT-only tells you if DPO is actually adding signal.
- Containerize with vLLM. Test that curl http://localhost:8000/v1/completions (as a POST with a JSON body) works end-to-end.
- Write the report. Honest. Document the failures; that's what hiring managers want to see.
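The end-to-end container check can also be scripted. The sketch below uses only the standard library and assumes the server exposes an OpenAI-compatible completions endpoint at the URL from the checklist. The model name `model_dpo` is a placeholder that must match whatever name the server was started with, and the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(prompt: str, model: str,
                             base_url: str = "http://localhost:8000",
                             max_tokens: int = 128,
                             temperature: float = 0.0) -> urllib.request.Request:
    """Build the POST request without sending it (easy to unit-test)."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt: str, model: str, **kwargs) -> str:
    """Send the request and return the first completion's text."""
    req = build_completion_request(prompt, model, **kwargs)
    with urllib.request.urlopen(req, timeout=60) as resp:
        body = json.load(resp)
    # OpenAI-style response: a list of choices, each with a "text" field.
    return body["choices"][0]["text"]


if __name__ == "__main__":
    # Requires the container to be running; "model_dpo" is a placeholder.
    print(complete("List two contraindications for ibuprofen.", "model_dpo"))
```

Running this against the container from a clean shell is a reasonable smoke test before writing the report.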