Capstone 04 — Domain Assistant via SFT + DPO

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 2–3 weeks

Demonstrates the full alignment pipeline: synthetic data generation → SFT → DPO → eval. The skill set behind every "we fine-tuned Llama for X" startup.

Goals

Pick a domain (medical Q&A, legal summarization, code review, customer support, etc.).
Generate or curate 5k–20k SFT examples + 2k–5k DPO preference pairs.
SFT a 7B–8B base model (e.g., Llama-3-8B or Qwen2-7B) with QLoRA.
DPO on top of the SFT model with the preference pairs.
Evaluate win-rate vs the base model via LLM-as-judge, plus a general-capability benchmark (MMLU) to measure the alignment tax.
Ship the model + eval report + Docker for inference.

Architecture

   ┌────────────────────────────────────────────────────────────┐
   │ Stage 1: Synthetic Data Generation                         │
   │  - Seed prompts (curated by you, 50-200 examples)         │
   │  - Generate variations with a strong model (GPT-4 / Claude)│
   │  - Self-Instruct loop or domain-specific templates         │
   │  - Output: sft.jsonl (5k-20k {prompt, completion} pairs)  │
   └────────────────────┬───────────────────────────────────────┘
                        ▼
   ┌────────────────────────────────────────────────────────────┐
   │ Stage 2: SFT with QLoRA (Phase-6 lab-02 patterns)          │
   │  - Llama-3-8B (or Qwen2-7B) base                           │
   │  - QLoRA r=16, alpha=32, all linears                       │
   │  - 2-3 epochs, lr=2e-4, packing, paged AdamW               │
   │  - Output: model_sft (adapter + merged BF16)               │
   └────────────────────┬───────────────────────────────────────┘
                        ▼
   ┌────────────────────────────────────────────────────────────┐
   │ Stage 3: Preference Data Generation                        │
   │  - For each prompt, sample 2-4 completions from model_sft  │
   │  - Score with judge model OR human preferences             │
   │  - Build (prompt, chosen, rejected) triples (2k-5k)        │
   │  - Output: dpo.jsonl                                       │
   └────────────────────┬───────────────────────────────────────┘
                        ▼
   ┌────────────────────────────────────────────────────────────┐
   │ Stage 4: DPO with TRL                                      │
   │  - Initialize from model_sft                               │
   │  - β=0.1 (KL strength), lr=5e-7, 1-2 epochs                │
   │  - Output: model_dpo                                       │
   └────────────────────┬───────────────────────────────────────┘
                        ▼
   ┌────────────────────────────────────────────────────────────┐
   │ Stage 5: Evaluation                                        │
   │  - Win-rate: model_dpo vs base, judged by GPT-4            │
   │  - Win-rate: model_dpo vs model_sft                        │
   │  - MMLU 5-shot (alignment tax)                             │
   │  - Domain-specific eval (e.g., MedQA for medical)          │
   │  - Output: EVAL_REPORT.md                                  │
   └────────────────────────────────────────────────────────────┘
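Stage 3 can be sketched in a few lines. This is a minimal illustration, not a prescribed implementation: it assumes the judge has already produced a numeric score for each sampled completion, and the function name `build_preference_pair` and the `min_margin` threshold are hypothetical.

```python
import json

def build_preference_pair(prompt, scored_completions, min_margin=1.0):
    """Return a {prompt, chosen, rejected} dict, or None if no clear pair exists.

    scored_completions: list of (completion_text, judge_score) tuples.
    min_margin is a hypothetical threshold: if the best and worst scores
    differ by less than this, the pair is dropped as uninformative.
    """
    if len(scored_completions) < 2:
        return None
    ranked = sorted(scored_completions, key=lambda item: item[1], reverse=True)
    best_text, best_score = ranked[0]
    worst_text, worst_score = ranked[-1]
    if best_text == worst_text or best_score - worst_score < min_margin:
        return None
    return {"prompt": prompt, "chosen": best_text, "rejected": worst_text}

# Toy usage: three sampled completions with made-up judge scores.
samples = [("Answer A", 8.0), ("Answer B", 3.0), ("Answer C", 6.5)]
pair = build_preference_pair("Example prompt", samples)
print(json.dumps(pair))  # one line of dpo.jsonl
```

Dropping low-margin pairs is a design choice rather than a requirement; it trades dataset size for a cleaner preference signal.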

Suggested Stack

Component	Choice
Base	`meta-llama/Meta-Llama-3-8B` or `Qwen/Qwen2-7B`
SFT/DPO framework	`trl` (`SFTTrainer`, `DPOTrainer`)
PEFT	`peft` (LoRA, QLoRA)
Quantization	`bitsandbytes` (NF4 + double quant)
Synthetic data	OpenAI GPT-4 / Anthropic Claude as a teacher
Inference	vLLM (for sampling completions during data gen)
Eval judge	GPT-4-turbo or Claude 3.5 Sonnet
MMLU eval	`lm-evaluation-harness`
Tracking	Weights & Biases
Deploy	Docker + vLLM server
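
As a rough starting point, the quantization and adapter settings named above might be configured as follows. This is a configuration sketch only, assuming recent releases of `transformers`, `peft`, and `bitsandbytes`; the dropout value and the BF16 compute dtype are assumptions not specified in this document.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization, as listed in the stack table.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # assumption: GPU with BF16 support
)

# LoRA adapter settings from Stage 2: r=16, alpha=32, all linear layers.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",  # shorthand supported in recent peft releases
    lora_dropout=0.05,            # assumption: not specified in this document
    task_type="CAUSAL_LM",
)
```

`bnb_config` would be passed as `quantization_config` when loading the base model, and `lora_config` as `peft_config` to the trainer.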

Deliverables Checklist

data/seed_prompts.json — your curated 50-200 seed examples
data/gen_sft.py — synthetic SFT generator (with rate-limiting + dedup)
data/sft.jsonl — final SFT dataset (5k-20k examples)
data/gen_dpo.py — preference-pair generator
data/dpo.jsonl — final DPO dataset (2k-5k triples)
train/sft.py — QLoRA SFT runner
train/dpo.py — DPO runner
eval/winrate.py — LLM-as-judge win-rate eval
eval/mmlu.py — alignment-tax measurement
eval/domain.py — domain-specific benchmark
EVAL_REPORT.md — table: base / sft / dpo on (winrate, MMLU, domain-bench)
MODEL_CARD.md — domain, intended use, limitations, training data composition, alignment-tax
Dockerfile + serve.sh — vLLM-based inference container
WRITEUP.md — what worked, what didn't, judge-model bias observations
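
The dedup step in data/gen_sft.py could look roughly like the sketch below. It uses exact Jaccard similarity over word 3-grams as a stand-in for MinHash (MinHash approximates the same similarity and scales better to large datasets); the function names and the 0.7 threshold are illustrative assumptions.

```python
import re

def normalize(text):
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()

def shingles(text, n=3):
    """Set of word n-grams; texts shorter than n words yield one shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dedup(prompts, threshold=0.7):
    """Keep the first occurrence; drop exact and near duplicates.

    Pairwise comparison is O(n^2), which is fine for a sketch but is
    exactly the cost that MinHash with LSH avoids at 5k-20k examples.
    """
    kept, kept_shingles, seen = [], [], set()
    for prompt in prompts:
        key = normalize(prompt)
        if key in seen:
            continue  # exact match after normalization
        sh = shingles(prompt)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near duplicate of something already kept
        seen.add(key)
        kept.append(prompt)
        kept_shingles.append(sh)
    return kept

prompts = [
    "Summarize the key risks in this contract clause.",
    "summarize the key  risks in this contract clause.",
    "Summarize the key risks in this contract clause please.",
    "List three follow-up questions for the customer.",
]
unique_prompts = dedup(prompts)
print(unique_prompts)
```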

Resume Bullet Pattern

Aligned Llama-3-8B to [domain] via QLoRA SFT (12k synthetic examples) + DPO (3k preference pairs); achieved 71% win rate vs base on GPT-4-judged eval with only 1.8-point MMLU degradation (alignment tax). Shipped as vLLM Docker container. [model + report]

Interview Talking Points

SFT vs DPO vs PPO: derivation of DPO's closed-form loss; why it sidesteps PPO's reward modeling.
The DPO loss: −log σ(β · [log(π_θ(y_w|x)/π_ref(y_w|x)) − log(π_θ(y_l|x)/π_ref(y_l|x))]), where y_w is the chosen completion and y_l the rejected one. Be ready to whiteboard.
Synthetic data quality: dedup, diversity (n-gram coverage), avoiding teacher's stylistic tics.
Judge-model bias: position bias (judges prefer the first response), length bias (judges prefer longer), self-preference (GPT-4 prefers GPT-4-style). Mitigations: random ordering, length normalization, multi-judge ensemble.
Alignment tax: why MMLU drops after SFT/DPO; mitigations (replay buffer, mixing in pretraining data).
β in DPO: high β keeps the policy close to the reference (smaller preference gains, less distortion); low β maximizes the preference signal at the risk of mode collapse.
Why QLoRA for both stages: memory; modular adapters; can A/B test merges.
What you'd do at 100k preference pairs: switch to PPO, or use a learned reward model + DPO/IPO.
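
To make the DPO loss concrete before whiteboarding it, here is a small numeric sketch for a single preference pair. It assumes you already have summed sequence log-probabilities under the policy and the frozen reference model; the function and argument names are illustrative, not TRL's API.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from sequence log-probs."""
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# At initialization the policy equals the reference, so the margin is 0
# and the loss is log(2).
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen log-prob and lowering the rejected one reduces the loss.
loss_after_update = dpo_loss(-9.0, -13.0, -10.0, -12.0)

print(loss_at_init, loss_after_update)
```

Because the margin is scaled by β, a larger β makes the loss react more strongly to the same change in log-ratios; this is the β trade-off discussed in the talking points.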

Getting Started

Pick the domain carefully. You need to be able to evaluate it. "Better at customer support" is hard to judge; "passes more medical-fact questions" is concrete.
Curate seed prompts — 50–200, diverse, covering the range of intents.
Run Phase-6 lab-02 first to confirm your QLoRA pipeline works on a small sample.
Generate SFT data with the teacher model. Implement: rate limiting, JSON-structured outputs, exact-match dedup, near-dup MinHash dedup, length filter.
Train SFT. Validate qualitatively on 20 held-out prompts before scaling.
Generate DPO pairs: sample 4 completions from the SFT model per prompt; have judge rank them; keep best-and-worst as (chosen, rejected).
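Train DPO. β=0.1 and lr=5e-7 are common starting points (the learning rate comes from full-parameter DPO recipes; LoRA adapters often need a higher one); sweep β ∈ {0.01, 0.1, 0.5} if budget allows.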
Eval. Win-rate vs base + win-rate vs SFT-only + MMLU + domain bench. The win-rate vs SFT-only tells you if DPO is actually adding signal.
Containerize with vLLM. Test that a POST request (e.g., via curl) to http://localhost:8000/v1/completions returns a completion end-to-end.
Write the report. Honest. Document the failures — that's what hiring managers want to see.
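
For the win-rate eval, one way to control for position bias is to judge every pair in both orders and count only consistent verdicts. The sketch below assumes a hypothetical `judge(prompt, first, second)` callable that returns "first", "second", or "tie"; in practice it would wrap a call to the judge model. The toy judges here exist only to make the example runnable.

```python
def debiased_winrate(examples, judge):
    """Win rate of a candidate model vs a baseline, judged in both orders.

    examples: list of (prompt, candidate_answer, baseline_answer).
    judge: hypothetical callable returning "first", "second", or "tie".
    Verdicts that flip when the order is swapped are counted as ties.
    """
    wins = losses = ties = 0
    for prompt, candidate, baseline in examples:
        forward = judge(prompt, candidate, baseline)   # candidate shown first
        backward = judge(prompt, baseline, candidate)  # candidate shown second
        if forward == "first" and backward == "second":
            wins += 1
        elif forward == "second" and backward == "first":
            losses += 1
        else:
            ties += 1
    total = wins + losses + ties
    win_rate = (wins + 0.5 * ties) / total if total else 0.0
    return {"wins": wins, "losses": losses, "ties": ties, "win_rate": win_rate}

def length_judge(prompt, first, second):
    """Toy stand-in judge: prefers the longer answer."""
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"

def first_position_judge(prompt, first, second):
    """Toy judge with pure position bias: always picks the first answer."""
    return "first"

examples = [
    ("p1", "a longer answer", "short"),
    ("p2", "a", "bbbb"),
    ("p3", "xx", "yy"),
]
consistent = debiased_winrate(examples, length_judge)
biased = debiased_winrate(examples, first_position_judge)
print(consistent, biased)
```

A purely position-biased judge ends up with every pair counted as a tie, so its preference for the first slot cannot inflate the reported win rate.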
