Capstone 04 — Domain Assistant via SFT + DPO

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 2–3 weeks

Demonstrates the full alignment pipeline: synthetic data generation → SFT → DPO → eval. The skill set behind every "we fine-tuned Llama for X" startup.


Goals

  1. Pick a domain (medical Q&A, legal summarization, code review, customer support, etc.).
  2. Generate or curate 5k–20k SFT examples + 2k–5k DPO preference pairs.
  3. Run SFT on a 7B–8B base model with QLoRA.
  4. DPO on top of the SFT model with the preference pairs.
  5. Evaluate win-rate vs the base model via LLM-as-judge, plus retained-capability scores (MMLU) to measure the alignment tax.
  6. Ship the model + eval report + Docker for inference.
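
Goal 5 can be made concrete with a small scoring sketch. It assumes a hypothetical `judge(prompt, first, second)` callable that returns "first", "second", or "tie"; in practice this would wrap a call to the judge model. Each comparison is run in both orders, and a win or loss is only counted when both orderings agree. The function names and the tie-counts-as-half convention are illustrative choices, not part of any library.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical judge signature: (prompt, first_response, second_response)
# -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def pairwise_outcome(judge: Judge, prompt: str, ours: str, theirs: str) -> float:
    """Return 1.0 for a win, 0.0 for a loss, 0.5 for a tie or inconsistent verdicts."""
    verdict_ours_first = judge(prompt, ours, theirs)
    verdict_ours_second = judge(prompt, theirs, ours)
    if verdict_ours_first == "first" and verdict_ours_second == "second":
        return 1.0
    if verdict_ours_first == "second" and verdict_ours_second == "first":
        return 0.0
    # The judge tied, or flipped its answer when the order changed.
    return 0.5


def win_rate(judge: Judge, examples: Iterable[Tuple[str, str, str]]) -> float:
    """examples: (prompt, our_response, baseline_response) tuples."""
    outcomes = [pairwise_outcome(judge, p, ours, theirs) for p, ours, theirs in examples]
    if not outcomes:
        raise ValueError("no examples to score")
    return sum(outcomes) / len(outcomes)


def alignment_tax(base_score: float, tuned_score: float) -> float:
    """Points lost on a retained benchmark such as MMLU (positive = regression)."""
    return base_score - tuned_score
```

A judge that always answers "first" scores 0.5 on every pair here, so pure position bias cannot inflate the win-rate.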

Architecture

   ┌────────────────────────────────────────────────────────────┐
   │ Stage 1: Synthetic Data Generation                         │
   │  - Seed prompts (curated by you, 50-200 examples)         │
   │  - Generate variations with a strong model (GPT-4 / Claude)│
   │  - Self-Instruct loop or domain-specific templates         │
   │  - Output: sft.jsonl (5k-20k {prompt, completion} pairs)  │
   └────────────────────┬───────────────────────────────────────┘
                        ▼
   ┌────────────────────────────────────────────────────────────┐
   │ Stage 2: SFT with QLoRA (Phase-6 lab-02 patterns)          │
   │  - Llama-3-8B (or Qwen2-7B) base                           │
   │  - QLoRA r=16, alpha=32, all linears                       │
   │  - 2-3 epochs, lr=2e-4, packing, paged AdamW               │
   │  - Output: model_sft (adapter + merged BF16)               │
   └────────────────────┬───────────────────────────────────────┘
                        ▼
   ┌────────────────────────────────────────────────────────────┐
   │ Stage 3: Preference Data Generation                        │
   │  - For each prompt, sample 2-4 completions from model_sft  │
   │  - Score with judge model OR human preferences             │
   │  - Build (prompt, chosen, rejected) triples (2k-5k)        │
   │  - Output: dpo.jsonl                                       │
   └────────────────────┬───────────────────────────────────────┘
                        ▼
   ┌────────────────────────────────────────────────────────────┐
   │ Stage 4: DPO with TRL                                      │
   │  - Initialize from model_sft                               │
   │  - β=0.1 (KL strength), lr=5e-7, 1-2 epochs                │
   │  - Output: model_dpo                                       │
   └────────────────────┬───────────────────────────────────────┘
                        ▼
   ┌────────────────────────────────────────────────────────────┐
   │ Stage 5: Evaluation                                        │
   │  - Win-rate: model_dpo vs base, judged by GPT-4            │
   │  - Win-rate: model_dpo vs model_sft                        │
   │  - MMLU 5-shot (alignment tax)                             │
   │  - Domain-specific eval (e.g., MedQA for medical)          │
   │  - Output: EVAL_REPORT.md                                  │
   └────────────────────────────────────────────────────────────┘
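
Stage 3 boils down to picking a best and a worst completion per prompt. The sketch below assumes the judge scores are already attached to each sampled completion. The `min_margin` filter (dropping pairs whose scores are too close to carry a reliable signal) and the helper names are assumptions added for illustration; the `prompt` / `chosen` / `rejected` keys follow the triple described in the diagram.

```python
import json
from typing import Dict, Iterable, List, Optional, Tuple


def build_triple(
    prompt: str,
    scored: List[Tuple[str, float]],
    min_margin: float = 1.0,
) -> Optional[Dict[str, str]]:
    """Turn (completion, judge_score) pairs for one prompt into a DPO triple.

    Returns None when there are fewer than two completions, when best and
    worst are the same text, or when their score gap is below min_margin.
    """
    if len(scored) < 2:
        return None
    best = max(scored, key=lambda item: item[1])
    worst = min(scored, key=lambda item: item[1])
    if best[0] == worst[0] or best[1] - worst[1] < min_margin:
        return None
    return {"prompt": prompt, "chosen": best[0], "rejected": worst[0]}


def to_jsonl(triples: Iterable[Dict[str, str]]) -> str:
    """Serialize triples as one JSON object per line (the dpo.jsonl layout)."""
    return "\n".join(json.dumps(t, ensure_ascii=False) for t in triples)
```

Keeping only best-versus-worst discards the middle samples; that trades data volume for a cleaner preference signal.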

Suggested Stack

  Component            Choice
  Base                 meta-llama/Meta-Llama-3-8B or Qwen/Qwen2-7B
  SFT/DPO framework    trl (SFTTrainer, DPOTrainer)
  PEFT                 peft (LoRA, QLoRA)
  Quantization         bitsandbytes (NF4 + double quant)
  Synthetic data       OpenAI GPT-4 / Anthropic Claude as a teacher
  Inference            vLLM (for sampling completions during data gen)
  Eval judge           GPT-4-turbo or Claude 3.5 Sonnet
  MMLU eval            lm-evaluation-harness
  Tracking             Weights & Biases
  Deploy               Docker + vLLM server

Deliverables Checklist

  • data/seed_prompts.json — your curated 50-200 seed examples
  • data/gen_sft.py — synthetic SFT generator (with rate-limiting + dedup)
  • data/sft.jsonl — final SFT dataset (5k-20k examples)
  • data/gen_dpo.py — preference-pair generator
  • data/dpo.jsonl — final DPO dataset (2k-5k triples)
  • train/sft.py — QLoRA SFT runner
  • train/dpo.py — DPO runner
  • eval/winrate.py — LLM-as-judge win-rate eval
  • eval/mmlu.py — alignment-tax measurement
  • eval/domain.py — domain-specific benchmark
  • EVAL_REPORT.md — table: base / sft / dpo on (winrate, MMLU, domain-bench)
  • MODEL_CARD.md — domain, intended use, limitations, training data composition, alignment-tax
  • Dockerfile + serve.sh — vLLM-based inference container
  • WRITEUP.md — what worked, what didn't, judge-model bias observations
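
The dedup and length-filter part of data/gen_sft.py might look like the standard-library sketch below. It is a toy MinHash over word 3-grams with a pairwise comparison, which is quadratic in the number of kept examples; at the 5k-20k scale an LSH index would be the usual upgrade. All function names, the 0.8 similarity threshold, and the word-count bounds are assumptions.

```python
import hashlib
import re
from typing import Iterable, List, Set, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def shingles(text: str, n: int = 3) -> Set[str]:
    """Set of word n-grams; short texts fall back to a single shingle."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def _hash(seed: int, shingle: str) -> int:
    digest = hashlib.blake2b(f"{seed}:{shingle}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash(text: str, num_perm: int = 64) -> Tuple[int, ...]:
    """One minimum per seeded hash function; similar sets share many minima."""
    sh = shingles(text)
    return tuple(min(_hash(seed, s) for s in sh) for seed in range(num_perm))


def est_jaccard(sig_a: Tuple[int, ...], sig_b: Tuple[int, ...]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def dedup(
    texts: Iterable[str],
    threshold: float = 0.8,
    min_words: int = 3,
    max_words: int = 512,
) -> List[str]:
    """Length filter, then exact-match dedup, then near-duplicate dedup."""
    kept: List[str] = []
    seen_exact: Set[str] = set()
    signatures: List[Tuple[int, ...]] = []
    for text in texts:
        norm = normalize(text)
        if not (min_words <= len(norm.split()) <= max_words):
            continue
        if norm in seen_exact:
            continue
        sig = minhash(text)
        if any(est_jaccard(sig, other) >= threshold for other in signatures):
            continue
        seen_exact.add(norm)
        signatures.append(sig)
        kept.append(text)
    return kept
```

The first occurrence of each cluster wins, so shuffle or sort the input by quality before calling `dedup` if order matters.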

Resume Bullet Pattern

Aligned Llama-3-8B to [domain] via QLoRA SFT (12k synthetic examples) + DPO (3k preference pairs); achieved 71% win rate vs base on GPT-4-judged eval with only 1.8-point MMLU degradation (alignment tax). Shipped as vLLM Docker container. [model + report]


Interview Talking Points

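  • SFT vs DPO vs PPO: derivation of DPO's closed-form loss; why it sidesteps the separate reward model and on-policy sampling that PPO-based RLHF requires.
  • The DPO loss: L_DPO = −E_(x, y_w, y_l)[ log σ( β · ( log(π_θ(y_w|x) / π_ref(y_w|x)) − log(π_θ(y_l|x) / π_ref(y_l|x)) ) ) ], where y_w is the chosen completion and y_l the rejected one. Be ready to whiteboard.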
  • Synthetic data quality: dedup, diversity (n-gram coverage), avoiding teacher's stylistic tics.
  • Judge-model bias: position bias (judges prefer the first response), length bias (judges prefer longer), self-preference (GPT-4 prefers GPT-4-style). Mitigations: random ordering, length normalization, multi-judge ensemble.
  • Alignment tax: why MMLU drops after SFT/DPO; mitigations (replay buffer, mixing in pretraining data).
  • β in DPO: a high β penalizes divergence from the reference more strongly (smaller preference gains, less drift); a low β lets the policy chase the preference signal at the risk of degenerate or collapsed outputs.
  • Why QLoRA for both stages: memory; modular adapters; can A/B test merges.
  • What you'd do at 100k preference pairs: switch to PPO, or use a learned reward model + DPO/IPO.
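
For whiteboard practice, the per-example DPO loss can be checked numerically. This is a minimal sketch, assuming each argument is a sequence-level log-probability (the sum of token log-probs for that completion); the function and argument names are illustrative and not taken from TRL.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

When the policy equals the reference, both log-ratios are zero and the loss is log 2 ≈ 0.693, which is a handy sanity check for the first logged training step.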

Getting Started

  1. Pick the domain carefully. You need to be able to evaluate it. "Better at customer support" is hard to judge; "passes more medical-fact questions" is concrete.
  2. Curate seed prompts — 50–200, diverse, covering the range of intents.
  3. Run Phase-6 lab-02 first to confirm your QLoRA pipeline works on a small sample.
  4. Generate SFT data with the teacher model. Implement: rate limiting, JSON-structured outputs, exact-match dedup, near-dup MinHash dedup, length filter.
  5. Train SFT. Validate qualitatively on 20 held-out prompts before scaling.
  6. Generate DPO pairs: sample 4 completions from the SFT model per prompt; have judge rank them; keep best-and-worst as (chosen, rejected).
  7. Train DPO. β=0.1 and lr=5e-7 are common starting points rather than fixed rules; sweep β ∈ {0.01, 0.1, 0.5} if budget allows, and consider a higher lr if the LoRA adapter barely moves.
  8. Eval. Win-rate vs base + win-rate vs SFT-only + MMLU + domain bench. The win-rate vs SFT-only tells you if DPO is actually adding signal.
  9. Containerize with vLLM. Test that a POST request (e.g. with curl) to http://localhost:8000/v1/completions works end-to-end.
  10. Write the report. Honest. Document the failures — that's what hiring managers want to see.
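
The end-to-end check in step 9 can also be scripted. The sketch below builds a POST request for the OpenAI-compatible completions route using only the standard library. The served model name `model_dpo` is an assumption and must match whatever name the container was launched with; the helper names are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str,
    model: str,
    prompt: str,
    max_tokens: int = 64,
) -> urllib.request.Request:
    """Build (but do not send) a POST to <base_url>/v1/completions."""
    payload = {
        "model": model,          # assumed to match the served model name
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,      # deterministic output for a smoke test
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request and return the first completion's text."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


# Example, with the container running locally:
#   req = build_completion_request("http://localhost:8000", "model_dpo", "Hello")
#   print(smoke_test(req))
```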