Phase 6 — Fine-tuning, Instruction Tuning, Preference Optimization

Difficulty: ⭐⭐⭐⭐☆ | Estimated Time: 2.5 weeks Roles supported: Post-training Engineer, Production Model Post-Training (Anthropic-style), Applied AI Engineer.


Why This Phase Exists

The frontier-lab post-training stack — SFT → reward model → preference optimization — is what turns a base LM into Claude / ChatGPT / Gemini. Anthropic's "Production Model Post-Training" role explicitly asks for hands-on experience with this exact pipeline.

You will fine-tune a real 7B model on a single 24 GB GPU using QLoRA, then run DPO with a preference dataset, and produce a quantitative before/after eval.


Concepts

  • Pretraining vs SFT vs preference optimization
  • Chat templates (ChatML, Llama-3, Mistral) — and why they matter
  • Loss masking on prompt tokens
  • LoRA: low-rank adapters, math, parameter savings (A ∈ R^{d×r}, B ∈ R^{r×d})
  • QLoRA: 4-bit base + LoRA on top, NF4 quantization, double quantization
  • PEFT library mechanics
  • Reward modeling: pairwise loss, Bradley-Terry assumption
  • RLHF / PPO conceptual flow (without implementing PPO end-to-end)
  • DPO derivation from RLHF objective
  • IPO, KTO, ORPO — the DPO family
  • RLAIF (AI feedback) and Constitutional AI overview
  • Catastrophic forgetting & mitigation

Labs

Lab 01 — Supervised Fine-Tuning (SFT) on Instruction Data

FieldValue
GoalFine-tune a small base model (e.g., Qwen2-0.5B or Phi-3-mini) on an instruction dataset.
ConceptsChat templates, prompt-response loss masking, padding strategies, eval during training.
Steps1) Load Qwen2-0.5B base. 2) Load databricks/databricks-dolly-15k or OpenAssistant/oasst1. 3) Apply chat template. 4) Mask loss on prompt tokens. 5) Train 1–2 epochs with HF Trainer. 6) Eval on held-out instructions qualitatively + with MT-Bench-lite.
StackHF transformers, datasets, trl.SFTTrainer, W&B
Datasetsdolly-15k (15k examples), oasst1, alpaca-cleaned
OutputA fine-tuned checkpoint that follows instructions noticeably better than the base.
How to TestSide-by-side generation on 20 held-out prompts; manual rating + MT-Bench-lite.
Talking PointsWhy mask loss on prompt tokens. Why chat templates matter (token-level boundary marking). Catastrophic-forgetting risk.
Resume Bullet"Performed supervised fine-tuning of Qwen2-0.5B on dolly-15k with chat-template-correct loss masking; lifted instruction-following win rate vs base from 23% to 71% on a 50-prompt human eval."
ExtensionsAdd domain-specific synthetic data (preview of Capstone 4).

Lab 02 — LoRA & QLoRA on a 7B Model (Single GPU)

FieldValue
GoalFine-tune Llama-3-8B or Qwen2-7B on a single 24 GB GPU using QLoRA.
ConceptsLoRA decomposition ΔW = BA, rank/alpha selection, target modules (q_proj, v_proj, o_proj, MLP), NF4 quantization, paged optimizers.
Steps1) Load 7B base in 4-bit (BitsAndBytesConfig NF4). 2) Wrap with LoraConfig (r=16, alpha=32). 3) Train on a domain dataset (legal Q&A, code, medical — your choice). 4) Save adapter (only ~50 MB). 5) Merge + reload for inference. 6) Compare param-count overhead.
Stacktransformers, peft, trl, bitsandbytes, accelerate
DatasetsPick a domain — nvidia/HelpSteer2 for general; code_alpaca_20k for code; etc.
OutputLoRA adapter, merged model, before/after generation comparison.
How to TestVRAM stays under 22 GB during training; perplexity improves on held-out domain data.
Talking PointsLoRA math (rank decomposition reduces params from to 2dr). Why QLoRA = 4-bit base + 16-bit adapters. When to use higher rank. Why NF4 > FP4.
Resume Bullet"Fine-tuned Llama-3-8B with QLoRA (NF4 + LoRA r=16) on a 24 GB consumer GPU, training only 0.18% of parameters; achieved 14% perplexity reduction on held-out domain data with 52 MB adapter footprint."
ExtensionsTry LoRA+ (different LR for B vs A); try DoRA (decomposed LoRA).

Lab 03 — Building an Instruction Dataset (Synthetic + Curated)

FieldValue
GoalBuild a 5k-example domain instruction dataset with synthetic generation + filtering.
ConceptsSelf-Instruct, Evol-Instruct, distillation from a stronger model, dedup, quality filtering, contamination checks.
Steps1) Seed with 50 hand-written examples. 2) Use a stronger model (Claude / GPT-4 / open Llama-3-70B via Together) to generate variations. 3) Dedup via MinHash or embedding similarity. 4) Filter by length / language / quality heuristics. 5) Output JSONL with {instruction, input, output}.
StackOpenAI / Anthropic API or Together AI, datasketch, sentence-transformers
DatasetsYour own seed
OutputA 5k-example JSONL with a quality report.
How to TestManual rating on a 50-example sample; downstream Lab 02 finetune improves vs baseline data.
Talking PointsSynthetic-data risks (mode collapse, model bias inheritance). Why dedup matters. License implications of distillation.
Resume Bullet"Built a 5k-example domain-specific instruction dataset via self-instruct + MinHash dedup + length/quality filters; downstream SFT showed 9-point lift over a generic dataset baseline."
ExtensionsAdd diversity-driven sampling (cluster + sample); contamination check against eval sets.

Lab 04 — Reward Modeling + DPO Preference Optimization

FieldValue
GoalRun DPO on a preference dataset; understand its derivation from RLHF.
ConceptsReward modeling (pairwise loss), Bradley-Terry, DPO loss derivation, β hyperparameter, reference model.
Steps1) (Conceptual) Implement reward-model pairwise loss in 20 lines. 2) Use trl.DPOTrainer. 3) Load Anthropic/hh-rlhf or Intel/orca_dpo_pairs. 4) Run DPO on the SFT model from Lab 1. 5) Eval before/after on a preference test set + MT-Bench-lite.
Stacktrl.DPOTrainer, transformers, peft
DatasetsAnthropic/hh-rlhf, argilla/distilabel-intel-orca-dpo-pairs, HuggingFaceH4/ultrafeedback_binarized
OutputA DPO-trained model with measurable preference-win-rate improvement.
How to TestPairwise win rate vs SFT baseline > 60% on held-out preference pairs.
Talking PointsWhy DPO doesn't need a separate reward model (closed-form policy from BT preferences). β controls deviation from reference. Why DPO is more stable than PPO. Compare DPO vs IPO vs KTO.
Resume Bullet"Implemented DPO preference optimization on a Qwen2-SFT checkpoint using HH-RLHF; achieved 67% pairwise win-rate vs SFT baseline on held-out preferences with β=0.1 and a 4× lower compute footprint than PPO."
ExtensionsTry IPO (handles preference noise); try KTO (works with unpaired data); analyze reward hacking.

Deliverables Checklist

  • SFT-trained small model with eval comparison
  • QLoRA fine-tune of 7B on 24 GB GPU
  • 5k-example synthetic instruction dataset
  • DPO-trained model with preference win-rate report

Interview Relevance

  • "Compare SFT, RLHF, DPO"
  • "Walk through LoRA math"
  • "Why does QLoRA work? What's NF4?"
  • "Derive the DPO loss"
  • "How would you build a preference dataset?"