Phase 6 — Fine-tuning, Instruction Tuning, Preference Optimization

Difficulty: ⭐⭐⭐⭐☆ | Estimated Time: 2.5 weeks | Roles supported: Post-training Engineer, Production Model Post-Training (Anthropic-style), Applied AI Engineer.

Why This Phase Exists

The frontier-lab post-training stack — SFT → reward model → preference optimization — is what turns a base LM into Claude / ChatGPT / Gemini. Anthropic's "Production Model Post-Training" role explicitly asks for hands-on experience with this exact pipeline.

You will fine-tune a real 7B model on a single 24 GB GPU using QLoRA, then run DPO with a preference dataset, and produce a quantitative before/after eval.

Concepts

Pretraining vs SFT vs preference optimization
Chat templates (ChatML, Llama-3, Mistral) — and why they matter
Loss masking on prompt tokens
LoRA: low-rank adapters, math, parameter savings (ΔW = BA with B ∈ R^{d×r}, A ∈ R^{r×d})
QLoRA: 4-bit base + LoRA on top, NF4 quantization, double quantization
PEFT library mechanics
Reward modeling: pairwise loss, Bradley-Terry assumption
RLHF / PPO conceptual flow (without implementing PPO end-to-end)
DPO derivation from RLHF objective
IPO, KTO, ORPO — the DPO family
RLAIF (AI feedback) and Constitutional AI overview
Catastrophic forgetting & mitigation
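
For the "DPO derivation from RLHF objective" item above, the chain of steps is short enough to state in one place. This is a sketch in the commonly used notation, which is an assumption here and not something the labs define: π_θ is the policy being trained, π_ref the frozen reference, r the reward, β the KL weight, and (y_w, y_l) the chosen and rejected responses.

```latex
\begin{align*}
&\text{RLHF objective:} &&
\max_{\pi}\; \mathbb{E}_{x,\, y \sim \pi}\big[r(x,y)\big]
 - \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \\
&\text{Closed-form optimum:} &&
\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\big(r(x,y)/\beta\big) \\
&\text{Solve for the reward:} &&
r(x,y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
&\text{Bradley-Terry:} &&
p(y_w \succ y_l \mid x) = \sigma\big(r(x,y_w) - r(x,y_l)\big) \\
&\text{DPO loss:} &&
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[
 \log \sigma\!\left(
 \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
 - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
 \right)\right]
\end{align*}
```

The β log Z(x) term depends only on the prompt, so it cancels in the reward difference; that cancellation is what removes the need for an explicit reward model.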

Labs

Lab 01 — Supervised Fine-Tuning (SFT) on Instruction Data

Field	Value
Goal	Fine-tune a small base model (e.g., Qwen2-0.5B or Phi-3-mini) on an instruction dataset.
Concepts	Chat templates, prompt-response loss masking, padding strategies, eval during training.
Steps	1) Load Qwen2-0.5B base. 2) Load `databricks/databricks-dolly-15k` or `OpenAssistant/oasst1`. 3) Apply chat template. 4) Mask loss on prompt tokens. 5) Train 1–2 epochs with `trl.SFTTrainer`. 6) Eval on held-out instructions qualitatively + with MT-Bench-lite.
Stack	HF `transformers`, `datasets`, `trl.SFTTrainer`, W&B
Datasets	dolly-15k (15k examples), oasst1, alpaca-cleaned
Output	A fine-tuned checkpoint that follows instructions noticeably better than the base.
How to Test	Side-by-side generation on 20 held-out prompts; manual rating + MT-Bench-lite.
Talking Points	Why mask loss on prompt tokens. Why chat templates matter (token-level boundary marking). Catastrophic-forgetting risk.
Resume Bullet	"Performed supervised fine-tuning of Qwen2-0.5B on dolly-15k with chat-template-correct loss masking; lifted instruction-following win rate vs base from 23% to 71% on a 50-prompt human eval."
Extensions	Add domain-specific synthetic data (preview of Capstone 4).
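
The prompt-token loss masking in step 4 can be illustrated without any training library. The sketch below is a minimal illustration, not the lab's required implementation: the helper names (`build_masked_example`, `pad_batch`) and the toy token ids are hypothetical. It relies on one fact about the stack, namely that labels equal to -100 are skipped by PyTorch's cross-entropy loss by default.

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def build_masked_example(prompt_ids, response_ids, eos_id):
    """Concatenate prompt and response; supervise only the response tokens and EOS."""
    input_ids = list(prompt_ids) + list(response_ids) + [eos_id]
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids) + [eos_id]
    return {"input_ids": input_ids, "labels": labels}


def pad_batch(examples, pad_id):
    """Right-pad to the longest example; padded positions are masked out of the loss too."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    batch = {"input_ids": [], "labels": [], "attention_mask": []}
    for ex in examples:
        n_real = len(ex["input_ids"])
        n_pad = max_len - n_real
        batch["input_ids"].append(ex["input_ids"] + [pad_id] * n_pad)
        batch["labels"].append(ex["labels"] + [IGNORE_INDEX] * n_pad)
        batch["attention_mask"].append([1] * n_real + [0] * n_pad)
    return batch


# Toy token ids (hypothetical, not produced by a real tokenizer).
ex_a = build_masked_example(prompt_ids=[11, 12, 13], response_ids=[21, 22], eos_id=2)
ex_b = build_masked_example(prompt_ids=[11, 14], response_ids=[23], eos_id=2)
batch = pad_batch([ex_a, ex_b], pad_id=0)
```

In a real run the prompt/response boundary comes from the chat template, which is why a wrong template silently shifts which tokens are supervised.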

Lab 02 — LoRA & QLoRA on a 7B Model (Single GPU)

Field	Value
Goal	Fine-tune Llama-3-8B or Qwen2-7B on a single 24 GB GPU using QLoRA.
Concepts	LoRA decomposition `ΔW = BA`, rank/alpha selection, target modules (`q_proj`, `v_proj`, `o_proj`, MLP), NF4 quantization, paged optimizers.
Steps	1) Load the base model in 4-bit (`BitsAndBytesConfig` NF4). 2) Wrap with `LoraConfig` (r=16, alpha=32). 3) Train on a domain dataset (legal Q&A, code, or medical; your choice). 4) Save adapter (only ~50 MB). 5) Merge + reload for inference. 6) Compare trainable vs. total parameter counts.
Stack	`transformers`, `peft`, `trl`, `bitsandbytes`, `accelerate`
Datasets	Pick a domain — `nvidia/HelpSteer2` for general; `code_alpaca_20k` for code; etc.
Output	LoRA adapter, merged model, before/after generation comparison.
How to Test	VRAM stays under 22 GB during training; perplexity improves on held-out domain data.
Talking Points	LoRA math (rank decomposition reduces trainable params per `d×d` matrix from `d²` to `2dr`). Why QLoRA = 4-bit base + 16-bit adapters. When to use higher rank. Why NF4 tends to beat FP4 for roughly normally distributed weights.
Resume Bullet	"Fine-tuned Llama-3-8B with QLoRA (NF4 + LoRA r=16) on a 24 GB consumer GPU, training only 0.18% of parameters; achieved 14% perplexity reduction on held-out domain data with 52 MB adapter footprint."
Extensions	Try LoRA+ (different LR for B vs A); try DoRA (decomposed LoRA).
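
The `d²` to `2dr` saving from the talking points is easy to check by hand. The sketch below does the arithmetic for one square projection matrix using the lab's r=16 and alpha=32; the hidden size d=4096 is an assumed value for illustration, and `lora_param_counts` is a hypothetical helper, not a `peft` function.

```python
def lora_param_counts(d_in, d_out, r):
    """Return (full, lora) trainable-parameter counts for a single weight matrix."""
    full = d_in * d_out        # parameters in a dense update ΔW of shape d_out x d_in
    lora = r * (d_in + d_out)  # B is d_out x r, A is r x d_in, so ΔW = BA
    return full, lora


d, r, alpha = 4096, 16, 32  # d is an assumed hidden size; r and alpha follow the lab steps
full, lora = lora_param_counts(d, d, r)
ratio = lora / full         # equals 2r/d for a square matrix
scaling = alpha / r         # standard LoRA scales the update BA by alpha / r

print(f"full={full:,} lora={lora:,} ratio={ratio:.4%} scaling={scaling}")
```

For a square matrix the ratio is 2r/d, so doubling the rank doubles the adapter size while the frozen base stays untouched. The whole-model trainable fraction depends on which target modules are adapted.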

Lab 03 — Building an Instruction Dataset (Synthetic + Curated)

Field	Value
Goal	Build a 5k-example domain instruction dataset with synthetic generation + filtering.
Concepts	Self-Instruct, Evol-Instruct, distillation from a stronger model, dedup, quality filtering, contamination checks.
Steps	1) Seed with 50 hand-written examples. 2) Use a stronger model (Claude / GPT-4 / open Llama-3-70B via Together) to generate variations. 3) Dedup via MinHash or embedding similarity. 4) Filter by length / language / quality heuristics. 5) Output JSONL with `{instruction, input, output}`.
Stack	OpenAI / Anthropic API or Together AI, `datasketch`, `sentence-transformers`
Datasets	Your own seed
Output	A 5k-example JSONL with a quality report.
How to Test	Manual rating on a 50-example sample; downstream Lab 02 finetune improves vs baseline data.
Talking Points	Synthetic-data risks (mode collapse, model bias inheritance). Why dedup matters. License implications of distillation.
Resume Bullet	"Built a 5k-example domain-specific instruction dataset via self-instruct + MinHash dedup + length/quality filters; downstream SFT showed 9-point lift over a generic dataset baseline."
Extensions	Add diversity-driven sampling (cluster + sample); contamination check against eval sets.
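
Steps 3 and 4 (dedup, then heuristic filtering) can be prototyped with the standard library before reaching for `datasketch`. The sketch below swaps in exact Jaccard similarity over word trigrams as a stand-in for MinHash, which estimates the same quantity more cheaply at scale. The threshold, the word-count bounds, and the sample records are all assumptions chosen for illustration.

```python
import re


def shingles(text, n=3):
    """Set of lowercase word n-grams; texts shorter than n words become one shingle."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def dedup(records, threshold=0.7, min_words=3, max_words=200):
    """Apply a length filter, then keep a record only if it is not a near-duplicate
    of one already kept. Quadratic in the number of kept records."""
    kept, kept_shingles = [], []
    for rec in records:
        text = rec["instruction"]
        if not (min_words <= len(text.split()) <= max_words):
            continue  # length heuristic
        sh = shingles(text)
        if any(jaccard(sh, other) >= threshold for other in kept_shingles):
            continue  # near-duplicate of an earlier record
        kept.append(rec)
        kept_shingles.append(sh)
    return kept


# Hypothetical sample records in the lab's {instruction, input, output} schema.
records = [
    {"instruction": "Summarize the following contract clause in plain English.", "input": "", "output": "placeholder"},
    {"instruction": "Summarize the following contract clause in plain English!", "input": "", "output": "placeholder"},
    {"instruction": "List three risks of an indemnification clause.", "input": "", "output": "placeholder"},
    {"instruction": "Hi", "input": "", "output": "placeholder"},
]
clean = dedup(records)
```

Here the second record is dropped as a near-duplicate of the first and the last is dropped by the length filter. Pairwise comparison is workable at 5k examples, but it scales quadratically, which is the reason to move to MinHash with locality-sensitive hashing on larger sets.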

Lab 04 — Reward Modeling + DPO Preference Optimization

Field	Value
Goal	Run DPO on a preference dataset; understand its derivation from RLHF.
Concepts	Reward modeling (pairwise loss), Bradley-Terry, DPO loss derivation, β hyperparameter, reference model.
Steps	1) (Conceptual) Implement reward-model pairwise loss in 20 lines. 2) Use `trl.DPOTrainer`. 3) Load `Anthropic/hh-rlhf` or `Intel/orca_dpo_pairs`. 4) Run DPO on the SFT model from Lab 01. 5) Eval before/after on a preference test set + MT-Bench-lite.
Stack	`trl.DPOTrainer`, `transformers`, `peft`
Datasets	`Anthropic/hh-rlhf`, `argilla/distilabel-intel-orca-dpo-pairs`, `HuggingFaceH4/ultrafeedback_binarized`
Output	A DPO-trained model with measurable preference-win-rate improvement.
How to Test	Pairwise win rate vs SFT baseline > 60% on held-out preference pairs.
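Talking Points	Why DPO doesn't need a separate reward model (the KL-regularized RLHF objective has a closed-form optimal policy, so the reward can be rewritten in terms of the policy and substituted into the Bradley-Terry likelihood). β controls deviation from reference. Why DPO is often more stable than PPO. Compare DPO vs IPO vs KTO.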
Resume Bullet	"Implemented DPO preference optimization on a Qwen2-SFT checkpoint using HH-RLHF; achieved 67% pairwise win-rate vs SFT baseline on held-out preferences with β=0.1 and a 4× lower compute footprint than PPO."
Extensions	Try IPO (regularizes against overfitting to the preference data); try KTO (works with unpaired data); analyze reward hacking.
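
Step 1 asks for the reward-model pairwise loss in about 20 lines. One possible sketch is below, written on plain floats so the math is visible, with the DPO loss next to it for comparison. The function names and the example log-probabilities are hypothetical; in practice the inputs would be batched tensors and the log-probs would be summed over response tokens.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def reward_pairwise_loss(r_chosen, r_rejected):
    """Bradley-Terry negative log-likelihood that the chosen response wins."""
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss from sequence log-probs under the policy (pi_*) and the frozen reference (ref_*)."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -log_sigmoid(beta * margin)


# At initialization the policy equals the reference, so the margin is 0 and the loss is log 2.
init_loss = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# After the policy raises the chosen log-prob relative to the reference, the loss falls.
improved_loss = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

The two losses have the same shape: DPO replaces the reward difference with β times the difference of policy-to-reference log-ratios, which is where β's role in limiting deviation from the reference shows up.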

Deliverables Checklist

SFT-trained small model with eval comparison
QLoRA fine-tune of 7B on 24 GB GPU
5k-example synthetic instruction dataset
DPO-trained model with preference win-rate report

Interview Relevance

"Compare SFT, RLHF, DPO"
"Walk through LoRA math"
"Why does QLoRA work? What's NF4?"
"Derive the DPO loss"
"How would you build a preference dataset?"
