The frontier-lab post-training stack — SFT → reward model → preference optimization — is what turns a base LM into Claude / ChatGPT / Gemini. Anthropic's "Production Model Post-Training" role explicitly asks for hands-on experience with this exact pipeline.
You will fine-tune a real 7B model on a single 24 GB GPU using QLoRA, then run DPO with a preference dataset, and produce a quantitative before/after eval.
Fine-tune a small base model (e.g., Qwen2-0.5B or Phi-3-mini) on an instruction dataset.
Concepts
Chat templates, prompt-response loss masking, padding strategies, eval during training.
Steps
1) Load Qwen2-0.5B base. 2) Load databricks/databricks-dolly-15k or OpenAssistant/oasst1. 3) Apply chat template. 4) Mask loss on prompt tokens. 5) Train 1–2 epochs with HF Trainer. 6) Eval on held-out instructions qualitatively + with MT-Bench-lite.
Stack
HF transformers, datasets, trl.SFTTrainer, W&B
Datasets
dolly-15k (15k examples), oasst1, alpaca-cleaned
Output
A fine-tuned checkpoint that follows instructions noticeably better than the base.
How to Test
Side-by-side generation on 20 held-out prompts; manual rating + MT-Bench-lite.
Talking Points
Why mask loss on prompt tokens. Why chat templates matter (token-level boundary marking). Catastrophic-forgetting risk.
Resume Bullet
"Performed supervised fine-tuning of Qwen2-0.5B on dolly-15k with chat-template-correct loss masking; lifted instruction-following win rate vs base from 23% to 71% on a 50-prompt human eval."
Extensions
Add domain-specific synthetic data (preview of Capstone 4).
1) Load 7B base in 4-bit (BitsAndBytesConfig NF4). 2) Wrap with LoraConfig (r=16, alpha=32). 3) Train on a domain dataset (legal Q&A, code, medical — your choice). 4) Save adapter (only ~50 MB). 5) Merge + reload for inference. 6) Compare param-count overhead.
Stack
transformers, peft, trl, bitsandbytes, accelerate
Datasets
Pick a domain — nvidia/HelpSteer2 for general; code_alpaca_20k for code; etc.
VRAM stays under 22 GB during training; perplexity improves on held-out domain data.
Talking Points
LoRA math (rank decomposition reduces params from d² to 2dr). Why QLoRA = 4-bit base + 16-bit adapters. When to use higher rank. Why NF4 > FP4.
Resume Bullet
"Fine-tuned Llama-3-8B with QLoRA (NF4 + LoRA r=16) on a 24 GB consumer GPU, training only 0.18% of parameters; achieved 14% perplexity reduction on held-out domain data with 52 MB adapter footprint."
Extensions
Try LoRA+ (different LR for B vs A); try DoRA (decomposed LoRA).
Build a 5k-example domain instruction dataset with synthetic generation + filtering.
Concepts
Self-Instruct, Evol-Instruct, distillation from a stronger model, dedup, quality filtering, contamination checks.
Steps
1) Seed with 50 hand-written examples. 2) Use a stronger model (Claude / GPT-4 / open Llama-3-70B via Together) to generate variations. 3) Dedup via MinHash or embedding similarity. 4) Filter by length / language / quality heuristics. 5) Output JSONL with {instruction, input, output}.
Stack
OpenAI / Anthropic API or Together AI, datasketch, sentence-transformers
Datasets
Your own seed
Output
A 5k-example JSONL with a quality report.
How to Test
Manual rating on a 50-example sample; downstream Lab 02 finetune improves vs baseline data.
Talking Points
Synthetic-data risks (mode collapse, model bias inheritance). Why dedup matters. License implications of distillation.
Resume Bullet
"Built a 5k-example domain-specific instruction dataset via self-instruct + MinHash dedup + length/quality filters; downstream SFT showed 9-point lift over a generic dataset baseline."
1) (Conceptual) Implement reward-model pairwise loss in 20 lines. 2) Use trl.DPOTrainer. 3) Load Anthropic/hh-rlhf or Intel/orca_dpo_pairs. 4) Run DPO on the SFT model from Lab 1. 5) Eval before/after on a preference test set + MT-Bench-lite.
A DPO-trained model with measurable preference-win-rate improvement.
How to Test
Pairwise win rate vs SFT baseline > 60% on held-out preference pairs.
Talking Points
Why DPO doesn't need a separate reward model (closed-form policy from BT preferences). β controls deviation from reference. Why DPO is more stable than PPO. Compare DPO vs IPO vs KTO.
Resume Bullet
"Implemented DPO preference optimization on a Qwen2-SFT checkpoint using HH-RLHF; achieved 67% pairwise win-rate vs SFT baseline on held-out preferences with β=0.1 and a 4× lower compute footprint than PPO."
Extensions
Try IPO (handles preference noise); try KTO (works with unpaired data); analyze reward hacking.