🛸 Hitchhiker's Guide — Phase 6: Fine-Tuning & Instruction Tuning
Read this if: You can pretrain a small LM, but you don't yet know the difference between SFT, RLHF, DPO, ORPO; you've heard "LoRA" but can't write its math; or you can't explain why QLoRA lets you fine-tune 70B on a single A100.
0. The 30-second mental model
A pretrained "base" model is a calculator that loves to complete the most likely text. To turn it into a useful assistant, you do post-training in 1–3 stages:
- SFT (Supervised Fine-Tuning): train on (prompt, ideal_response) pairs to teach the format and behavior. ~10k–1M examples.
- Preference learning (RLHF, DPO, ORPO): align outputs with human preferences using (prompt, chosen, rejected) triplets. The model learns subtle quality, helpfulness, and refusal behaviors that are easier to prefer than to write.
- (Optional) Constitutional AI / RLAIF: use an LLM to generate the preference labels at scale.
Plus a separate axis: how you fine-tune.
- Full fine-tune: update every parameter. Highest quality, biggest cost (memory + storage).
- LoRA (Low-Rank Adaptation): add tiny rank-r adapters; freeze base. ~100× fewer trainable parameters (so far less gradient and optimizer memory), near-equal quality.
- QLoRA: LoRA on top of a 4-bit quantized base. Lets you fine-tune 70B on one A100 80GB.
By the end of Phase 6 you should:
- Build an SFT dataset and run a real SFT job with HuggingFace trl's SFTTrainer.
- Derive LoRA's math; explain r and α.
- Configure QLoRA correctly (NF4, double-quant, paged optimizers).
- Explain DPO's loss derivation from PPO's optimum.
- Know when to fine-tune vs RAG vs prompt-engineer.
1. The post-training pipeline at a glance
Base model ──SFT on demos──► SFT model ──preference learning──► Aligned model
(lossy completer) (instruction follower) (helpful + harmless)
Real production stacks (OpenAI, Anthropic, Llama-3): SFT on millions of demos → DPO (or RLHF) on hundreds of thousands of preferences → optional rejection sampling, constitutional AI, red-teaming, eval gates.
2. Stage 1 — Supervised Fine-Tuning (SFT)
2.1 The data
Each example is (prompt, response). Crucially, loss is computed only on the response tokens, not the prompt. The prompt is conditioning context.
Common templates:
- ChatML / OpenAI format:

      <|im_start|>system
      You are a helpful assistant.<|im_end|>
      <|im_start|>user
      Explain attention.<|im_end|>
      <|im_start|>assistant
      Sure! Attention is a mechanism that...<|im_end|>

- Alpaca format:

      Below is an instruction...

      ### Instruction:
      Explain attention.

      ### Response:
      Sure! Attention is a mechanism that...

- Llama-3 format has its own special tokens.
The exact template MUST be consistent between training and inference. A common bug: training with one template, serving with another → garbled outputs.
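One way to avoid that bug is to route both training and serving through a single formatting function. Below is a minimal sketch for the ChatML layout shown earlier. The name format_chatml and its add_generation_prompt flag are illustrative, not a library API, and the exact whitespace around the special tokens is an assumption that should be checked against the target model's tokenizer.

```python
def format_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as one ChatML string."""
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here at inference.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain attention."},
    {"role": "assistant", "content": "Sure! Attention is a mechanism that..."},
]

# Training sees the whole conversation; serving sees everything but the answer.
train_text = format_chatml(conversation)
serve_prompt = format_chatml(conversation[:-1], add_generation_prompt=True)

# If the templates agree, the serving prompt is an exact prefix of the training text.
print(train_text.startswith(serve_prompt))
```

With Hugging Face tokenizers, tokenizer.apply_chat_template plays the same role when the model ships a chat template, which removes the need for a hand-written formatter.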
2.2 Loss masking
Compute loss only on the assistant's tokens. Implementation: build a labels tensor identical to input_ids, then set labels[i] = -100 for every token that's part of the prompt. PyTorch's cross_entropy ignores -100.
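trl's SFTTrainer can do this for you. In releases that provide DataCollatorForCompletionOnlyLM, you pass formatting_func to the trainer and a response_template to that collator, which masks every token before the template. The API has changed across trl releases, so check the version you have installed.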
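A dependency-free sketch of the masking rule, written with plain lists so the arithmetic is visible. build_labels and masked_cross_entropy are hypothetical helpers. Real training code would use tensors, and the one-position shift between logits and labels in a causal LM is left out for brevity.

```python
import math

IGNORE_INDEX = -100  # the sentinel PyTorch's cross_entropy skips by default


def build_labels(input_ids, prompt_len):
    """Copy input_ids, then blank out the first prompt_len positions."""
    labels = list(input_ids)
    for i in range(min(prompt_len, len(labels))):
        labels[i] = IGNORE_INDEX
    return labels


def masked_cross_entropy(logits, labels):
    """Mean negative log-likelihood over positions whose label is not ignored.

    logits: one list of vocabulary scores per position.
    labels: target token id per position, or IGNORE_INDEX.
    """
    total, count = 0.0, 0
    for scores, target in zip(logits, labels):
        if target == IGNORE_INDEX:
            continue  # prompt token: contributes nothing to the loss
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count if count else 0.0


# Three prompt tokens followed by two response tokens.
input_ids = [5, 8, 2, 7, 3]
labels = build_labels(input_ids, prompt_len=3)
print(labels)

# Uniform logits over a 10-token vocabulary: loss is log(10) on the two response tokens.
uniform_logits = [[0.0] * 10 for _ in input_ids]
print(masked_cross_entropy(uniform_logits, labels))
```

Only the two response positions enter the average, so changing the logits at prompt positions leaves the loss unchanged.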
2.3 The classic SFT datasets
- Alpaca (52k, GPT-3.5 generated) — historical baseline, low quality but shows the format.
- Dolly-15k (Databricks, 2023) — 15k human-written; permissively licensed. Used in Lab 02.
- OpenAssistant Conversations (OASST1, ~161k human-written messages in ~66k conversation trees).
- UltraChat — 1.5M GPT-3.5 conversations.
- ShareGPT — real ChatGPT conversations.
A common pattern at frontier labs: ~100k–1M examples, with ~70% LLM-generated and ~30% human-curated/filtered.
2.4 SFT hyperparameters that matter
- LR: small. ~1e-5 to 5e-5 for full fine-tune; ~1e-4 to 3e-4 for LoRA.
- Epochs: 1–3. SFT overfits fast. More epochs ≠ better.
- Batch size: large effective batch (64–256) via gradient accumulation.
- Schedule: cosine decay with a short warmup (~3% of steps).
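The schedule and the batch arithmetic from the list above can be sketched in a few lines. lr_at and effective_batch_size are illustrative helpers. Linear warmup and a floor of zero are assumptions, since warmup shape and minimum LR vary by setup.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_frac=0.03, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def effective_batch_size(per_device_batch, grad_accum_steps, num_devices=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_devices


total_steps, peak_lr = 1000, 2e-4  # a LoRA-range learning rate
for step in (0, 29, 30, 500, 999):
    print(step, lr_at(step, total_steps, peak_lr))

# A micro-batch of 4 with 32 accumulation steps behaves like a batch of 128.
print(effective_batch_size(per_device_batch=4, grad_accum_steps=32))
```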
3. Parameter-Efficient Fine-Tuning (PEFT)
3.1 Why PEFT exists
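Full fine-tuning a 70B model with AdamW in mixed precision needs ~140GB for bf16 weights, ~140GB for bf16 gradients, ~560GB for fp32 AdamW state (two moments at 4 bytes per parameter), and ~10–50GB for activations. That is roughly 850–900GB before counting any fp32 master copy of the weights, which is more than the 640GB of a full node of eight A100 80GBs unless you shard or offload. Most practitioners cannot afford this.
PEFT methods freeze the base and train tiny additions. For mergeable adapters such as LoRA, the merged model at inference is identical in size to the base; only the adapter (~few hundred MB) needs to be stored per fine-tune.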
3.2 LoRA — Low-Rank Adaptation (Hu et al., 2021)
Key observation: empirically, fine-tuning updates ΔW to weight matrices have low intrinsic rank. So decompose ΔW as the product of two thin matrices:
$$ W_{\text{eff}} = W_0 + \Delta W = W_0 + B A $$
where A ∈ ℝ^{r×k}, B ∈ ℝ^{d×r}, and r ≪ min(d, k). Only A and B train; W_0 is frozen.
Forward pass:
$$ y = W_0 x + (α/r) \cdot B (A x) $$
The α/r is the LoRA scaling. Convention: α = 2r (so the scaling is 2), but it's tunable — it controls how strongly the adapter influences the output.
Parameter savings
For a d × k = 4096 × 4096 weight: full update = 16.8M params. LoRA r = 16: 16 × (4096 + 4096) = 131k params. 128× fewer. Apply LoRA to all attention QKV+O and MLP up/down/gate: 7 matrices/layer × 32 layers = 224 matrices. If every matrix were 4096 × 4096, the adapter would total ≈ 30M params for a 7B model; with real MLP widths (e.g. 11008 in Llama-2-7B) it is closer to 40M. Either way, optimizer states for the adapter fit in <1GB.
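The counting can be checked with a few lines of Python. This is a sketch: lora_params and adapter_params are hypothetical helpers, the uniform case treats every target matrix as 4096 × 4096, and the second case assumes Llama-2-7B-style widths (hidden size 4096, MLP width 11008), which are not taken from this guide.

```python
def lora_params(d_out, d_in, r):
    """Trainable parameters in one LoRA pair: B is d_out x r, A is r x d_in."""
    return r * (d_out + d_in)


def adapter_params(shapes, r, n_layers):
    """Total adapter parameters when every layer adapts the same set of matrices."""
    return n_layers * sum(lora_params(d_out, d_in, r) for d_out, d_in in shapes)


r, n_layers, hidden = 16, 32, 4096

full = hidden * hidden
one_pair = lora_params(hidden, hidden, r)
print(full, one_pair, full / one_pair)

# Uniform estimate: all seven adapted matrices are hidden x hidden.
uniform_shapes = [(hidden, hidden)] * 7
print(adapter_params(uniform_shapes, r, n_layers))

# Assumed Llama-2-7B-style shapes: q, k, v, o are square; gate/up widen; down narrows.
mlp = 11008
llama_like_shapes = [(hidden, hidden)] * 4 + [(mlp, hidden)] * 2 + [(hidden, mlp)]
print(adapter_params(llama_like_shapes, r, n_layers))
```

With wider MLP matrices the adapter comes out larger than the uniform estimate, but in both cases it is a fraction of a percent of the 7B base.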
Initialization
A is initialized with kaiming_uniform and B is initialized to zero. So BA = 0 at the start, the adapter contributes nothing, and the model behaves exactly like the base. Loss starts at the base model's loss; training improves from there.
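A tiny numeric sketch of the forward pass and the zero-init property, using plain lists instead of tensors. The dimensions, the seed, and the uniform range standing in for kaiming_uniform are arbitrary choices for illustration.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W0, A, B, x, alpha, r):
    """y = W0 x + (alpha / r) * B (A x); W0 is frozen, A and B are trainable."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


random.seed(0)
d, k, r, alpha = 4, 3, 2, 4  # alpha = 2r, so the scaling is 2
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.uniform(-0.5, 0.5) for _ in range(k)] for _ in range(r)]  # stand-in init
B = [[0.0] * r for _ in range(d)]                                      # zero init
x = [1.0, -2.0, 0.5]

# With B = 0 the adapted layer reproduces the base layer exactly.
y_start = lora_forward(W0, A, B, x, alpha, r)
same_as_base = y_start == matvec(W0, x)
print(same_as_base)

# Once training moves B away from zero, the output shifts by (alpha / r) * B (A x).
B[0][0] = 1.0
y_after = lora_forward(W0, A, B, x, alpha, r)
delta0 = y_after[0] - y_start[0]
print(delta0, (alpha / r) * matvec(A, x)[0])
```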
Where to apply LoRA
Hu et al. applied LoRA only to W_q and W_v. Modern practice: apply to all attention and MLP projections (q_proj, k_proj, v_proj, o_proj, up_proj, gate_proj, down_proj). More targets means more adapter params and usually better quality. Lab 02 uses this set.
Choosing r
Typical: 8, 16, 32, 64. Bigger r = more capacity to fit the new task. r = 16 is a great default. For very different downstream tasks (e.g., teaching a new language), r = 64 may help.
3.3 QLoRA (Dettmers et al., 2023)
QLoRA = LoRA on top of a 4-bit quantized base model. Three innovations:
- NF4 (NormalFloat-4): a 4-bit datatype whose quantization levels are chosen to be information-theoretically optimal for normal-distributed data. Pretrained weights are approximately N(0, σ), so NF4 minimizes quantization error in the relevant range. (Standard 4-bit integer quantization wastes bits on values that rarely occur.)
- Double quantization: the per-block quantization scales themselves are quantized, saving another ~0.4 bits/param on average.
- Paged optimizers: optimizer state pages move between GPU and CPU memory via NVIDIA's Unified Memory, avoiding OOM spikes during gradient checkpointing.
End result: fine-tune 70B on a single A100 80GB at near-equal quality to full fp16 fine-tuning. Bombshell paper.
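A back-of-the-envelope check on the storage numbers. The block size of 64, 32-bit first-level scales, and 8-bit second-level scales in blocks of 256 are taken from the QLoRA paper's description and should be treated as assumptions here. Adapters, activations, and optimizer state are not counted.

```python
def bits_per_param(block=64, scale_bits=32, double_quant=False,
                   dq_scale_bits=8, dq_block=256):
    """Storage cost of 4-bit blockwise quantization, including scale overhead."""
    if double_quant:
        # 8-bit scale per block, plus one 32-bit scale per dq_block of those scales.
        overhead = dq_scale_bits / block + scale_bits / (block * dq_block)
    else:
        # One 32-bit scale per block of weights.
        overhead = scale_bits / block
    return 4 + overhead


def weight_gb(n_params, bits):
    """Gigabytes needed to store n_params weights at the given bits per weight."""
    return n_params * bits / 8 / 1e9


plain = bits_per_param()
double = bits_per_param(double_quant=True)
print(plain, double, plain - double)

n = 70e9
print(weight_gb(n, 16))      # bf16 base
print(weight_gb(n, double))  # NF4 base with double quantization
```

Under these assumptions the quantized 70B base takes well under half of an 80GB card, which leaves room for the adapter, its optimizer state, and activations.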
In Lab 02 you'll set up QLoRA via:
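
    import torch
    from transformers import BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # store base weights in 4 bits
        bnb_4bit_quant_type="nf4",              # NormalFloat-4 levels
        bnb_4bit_use_double_quant=True,         # also quantize the per-block scales
        bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
    )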
3.4 Other PEFT methods
- Prefix tuning / Prompt tuning — train soft "virtual tokens" prepended to inputs. Older, less popular than LoRA now.
- (IA)³ — scale activations by learned vectors. Tiny but limited capacity.
- DoRA (Liu et al. 2024) decomposes weight updates into magnitude + direction; small quality bump over LoRA at the same r.
4. Stage 2 — Preference Learning
4.1 The data
Triplets (prompt, chosen_response, rejected_response). Sources:
- Human annotators ranking pairs (most expensive, highest signal).
- AI judges (RLAIF) — cheap; quality bounded by judge.
- Self-rejection sampling — generate multiple, score with a reward model, keep best/worst.
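The third source can be sketched as follows. build_preference_pair is a hypothetical helper and toy_score is a stand-in for a real reward model (it just counts words), so only the data flow is meaningful.

```python
def build_preference_pair(prompt, candidates, score_fn):
    """Score every candidate once and return the best and worst as a triplet."""
    scored = sorted(((score_fn(prompt, y), y) for y in candidates),
                    key=lambda pair: pair[0])
    (worst_score, worst), (best_score, best) = scored[0], scored[-1]
    if best_score == worst_score:
        return None  # all candidates tie, so there is no preference signal
    return {"prompt": prompt, "chosen": best, "rejected": worst}


def toy_score(prompt, response):
    # Hypothetical stand-in for a reward model: prefers longer answers.
    return len(response.split())


candidates = [
    "Attention mixes tokens.",
    "Attention lets each token weigh every other token when building its representation.",
    "It is a thing.",
]
pair = build_preference_pair("Explain attention.", candidates, toy_score)
print(pair)
```

In a real pipeline the candidates would be sampled from the current model and the scorer would be a trained reward model or an LLM judge.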
4.2 RLHF (PPO) — the original recipe
Three steps:
- Train a reward model r_φ(x, y): a small head on top of the SFT model that outputs a scalar. Trained on preferences with the Bradley-Terry loss:
  $$ \mathcal{L}_{RM} = -\log \sigma\left(r(x, y_w) - r(x, y_l)\right) $$
  where y_w is chosen, y_l is rejected.
- Use PPO (Proximal Policy Optimization, Schulman et al. 2017) to optimize the SFT model against the reward, with a KL penalty to a frozen reference (the SFT model itself). The objective, which is maximized rather than minimized, is:
  $$ \mathcal{L}_{RLHF} = \mathbb{E}_{y \sim \pi(\cdot|x)}\left[r(x, y)\right] - β \, D_{KL}\left(\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x)\right) $$
  The KL prevents the policy from drifting too far and reward-hacking.
- Generate rollouts, score with reward model, run PPO updates. Repeat.
PPO is complex: ~7 hyperparams; unstable; needs distributed rollout infrastructure that juggles the policy, the frozen reference, the reward model, and a value model; reward hacking is real (the model finds adversarial paths to high reward). Cost is huge: 4× SFT cost easily.
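The two formulas from the steps above reduce to a few lines of scalar arithmetic. This is a sketch with hypothetical helper names. kl_shaped_reward uses the single-sample estimate log π(y|x) − log π_ref(y|x) for the KL term, which is one common choice and not necessarily what a given PPO implementation does.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def bradley_terry_loss(r_chosen, r_rejected):
    """Reward-model loss for one preference pair."""
    return -log_sigmoid(r_chosen - r_rejected)


def kl_shaped_reward(reward, logp_policy, logp_ref, beta):
    """Reward for one sampled response, minus beta times a one-sample KL estimate.

    logp_policy / logp_ref: summed token log-probs of the sampled response.
    """
    return reward - beta * (logp_policy - logp_ref)


# A reward model that cannot tell the pair apart pays log(2).
print(bradley_terry_loss(0.0, 0.0))
# Ranking the pair correctly is cheap; ranking it backwards is expensive.
print(bradley_terry_loss(2.0, 0.0), bradley_terry_loss(0.0, 2.0))
# Drifting above the reference log-prob costs reward.
print(kl_shaped_reward(1.0, logp_policy=-8.0, logp_ref=-10.0, beta=0.1))
```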
4.3 DPO — Direct Preference Optimization (Rafailov et al., 2023)
Insight: you can derive PPO's optimal policy in closed form (assuming the KL-constrained reward objective), and inverting that derivation gives a contrastive loss directly on (chosen, rejected) pairs. No reward model. No rollouts. Just SFT-like training.
Loss:
$$ \mathcal{L}_{DPO} = -\log \sigma\!\left(β \log \frac{\pi(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - β \log \frac{\pi(y_l | x)}{\pi_{\text{ref}}(y_l | x)}\right) $$
Where π is the trainable policy and π_ref is the frozen SFT model. Intuitively: increase π's probability of chosen relative to ref, decrease for rejected.
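A compressed sketch of the derivation, following the argument in Rafailov et al. (2023). Z(x) is a normalizing constant introduced here, and regularity conditions are skipped. The KL-penalized reward objective from 4.2 has a closed-form maximizer:

$$ \pi^*(y|x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y|x) \exp\!\left(\frac{1}{β} r(x, y)\right), \qquad Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\!\left(\frac{1}{β} r(x, y)\right) $$

Taking logs and solving for the reward:

$$ r(x, y) = β \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + β \log Z(x) $$

The Bradley-Terry loss only depends on a reward difference for the same prompt, so the intractable Z(x) cancels:

$$ r(x, y_w) - r(x, y_l) = β \log \frac{\pi^*(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - β \log \frac{\pi^*(y_l|x)}{\pi_{\text{ref}}(y_l|x)} $$

Substituting this difference into the Bradley-Terry loss and replacing π* with the trainable π gives the DPO loss. This is also how PPO's KL penalty and DPO's reference model are related: the same β and π_ref appear in both, once as an explicit penalty and once inside the implicit reward.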
DPO has become a common default for new projects in 2024+. It is simpler and more stable than PPO, and it often matches or beats PPO in quality. Llama-3 instruct's post-training includes DPO.
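The per-example loss is short enough to write out by hand. A minimal sketch in plain Python, assuming the four inputs are summed token log-probs of each response under the policy and the reference; dpo_loss is an illustrative name and beta = 0.1 is an arbitrary default.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair.

    logp_*: log pi(y|x) under the trainable policy.
    ref_logp_*: log pi_ref(y|x) under the frozen reference.
    """
    chosen_logratio = logp_w - ref_logp_w
    rejected_logratio = logp_l - ref_logp_l
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# At initialization the policy equals the reference, so the loss is log(2).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
# Raising the chosen response's log-prob relative to the reference lowers the loss.
print(dpo_loss(-10.0, -15.0, -12.0, -15.0))
# Raising the rejected response's log-prob instead increases it.
print(dpo_loss(-12.0, -13.0, -12.0, -15.0))
```

Because only log-ratios against the reference enter the loss, the reference log-probs can be computed once and cached before training.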
4.4 ORPO (Hong et al., 2024)
Combines SFT and preference learning in a single stage. Loss = standard cross-entropy on chosen + odds-ratio penalty against rejected. Skips the SFT-then-DPO sequence; one-shot post-training.
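A sketch of that combined loss for one example, as described in Hong et al. (2024): p is the length-normalized likelihood of a response, odds(p) = p / (1 − p), and the penalty is −log σ of the log odds ratio. The names orpo_loss, log_odds, and lam, and the weight 0.1, are illustrative assumptions.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def log_odds(avg_logp):
    """log(p / (1 - p)) for p = exp(avg_logp); avg_logp must be negative."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """Cross-entropy on the chosen response plus a weighted odds-ratio penalty.

    avg_logp_*: mean per-token log-prob of each response under the policy.
    No reference model appears anywhere in the loss.
    """
    sft = -avg_logp_chosen
    log_odds_ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    penalty = -log_sigmoid(log_odds_ratio)
    return sft + lam * penalty


# Equal likelihoods: the penalty term is log(2).
print(orpo_loss(math.log(0.5), math.log(0.5)))
# Same chosen likelihood, less likely rejected response: smaller penalty.
print(orpo_loss(math.log(0.6), math.log(0.2)), orpo_loss(math.log(0.6), math.log(0.6)))
```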
4.5 Constitutional AI / RLAIF (Bai et al., 2022 — Anthropic)
Use an LLM to critique and revise outputs against a written "constitution" of principles, generating preference pairs at scale without humans. Anthropic's main alignment recipe.
4.6 References
- Christiano et al. (2017), Deep RL from Human Preferences — the first RLHF paper.
- Stiennon et al. (2020), Learning to Summarize with Human Feedback — first compelling RLHF for LLMs.
- Ouyang et al. (2022), Training Language Models to Follow Instructions with Human Feedback — InstructGPT, the basis of ChatGPT.
- Schulman et al. (2017), Proximal Policy Optimization Algorithms.
- Rafailov et al. (2023), Direct Preference Optimization.
- Hong et al. (2024), ORPO: Monolithic Preference Optimization without Reference Model.
- Bai et al. (2022), Constitutional AI: Harmlessness from AI Feedback.
5. The lab walkthrough (lab-02-lora-qlora)
5.1 What you'll build
Fine-tune Mistral-7B (or similar 7B base) on Dolly-15k with QLoRA:
- Load 4-bit quantized base via BitsAndBytesConfig.
- Configure LoRA with r=16, α=32, applied to attention QKV+O and MLP up/down/gate.
- Use paged_adamw_8bit optimizer.
- Train with SFTTrainer for 1 epoch, ~30 minutes on a single A100 40GB.
- Save the LoRA adapter (~100MB).
- Inference: load the base + adapter, generate.
5.2 Things to read carefully
- The exact target_modules list: this depends on the model architecture. For Llama/Mistral: ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "gate_proj", "down_proj"].
- The prepare_model_for_kbit_training() call: disables some incompatible features and casts the LM head to fp32 for numerical stability.
- The formatting_func and response_template: these tell SFTTrainer how to mask labels.
- The merge step (model.merge_and_unload()): fuses adapter weights into the base for deployment.
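What the merge step does to a single weight matrix can be shown with a toy example. This sketch only illustrates the arithmetic W_0 + (α/r)·BA on plain lists; merge_lora is a hypothetical helper, not peft's implementation, and the numbers are made up.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W0, A, B, x, alpha, r):
    """Unmerged path: base output plus the scaled low-rank update."""
    scale = alpha / r
    update = matvec(B, matvec(A, x))
    return [b + scale * u for b, u in zip(matvec(W0, x), update)]


def merge_lora(W0, A, B, alpha, r):
    """Return W0 + (alpha / r) * B A as a new dense matrix."""
    scale = alpha / r
    d, k = len(W0), len(W0[0])
    return [[W0[i][j] + scale * sum(B[i][t] * A[t][j] for t in range(r))
             for j in range(k)] for i in range(d)]


W0 = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 2.0]]        # r x k with r = 1
B = [[0.5], [-1.0]]     # d x r
alpha, r = 2, 1
x = [1.0, 1.0]

W_merged = merge_lora(W0, A, B, alpha, r)
y_adapter = lora_forward(W0, A, B, x, alpha, r)
y_merged = matvec(W_merged, x)
print(W_merged)             # [[2.0, 2.0], [-2.0, -3.0]]
print(y_adapter, y_merged)  # both [4.0, -5.0]
```

After merging, inference is a single matrix multiply with no adapter overhead, and the α/r scaling is baked into the merged weights. The trade-off is that a merged model can no longer hot-swap adapters.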
5.3 Sanity checks
- Initial loss should be the base model's loss on the format (~2–3).
- Loss should drop to ~1.0–1.5 by epoch end on Dolly.
- Generated responses should be grammatical and follow Dolly's tone.
6. When to fine-tune (vs RAG vs prompt)
| Need | Best tool |
|---|---|
| Add new factual knowledge | RAG (most cases); fine-tune for very narrow, large, stable domains |
| Change output format / style | Fine-tune (small SFT) |
| Improve general capability | Fine-tune (DPO on preferences) |
| Adapt to a new language | Fine-tune (continued pretraining + SFT) |
| Per-tenant customization | LoRA adapters per tenant; hot-swap at serving |
| Personalization | Usually prompt + retrieval; rarely fine-tune |
| Compliance / safety | Fine-tune (RLHF/DPO with refusal data) |
A common mistake: trying to fine-tune in facts that change weekly. Use RAG.
7. Common interview questions on Phase 6 material
- Walk through SFT, RLHF, and DPO. When would you use each?
- Derive LoRA's math. Explain r and α.
- What's NF4 and why is it better than INT4?
- Why does QLoRA let you fine-tune 70B on one GPU?
- Compare PPO's KL penalty and DPO's reference model — they're related, how?
- What's reward hacking and how do you mitigate it?
- When would you fine-tune instead of RAG?
- How do you mask labels for SFT so the model doesn't train on the prompt?
- Sketch the Bradley-Terry reward model loss.
- What's the role of α / r scaling in LoRA at inference?
- How would you serve 100 different LoRA adapters in production? (Bridges to Phase 9.)
- Why is constitutional AI scalable in a way RLHF isn't?
8. From solid → exceptional
- Implement LoRA from scratch (no peft): wrap an nn.Linear so its forward adds a low-rank update. Confirm gradient flow only into the adapter.
- Implement the DPO loss in pure PyTorch (no trl); compute against a tiny preference dataset. Verify against trl reference.
- Run a side-by-side SFT vs SFT+DPO vs SFT+ORPO on the same base, evaluate with MT-Bench. Report numbers.
- Implement rejection sampling with reward model: generate 16 responses per prompt, score with a separate RM, keep top-1. Compare to base sampling.
- Read the Constitutional AI paper and write a one-page summary; sketch how you'd build a small CAI loop on a 1B model.
- Train multiple LoRA adapters for different tasks; demonstrate hot-swapping at inference (e.g., via peft's set_adapter).
9. Recommended cadence
| Day | Activity |
|---|---|
| Mon | Read Hu et al. 2021 (LoRA) + Dettmers et al. 2023 (QLoRA) |
| Tue | Read Ouyang et al. 2022 (InstructGPT) — skim PPO algorithm |
| Wed | Read Rafailov et al. 2023 (DPO) carefully; trace the derivation |
| Thu | Lab 02 — get QLoRA fine-tune running; save adapter |
| Fri | Inference with adapter; merge; compare base vs fine-tuned outputs |
| Sat | Implement DPO loss from scratch on a toy dataset |
| Sun | Mock interview the 12 questions; whiteboard LoRA |