🛸 Hitchhiker's Guide — Phase 6: Fine-Tuning & Instruction Tuning

Read this if: You can pretrain a small LM, but you don't yet know the difference between SFT, RLHF, DPO, ORPO; you've heard "LoRA" but can't write its math; or you can't explain why QLoRA lets you fine-tune 70B on a single A100.

0. The 30-second mental model

A pretrained "base" model is a calculator that loves to complete the most likely text. To turn it into a useful assistant, you do post-training in 1–3 stages:

SFT (Supervised Fine-Tuning): train on (prompt, ideal_response) pairs to teach the format and behavior. ~10k–1M examples.
Preference learning (RLHF, DPO, ORPO): align outputs with human preferences using (prompt, chosen, rejected) triplets. The model learns subtle quality, helpfulness, and refusal behaviors that are easier to prefer than to write.
(Optional) Constitutional AI / RLAIF — use an LLM to generate the preference labels at scale.

Plus a separate axis: how you fine-tune.

Full fine-tune: update every parameter. Highest quality, biggest cost (memory + storage).
LoRA (Low-Rank Adaptation): add tiny rank-r adapters; freeze base. ~100× fewer trainable parameters (so gradient and optimizer memory shrink accordingly), near-equal quality on many tasks.
QLoRA: LoRA on top of a 4-bit quantized base. Lets you fine-tune 70B on one A100 80GB.

By the end of Phase 6 you should:

Build an SFT dataset and run a real SFT job with HuggingFace trl's SFTTrainer.
Derive LoRA's math; explain r and α.
Configure QLoRA correctly (NF4, double-quant, paged optimizers).
Explain DPO's loss derivation from PPO's optimum.
Know when to fine-tune vs RAG vs prompt-engineer.

1. The post-training pipeline at a glance

Base model  ──SFT on demos──►  SFT model  ──preference learning──►  Aligned model
   (lossy completer)              (instruction follower)             (helpful + harmless)

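Real production stacks (OpenAI, Anthropic, Llama-3) follow roughly this shape: SFT on a large demonstration set (published sizes range from tens of thousands to millions of examples) → DPO (or RLHF) on up to hundreds of thousands of preferences → optional rejection sampling, constitutional AI, red-teaming, eval gates.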

2. Stage 1 — Supervised Fine-Tuning (SFT)

2.1 The data

Each example is (prompt, response). Crucially, loss is computed only on the response tokens, not the prompt. The prompt is conditioning context.

Common templates:

ChatML / OpenAI format:

<|im_start|>system
You are a helpful assistant.
<|im_end|>
<|im_start|>user
Explain attention.
<|im_end|>
<|im_start|>assistant
Sure! Attention is a mechanism that...
<|im_end|>

Alpaca format:

Below is an instruction...
### Instruction:
Explain attention.
### Response:
Sure! Attention is a mechanism that...

Llama-3 format has its own special tokens.

The exact template MUST be consistent between training and inference. A common bug: training with one template, serving with another → garbled outputs.
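
One way to avoid the template-mismatch bug is to route both training and inference through a single rendering function. The sketch below is a minimal, dependency-free illustration; `render_chatml` and its `add_generation_prompt` flag are hypothetical names chosen here, and the layout simply mirrors the ChatML example above.

```python
# Minimal ChatML renderer (illustrative; layout mirrors the example in this section).
IM_START, IM_END = "<|im_start|>", "<|im_end|>"

def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as one ChatML string."""
    parts = []
    for msg in messages:
        parts.append(f"{IM_START}{msg['role']}\n{msg['content']}\n{IM_END}\n")
    if add_generation_prompt:
        # At inference time, end with an open assistant turn for the model to fill.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)

train_text = render_chatml([
    {"role": "user", "content": "Explain attention."},
    {"role": "assistant", "content": "Sure! Attention is a mechanism that..."},
])
infer_text = render_chatml(
    [{"role": "user", "content": "Explain attention."}],
    add_generation_prompt=True,
)

# The inference prompt is an exact prefix of the training text,
# so the model sees the same framing in both settings.
assert train_text.startswith(infer_text)
```

In practice the tokenizer's own chat template (`apply_chat_template` in transformers) plays the same role; the point is that one function owns the format.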

2.2 Loss masking

Compute loss only on the assistant's tokens. Implementation: build a labels tensor identical to input_ids, then set labels[i] = -100 for every token that's part of the prompt. PyTorch's cross_entropy ignores -100.

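trl can do this for you. In trl versions that ship DataCollatorForCompletionOnlyLM, you pass a formatting_func to SFTTrainer and a response_template to that collator, which masks everything before the response marker. Newer releases handle completion-only loss through the trainer config instead, so check the docs for your installed version.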
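
To make the masking rule concrete, here is a small dependency-free sketch. It assumes the prompt length in tokens is already known and it skips the shift-by-one between inputs and targets that a real causal-LM loss applies. `build_labels` and `masked_nll` are illustrative helpers, not library functions, and the token ids and log-probs are made up.

```python
import math

IGNORE_INDEX = -100  # the default ignore_index of PyTorch's cross_entropy

def build_labels(input_ids, prompt_len):
    """Copy input_ids, then mask the first prompt_len positions."""
    return [IGNORE_INDEX] * prompt_len + list(input_ids[prompt_len:])

def masked_nll(token_logprobs, labels):
    """Mean negative log-likelihood over unmasked positions only."""
    kept = [-lp for lp, lab in zip(token_logprobs, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)

input_ids = [11, 12, 13, 21, 22]  # hypothetical ids: 3 prompt + 2 response tokens
labels = build_labels(input_ids, prompt_len=3)

# Made-up per-token log-probs: the model is bad on the prompt, decent on the response.
logprobs = [math.log(0.1)] * 3 + [math.log(0.5)] * 2
loss = masked_nll(logprobs, labels)  # only the two response tokens contribute
```

Because the prompt positions are excluded, the poor prompt log-probs have no effect on the loss or on the gradients.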

2.3 The classic SFT datasets

Alpaca (52k, GPT-3.5 generated) — historical baseline, low quality but shows the format.
Dolly-15k (Databricks, 2023) — 15k human-written; permissively licensed. Used in Lab 02.
OpenAssistant Conversations (OASST1): ~161k human-written messages across ~66k conversation trees.
UltraChat — 1.5M GPT-3.5 conversations.
ShareGPT — real ChatGPT conversations.

A common pattern at frontier labs: ~100k–1M examples, with ~70% LLM-generated and ~30% human-curated/filtered.

2.4 SFT hyperparameters that matter

LR: small. ~1e-5 to 5e-5 for full fine-tune; ~1e-4 to 3e-4 for LoRA.
Epochs: 1–3. SFT overfits fast. More epochs ≠ better.
Batch size: large effective batch (64–256) via gradient accumulation.
Cosine decay with short warmup (3% of steps).
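
As a rough illustration of the last two items, the sketch below computes a linear-warmup plus cosine-decay learning rate and an effective batch size. The function name `lr_at`, the decay-to-zero floor, and the example batch numbers are assumptions for illustration; trainers differ in details such as the minimum LR.

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_frac=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# LoRA-style peak LR from the range above, over a hypothetical 1000-step run.
schedule = [lr_at(s, total_steps=1000, peak_lr=2e-4) for s in range(1000)]

# Effective batch = per-device batch x gradient-accumulation steps x number of GPUs.
per_device_batch, grad_accum_steps, num_gpus = 4, 16, 1
effective_batch = per_device_batch * grad_accum_steps * num_gpus
```

With these example numbers the warmup lasts 30 steps and the effective batch is 64, the low end of the range suggested above.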

3. Parameter-Efficient Fine-Tuning (PEFT)

3.1 Why PEFT exists

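Full fine-tuning a 70B model with AdamW needs ~140GB for bf16 weights, ~140GB for gradients, and ~560GB for the two fp32 AdamW moment tensors (more if fp32 master weights are kept), plus ~10–50GB for activations. That is well over 800GB at peak, more than an 8× A100 80GB node (640GB) can hold without sharding across nodes or offloading. Most practitioners cannot afford this.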

PEFT methods freeze the base and train tiny additions. The full base + adapter at inference is identical in size to the base; only the adapter (~few hundred MB) needs to be stored per fine-tune.

3.2 LoRA — Low-Rank Adaptation (Hu et al., 2021)

Key observation: empirically, fine-tuning updates ΔW to weight matrices have low intrinsic rank. So decompose ΔW as the product of two thin matrices:

$$ W_{\text{eff}} = W_0 + \Delta W = W_0 + B A $$

where W_0 ∈ ℝ^{d×k}, A ∈ ℝ^{r×k}, B ∈ ℝ^{d×r}, and r ≪ min(d, k). Only A and B train; W_0 is frozen.

Forward pass:

$$ y = W_0 x + (α/r) \cdot B (A x) $$

The factor α/r is the LoRA scaling, so the update actually applied to the weights is (α/r)·BA. A common convention is α = 2r (a scaling of 2), but α is tunable: it controls how strongly the adapter influences the output.

Parameter savings

For a d × k = 4096 × 4096 weight: full update = ~16.8M params. LoRA with r = 16: 16 × (4096 + 4096) = 131k params, 128× fewer. Apply LoRA to all attention QKV+O and MLP up/down/gate: 7 matrices/layer × 32 layers = 224 matrices, for a total adapter of roughly 40M params on a 7B model (the MLP matrices are wider than 4096, so each costs more than 131k). AdamW states for those ~40M params fit in <1GB.
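
The arithmetic can be checked directly. The sketch below counts LoRA parameters for one square matrix and then for a whole model. The model shapes (hidden size 4096, MLP width 11008, 32 layers, full-width K/V projections) are an assumption modeled on a Llama-2-7B-style architecture; other 7B models differ slightly.

```python
def lora_params(d_out, d_in, r):
    """Trainable parameters in one LoRA pair: B is d_out x r, A is r x d_in."""
    return r * (d_in + d_out)

r = 16
full = 4096 * 4096                      # 16,777,216 params in the dense update
lora = lora_params(4096, 4096, r)       # 131,072 params in the low-rank update
ratio = full / lora                     # 128.0

# Assumed Llama-2-7B-style shapes (not taken from the text above).
hidden, mlp, layers = 4096, 11008, 32
per_layer = (
    4 * lora_params(hidden, hidden, r)  # q_proj, k_proj, v_proj, o_proj
    + 2 * lora_params(mlp, hidden, r)   # up_proj, gate_proj
    + lora_params(hidden, mlp, r)       # down_proj
)
total = per_layer * layers              # 39,976,960, i.e. about 40M

# AdamW keeps two fp32 moments per trainable parameter.
adamw_state_mb = total * 2 * 4 / 1e6    # about 320 MB
```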

Initialization

A is initialized with kaiming_uniform, B is initialized to zero. So BA = 0 at the start, the adapter contributes a zero update, and the model behaves exactly like the base. Loss starts at the base model's loss; training improves from there.
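
The forward pass and the zero-initialization property can be demonstrated without any framework. The following is a toy, list-based sketch: the `LoRALinear` class and `matvec` helper are written for this illustration only, and the plain uniform init stands in for kaiming_uniform. A real implementation would wrap an nn.Linear and use tensors.

```python
import random

def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

class LoRALinear:
    """y = W0 x + (alpha / r) * B (A x), with W0 frozen."""

    def __init__(self, W0, r, alpha, seed=0):
        rng = random.Random(seed)
        d, k = len(W0), len(W0[0])
        self.W0 = W0                             # frozen base weight, d x k
        bound = 1.0 / k ** 0.5                   # simple uniform init standing in for kaiming_uniform
        self.A = [[rng.uniform(-bound, bound) for _ in range(k)] for _ in range(r)]
        self.B = [[0.0] * r for _ in range(d)]   # zero init, so B A = 0 at step 0
        self.scale = alpha / r

    def forward(self, x):
        base = matvec(self.W0, x)
        update = matvec(self.B, matvec(self.A, x))  # two thin products, never a d x k matrix
        return [b + self.scale * u for b, u in zip(base, update)]

W0 = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]          # d = 2, k = 3
layer = LoRALinear(W0, r=1, alpha=2)
x = [1.0, 0.0, -1.0]

# At initialization the adapter is a no-op: the output equals the base model's.
assert layer.forward(x) == matvec(W0, x)
```

Computing B(Ax) rather than (BA)x keeps the extra cost at r·(d + k) multiply-adds per input instead of d·k.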

Where to apply LoRA

The Hu paper applied LoRA only to W_q and W_v. Modern practice: apply it to all attention and MLP projections (q_proj, k_proj, v_proj, o_proj, up_proj, gate_proj, down_proj). More targets = more adapter params = usually better quality, at the cost of a larger adapter. Lab 02 uses this set.

Choosing `r`

Typical: 8, 16, 32, 64. Bigger r = more capacity to fit the new task. r = 16 is a great default. For very different downstream tasks (e.g., teaching a new language), r = 64 may help.

3.3 QLoRA (Dettmers et al., 2023)

QLoRA = LoRA on top of a 4-bit quantized base model. Three innovations:

NF4 (NormalFloat-4): a 4-bit datatype whose quantization levels are chosen to be information-theoretically optimal for normal-distributed data. Pretrained weights are approximately N(0, σ²), so NF4 minimizes quantization error in the relevant range. (Standard 4-bit integer quantization wastes bits on values that rarely occur.)
Double quantization: the per-block quantization scales themselves are quantized, saving another ~0.4 bits/param on average.
Paged optimizers: optimizer state pages move between GPU and CPU memory via NVIDIA's Unified Memory, avoiding OOM spikes during gradient checkpointing.

End result: fine-tune 70B on a single A100 80GB (the paper's own headline was 65B on one 48GB GPU) at near-equal quality to full 16-bit fine-tuning. Bombshell paper.
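
A back-of-the-envelope estimate shows why the base model fits. The sketch below counts weight memory only; adapters, activations, and adapter optimizer state come on top. The block size of 64 and the ~0.127 bits/param overhead after double quantization are the figures reported in the QLoRA paper; `weight_gb` is a helper defined just for this example.

```python
def weight_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

n = 70e9
bf16_gb = weight_gb(n, 16)   # 140 GB: does not fit on one 80GB GPU
nf4_gb = weight_gb(n, 4)     # 35 GB for the 4-bit values themselves

# Block-wise quantization stores one scale per block of 64 weights.
# An fp32 scale costs 32 / 64 = 0.5 bit per parameter; double quantization
# shrinks that overhead to roughly 0.127 bit per parameter.
nf4_plain_gb = weight_gb(n, 4 + 0.5)     # about 39.4 GB
nf4_dq_gb = weight_gb(n, 4 + 0.127)      # about 36.1 GB
saved_gb = nf4_plain_gb - nf4_dq_gb      # about 3.3 GB saved by double quantization
```

That leaves roughly 40GB of an 80GB card for adapters, activations, and optimizer state, which is where paged optimizers and gradient checkpointing earn their keep.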

In Lab 02 you'll set up QLoRA via:

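import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat-4 quantization levels
    bnb_4bit_use_double_quant=True,         # also quantize the per-block scales
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)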

3.4 Other PEFT methods

Prefix tuning / Prompt tuning — train soft "virtual tokens" prepended to inputs. Older, less popular than LoRA now.
(IA)³ — scale activations by learned vectors. Tiny but limited capacity.
DoRA (Liu et al. 2024) — decomposes weight updates into magnitude + direction; small quality bump over LoRA at same r.

4. Stage 2 — Preference Learning

4.1 The data

Triplets (prompt, chosen_response, rejected_response). Sources:

Human annotators ranking pairs (most expensive, highest signal).
AI judges (RLAIF) — cheap; quality bounded by judge.
Self-rejection sampling — generate multiple, score with a reward model, keep best/worst.
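
The third source can be sketched in a few lines: score several sampled responses and keep the extremes as a preference pair. `make_preference_pair` is a hypothetical helper, and `toy_score` is a deliberately silly stand-in for a reward model (it just prefers longer text); in a real pipeline the candidates would come from the policy and the scores from a trained reward model.

```python
def make_preference_pair(prompt, candidates, score_fn):
    """Score every candidate and return a (prompt, chosen, rejected) triplet."""
    ranked = sorted(candidates, key=lambda y: score_fn(prompt, y))
    return {"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]}

def toy_score(prompt, response):
    """Toy stand-in for a reward model: longer is 'better'. Purely illustrative."""
    return len(response)

pair = make_preference_pair(
    "Explain attention.",
    [
        "It's a thing.",
        "Attention weights each token by its relevance to the query.",
        "No.",
    ],
    toy_score,
)
```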

4.2 RLHF (PPO) — the original recipe

Three steps:

Train a reward model r_φ(x, y): a small head on top of the SFT model that outputs a scalar. Trained on preferences with the Bradley-Terry loss: $$\mathcal{L}_{RM} = -\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$ where y_w is chosen, y_l is rejected.
Use PPO (Proximal Policy Optimization, Schulman et al. 2017) to optimize the SFT model against the reward, with a KL penalty to a frozen reference (the SFT model itself). The objective to maximize is: $$\mathcal{J}_{RLHF} = \mathbb{E}_{y \sim \pi(\cdot|x)}[r_\phi(x, y)] - \beta \, D_{KL}(\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x))$$ The KL term keeps the policy from drifting too far from the reference and reward-hacking.
Generate rollouts, score with reward model, run PPO updates. Repeat.
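
The two scalar quantities in this recipe are simple enough to write out. The sketch below shows the Bradley-Terry loss for one preference pair and a single-sample estimate of the KL-penalized reward for one rollout. It works on plain floats; the function names are illustrative, and a real implementation would operate on batched tensors and usually applies the KL penalty per token.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def reward_model_loss(r_chosen, r_rejected):
    """Bradley-Terry loss: -log sigma(r(x, y_w) - r(x, y_l))."""
    return -log_sigmoid(r_chosen - r_rejected)

def kl_shaped_reward(reward, logp_policy, logp_ref, beta):
    """Reward minus beta times the sampled log-ratio log pi(y|x) - log pi_ref(y|x).

    Averaged over rollouts y ~ pi, the log-ratio term estimates the KL divergence.
    """
    return reward - beta * (logp_policy - logp_ref)

# A reward model that ranks the pair correctly gets a small loss...
good = reward_model_loss(r_chosen=2.0, r_rejected=-1.0)
# ...and one that ranks it the wrong way round gets a large one.
bad = reward_model_loss(r_chosen=-1.0, r_rejected=2.0)

# A rollout the policy likes more than the reference does is penalized.
shaped = kl_shaped_reward(reward=1.0, logp_policy=-4.0, logp_ref=-5.0, beta=0.1)
```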

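PPO is complex: many interacting hyperparameters (clip range, KL coefficient, value-loss weight, rollout batch size, and more); unstable; needs rollout infrastructure and typically four models in memory (policy, reference, reward, value); reward hacking is real (the model finds adversarial paths to high reward). Cost is huge: several times the cost of SFT.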

4.3 DPO — Direct Preference Optimization (Rafailov et al., 2023)

Insight: you can derive PPO's optimal policy in closed form (assuming the KL-constrained reward objective), and inverting that derivation gives a contrastive loss directly on (chosen, rejected) pairs. No reward model. No rollouts. Just SFT-like training.
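
The derivation, sketched in three steps following Rafailov et al. (notation as in 4.2; Z(x) is a normalizing constant that depends only on the prompt). First, the KL-constrained reward objective has a closed-form optimum:

$$ \pi^*(y | x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y | x) \exp\!\left(\frac{1}{\beta} r(x, y)\right) $$

Second, solve for the reward in terms of the optimal policy:

$$ r(x, y) = \beta \log \frac{\pi^*(y | x)}{\pi_{\text{ref}}(y | x)} + \beta \log Z(x) $$

Third, substitute into the Bradley-Terry preference model. The reward difference makes the β log Z(x) terms cancel:

$$ P(y_w \succ y_l | x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi^*(y_l | x)}{\pi_{\text{ref}}(y_l | x)}\right) $$

Replacing π* with the trainable policy π and taking the negative log-likelihood over the preference data gives the DPO loss.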

Loss:

$$ \mathcal{L}_{DPO} = -\log \sigma\!\left(\beta \log \frac{\pi(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi(y_l | x)}{\pi_{\text{ref}}(y_l | x)}\right) $$

Where π is the trainable policy and π_ref is the frozen SFT model. Intuitively: increase π's probability of chosen relative to ref, decrease for rejected.
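
For a single preference pair, the loss reduces to a few lines of arithmetic on sequence log-probabilities. This is a minimal sketch on plain floats: `dpo_loss` and its argument names are chosen here for illustration, each log-prob is assumed to be the sum over the response tokens, and β = 0.1 is only a commonly used starting value.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (chosen, rejected) pair from summed sequence log-probs."""
    chosen_logratio = pi_logp_w - ref_logp_w      # log pi(y_w|x) / pi_ref(y_w|x)
    rejected_logratio = pi_logp_l - ref_logp_l    # log pi(y_l|x) / pi_ref(y_l|x)
    return -log_sigmoid(beta * (chosen_logratio - rejected_logratio))

# At the start of training the policy equals the reference, so both
# log-ratios are zero and the loss is log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# Raising the chosen response's probability relative to the reference lowers the loss.
better = dpo_loss(-8.0, -12.0, -10.0, -12.0)
```

Note that only log-ratios against the reference enter the loss, which is why the reference model's log-probs can be precomputed once per dataset.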

DPO has effectively replaced PPO as the default for new projects in 2024+. Simpler, more stable, often matches or beats PPO. Llama-3 instruct uses DPO.

Key papers for this stage:

Christiano et al. (2017), Deep Reinforcement Learning from Human Preferences: the first RLHF paper.
Stiennon et al. (2020), Learning to Summarize from Human Feedback: first compelling RLHF for LLMs.
Ouyang et al. (2022), Training Language Models to Follow Instructions with Human Feedback: InstructGPT, the basis of ChatGPT.
Schulman et al. (2017), Proximal Policy Optimization Algorithms.
Rafailov et al. (2023), Direct Preference Optimization.
Hong et al. (2024), ORPO: Monolithic Preference Optimization without Reference Model.
Bai et al. (2022), Constitutional AI: Harmlessness from AI Feedback.

5. The lab walkthrough (lab-02-lora-qlora)

5.1 What you'll build

Fine-tune Mistral-7B (or similar 7B base) on Dolly-15k with QLoRA:

Load 4-bit quantized base via BitsAndBytesConfig.
Configure LoRA with r=16, α=32, applied to attention QKV+O and MLP up/down/gate.
Use paged_adamw_8bit optimizer.
Train with SFTTrainer for 1 epoch, ~30 minutes on a single A100 40GB.
Save the LoRA adapter (~100MB).
Inference: load the base + adapter, generate.

5.2 Things to read carefully

The exact target_modules list — this depends on the model architecture. For Llama/Mistral: ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "gate_proj", "down_proj"].
The prepare_model_for_kbit_training() call: freezes the base weights, upcasts the remaining non-quantized parameters (such as layer norms) to fp32 for numerical stability, and by default turns on gradient checkpointing.
The formatting_func and response_template — these tell SFTTrainer how to mask labels.
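The merge step (model.merge_and_unload()): fuses adapter weights into the base for deployment. Merging into a 4-bit base adds rounding error, so a common approach is to reload the base in 16-bit before merging.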

5.3 Sanity checks

Initial loss should be the base model's loss on the format (~2–3).
Loss should drop to ~1.0–1.5 by epoch end on Dolly.
Generated responses should be grammatical and follow Dolly's tone.

6. When to fine-tune (vs RAG vs prompt)

Need	Best tool
Add new factual knowledge	RAG (most cases); fine-tune for very narrow, large, stable domains
Change output format / style	Fine-tune (small SFT)
Improve general capability	Fine-tune (DPO on preferences)
Adapt to a new language	Fine-tune (continued pretraining + SFT)
Per-tenant customization	LoRA adapters per tenant; hot-swap at serving
Personalization	Usually prompt + retrieval; rarely fine-tune
Compliance / safety	Fine-tune (RLHF/DPO with refusal data)

A common mistake: trying to bake facts that change weekly into the weights by fine-tuning. Use RAG instead.

7. Common interview questions on Phase 6 material

Walk through SFT, RLHF, and DPO. When would you use each?
Derive LoRA's math. Explain r and α.
What's NF4 and why is it better than INT4?
Why does QLoRA let you fine-tune 70B on one GPU?
Compare PPO's KL penalty and DPO's reference model — they're related, how?
What's reward hacking and how do you mitigate it?
When would you fine-tune instead of RAG?
How do you mask labels for SFT so the model doesn't train on the prompt?
Sketch the Bradley-Terry reward model loss.
What's the role of α / r scaling in LoRA at inference?
How would you serve 100 different LoRA adapters in production? (Bridges to Phase 9.)
Why is constitutional AI scalable in a way RLHF isn't?

8. From solid → exceptional

Implement LoRA from scratch (no peft): wrap an nn.Linear so its forward adds a low-rank update. Confirm gradient flow only into the adapter.
Implement the DPO loss in pure PyTorch (no trl); compute against a tiny preference dataset. Verify against trl reference.
Run a side-by-side SFT vs SFT+DPO vs SFT+ORPO on the same base, evaluate with MT-Bench. Report numbers.
Implement rejection sampling with reward model: generate 16 responses per prompt, score with a separate RM, keep top-1. Compare to base sampling.
Read the Constitutional AI paper and write a one-page summary; sketch how you'd build a small CAI loop on a 1B model.
Train multiple LoRA adapters for different tasks; demonstrate hot-swapping at inference (e.g., via peft's set_adapter).

9. Recommended cadence

Day	Activity
Mon	Read Hu et al. 2021 (LoRA) + Dettmers et al. 2023 (QLoRA)
Tue	Read Ouyang et al. 2022 (InstructGPT) — skim PPO algorithm
Wed	Read Rafailov et al. 2023 (DPO) carefully; trace the derivation
Thu	Lab 02 — get QLoRA fine-tune running; save adapter
Fri	Inference with adapter; merge; compare base vs fine-tuned outputs
Sat	Implement DPO loss from scratch on a toy dataset
Sun	Mock interview the 12 questions; whiteboard LoRA
