🛸 Hitchhiker's Guide — Phase 6: Fine-Tuning & Instruction Tuning

Read this if: You can pretrain a small LM, but you don't yet know the difference between SFT, RLHF, DPO, ORPO; you've heard "LoRA" but can't write its math; or you can't explain why QLoRA lets you fine-tune 70B on a single A100.


0. The 30-second mental model

A pretrained "base" model is a calculator that loves to complete the most likely text. To turn it into a useful assistant, you do post-training in 1–3 stages:

  1. SFT (Supervised Fine-Tuning): train on (prompt, ideal_response) pairs to teach the format and behavior. ~10k–1M examples.
  2. Preference learning (RLHF, DPO, ORPO): align outputs with human preferences using (prompt, chosen, rejected) triplets. The model learns subtle quality, helpfulness, and refusal behaviors that are easier to prefer than to write.
  3. (Optional) Constitutional AI / RLAIF — use an LLM to generate the preference labels at scale.

Plus a separate axis: how you fine-tune.

  • Full fine-tune: update every parameter. Highest quality, biggest cost (memory + storage).
  • LoRA (Low-Rank Adaptation): add tiny rank-r adapters; freeze base. ~100× fewer trainable parameters, so far less gradient and optimizer memory, at near-equal quality.
  • QLoRA: LoRA on top of a 4-bit quantized base. Lets you fine-tune 70B on one A100 80GB.

By the end of Phase 6 you should:

  • Build an SFT dataset and run a real SFT job with HuggingFace trl's SFTTrainer.
  • Derive LoRA's math; explain r and α.
  • Configure QLoRA correctly (NF4, double-quant, paged optimizers).
  • Explain DPO's loss derivation from PPO's optimum.
  • Know when to fine-tune vs RAG vs prompt-engineer.

1. The post-training pipeline at a glance

Base model  ──SFT on demos──►  SFT model  ──preference learning──►  Aligned model
   (lossy completer)              (instruction follower)             (helpful + harmless)

Real production stacks (OpenAI, Anthropic, Llama-3): SFT on millions of demos → DPO (or RLHF) on hundreds of thousands of preferences → optional rejection sampling, constitutional AI, red-teaming, eval gates.


2. Stage 1 — Supervised Fine-Tuning (SFT)

2.1 The data

Each example is (prompt, response). Crucially, loss is computed only on the response tokens, not the prompt. The prompt is conditioning context.

Common templates:

  • ChatML / OpenAI format:
    <|im_start|>system
    You are a helpful assistant.
    <|im_end|>
    <|im_start|>user
    Explain attention.
    <|im_end|>
    <|im_start|>assistant
    Sure! Attention is a mechanism that...
    <|im_end|>
    
  • Alpaca format:
    Below is an instruction...
    ### Instruction:
    Explain attention.
    ### Response:
    Sure! Attention is a mechanism that...
    
  • Llama-3 format has its own special tokens.

The exact template MUST be consistent between training and inference. A common bug: training with one template, serving with another → garbled outputs.
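One way to avoid template drift is to route both training text and inference prompts through the same rendering function. The sketch below is a minimal illustration for the ChatML layout; render_chatml is a hypothetical helper, and the exact placement of newlines around <|im_end|> is an assumption that should be checked against the target model's tokenizer.

```python
def render_chatml(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts as one ChatML string."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n")
    if add_generation_prompt:
        # At inference, end with an open assistant turn for the model to fill in.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain attention."},
]

# The same function produces training text and inference prompts,
# so the two cannot silently diverge.
train_text = render_chatml(
    messages
    + [{"role": "assistant", "content": "Sure! Attention is a mechanism that..."}]
)
infer_prompt = render_chatml(messages, add_generation_prompt=True)
```

With this layout the inference prompt is an exact prefix of the training text. In practice, HuggingFace tokenizers expose apply_chat_template for the same purpose.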

2.2 Loss masking

Compute loss only on the assistant's tokens. Implementation: build a labels tensor identical to input_ids, then set labels[i] = -100 for every token that's part of the prompt. PyTorch's cross_entropy ignores -100.

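trl can apply this mask for you: in trl releases that include DataCollatorForCompletionOnlyLM, you give SFTTrainer a formatting_func and pass that collator, built with a response_template, as the data_collator. Every token up to and including the template is set to -100.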
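The masking rule can be sketched without any framework. The snippet below is an illustration only: build_example and masked_mean_nll are hypothetical helpers, the token ids and per-token losses are made up, and the one-position shift between inputs and targets (which HuggingFace causal LM classes apply internally) is left out for simplicity.

```python
IGNORE_INDEX = -100  # the default ignore_index of PyTorch's cross_entropy


def build_example(prompt_ids, response_ids):
    """Concatenate prompt and response; mask prompt positions in labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def masked_mean_nll(token_nlls, labels):
    """Average per-token losses over unmasked positions only."""
    kept = [nll for nll, lab in zip(token_nlls, labels) if lab != IGNORE_INDEX]
    return sum(kept) / len(kept)


# Hypothetical token ids: 3 prompt tokens, 2 response tokens.
input_ids, labels = build_example([11, 12, 13], [21, 22])

# Made-up per-position losses; the large prompt values must not count.
loss = masked_mean_nll([9.0, 9.0, 9.0, 1.0, 3.0], labels)
```

Only the two response positions contribute, so the prompt's losses have no effect on the gradient.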

2.3 The classic SFT datasets

  • Alpaca (52k, GPT-3.5 generated) — historical baseline, low quality but shows the format.
  • Dolly-15k (Databricks, 2023) — 15k human-written; permissively licensed. Used in Lab 02.
  • OpenAssistant Conversations (161k human-written messages, arranged in conversation trees).
  • UltraChat — 1.5M GPT-3.5 conversations.
  • ShareGPT — real ChatGPT conversations.

A common pattern at frontier labs: ~100k–1M examples, with ~70% LLM-generated and ~30% human-curated/filtered.

2.4 SFT hyperparameters that matter

  • LR: small. ~1e-5 to 5e-5 for full fine-tune; ~1e-4 to 3e-4 for LoRA.
  • Epochs: 1–3. SFT overfits fast. More epochs ≠ better.
  • Batch size: large effective batch (64–256) via gradient accumulation.
  • Cosine decay with short warmup (3% of steps).
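The schedule and batch-size bullets above can be made concrete with a few lines of arithmetic. This is a sketch under assumptions: lr_at and effective_batch are hypothetical helpers, the decay goes all the way to zero, and the peak LR and step count are example values rather than recommendations.

```python
import math


def lr_at(step, total_steps, peak_lr, warmup_frac=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, round(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def effective_batch(per_device_batch, grad_accum_steps, num_gpus=1):
    """Number of examples contributing to each optimizer step."""
    return per_device_batch * grad_accum_steps * num_gpus


total = 1000
schedule = [lr_at(s, total, peak_lr=2e-4) for s in range(total + 1)]
batch = effective_batch(per_device_batch=4, grad_accum_steps=32)
```

Gradient accumulation is what lets a single GPU that only fits 4 sequences at a time still reach an effective batch in the 64–256 range.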

3. Parameter-Efficient Fine-Tuning (PEFT)

3.1 Why PEFT exists

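A 70B model needs ~140GB for bf16 weights and another ~140GB for bf16 gradients. AdamW keeps two fp32 moments per parameter (~560GB), mixed-precision training usually adds fp32 master weights (~280GB), and activations take a further ~10–50GB. That's over 1TB at peak, so more than a dozen A100 80GBs even with perfect sharding. Most practitioners cannot afford this.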

PEFT methods freeze the base and train tiny additions. Once the adapter is merged, the model at inference is the same size as the base; per fine-tune you only need to store the adapter (tens to a few hundred MB).

3.2 LoRA — Low-Rank Adaptation (Hu et al., 2021)

Key observation: empirically, fine-tuning updates ΔW to weight matrices have low intrinsic rank. So decompose ΔW as the product of two thin matrices:

$$ W_{\text{eff}} = W_0 + \Delta W = W_0 + B A $$

where A ∈ ℝ^{r×k}, B ∈ ℝ^{d×r}, and r ≪ min(d, k). Only A and B train; W_0 is frozen.

Forward pass:

$$ y = W_0 x + (α/r) \cdot B (A x) $$

The α/r is the LoRA scaling. Convention: α = 2r (so the scaling is 2), but it's tunable — it controls how strongly the adapter influences the output.

Parameter savings

For a d × k = 4096 × 4096 weight: full update = 16.8M params. LoRA r = 16: 16 × (4096 + 4096) = 131k params. 128× fewer. Apply LoRA to all attention QKV+O and MLP up/down/gate: 7 matrices/layer × 32 layers = 224 matrices. If every matrix were 4096 × 4096, the adapter would total ≈ 29M params; real 7B models have wider MLP projections, so expect roughly 40M. Optimizer states for those params fit in <1GB.
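The parameter arithmetic is easy to check directly. The sketch below uses a hypothetical helper, lora_params, and the simplifying assumption that all seven adapted matrices in each of 32 layers are 4096 × 4096.

```python
def lora_params(d, k, r):
    """Trainable parameters in one LoRA pair: B is d x r, A is r x k."""
    return r * (d + k)


d = k = 4096
r = 16
full = d * k                        # parameters in a dense update of W
per_matrix = lora_params(d, k, r)   # parameters in the low-rank update
ratio = full / per_matrix

# Simplifying assumption: 7 adapted matrices per layer, 32 layers, all square.
total_square = 7 * 32 * per_matrix

print(f"full={full:,} lora={per_matrix:,} ratio={ratio:.0f}x total={total_square:,}")
```

Real 7B architectures use non-square MLP projections, so plugging in the true (d, k) for each target module gives a somewhat higher total than the all-square estimate.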

Initialization

A is initialized randomly (kaiming_uniform in peft; the paper uses a Gaussian) and B is initialized to zero. So BA = 0 at the start, the adapter contributes a zero update, and the model behaves exactly like the base. Loss starts at the base model's loss; training improves from there.
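The forward pass and the effect of zero-initializing B can be seen in a toy example. This is a minimal sketch with plain Python lists standing in for tensors; matvec and lora_forward are hypothetical helpers, the dimensions are tiny, and uniform random values stand in for the real initializer.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def lora_forward(W0, A, B, x, alpha, r):
    """y = W0 x + (alpha / r) * B (A x); W0 is never modified."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


random.seed(0)
d, k, r, alpha = 4, 3, 2, 4
W0 = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(d)]
A = [[random.uniform(-1, 1) for _ in range(k)] for _ in range(r)]  # random init
B = [[0.0] * r for _ in range(d)]                                   # zero init
x = [1.0, -2.0, 0.5]

y_base = matvec(W0, x)
y_init = lora_forward(W0, A, B, x, alpha, r)   # B A = 0, so this matches y_base

B[0][0] = 0.1                                   # pretend training changed B
y_trained = lora_forward(W0, A, B, x, alpha, r)
```

Before training the adapted layer reproduces the base output; once B moves away from zero, only the output rows that B touches change.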

Where to apply LoRA

The Hu paper applied only to W_q and W_v. Modern practice: apply to all attention and MLP projections (q_proj, k_proj, v_proj, o_proj, up_proj, gate_proj, down_proj). More targets = more adapter params = usually better quality. Lab 02 uses this set.

Choosing r

Typical: 8, 16, 32, 64. Bigger r = more capacity to fit the new task. r = 16 is a great default. For very different downstream tasks (e.g., teaching a new language), r = 64 may help.

3.3 QLoRA (Dettmers et al., 2023)

QLoRA = LoRA on top of a 4-bit quantized base model. Three innovations:

  1. NF4 (NormalFloat-4): a 4-bit datatype whose quantization levels are chosen to be information-theoretically optimal for normal-distributed data. Pretrained weights are approximately N(0, σ²), so NF4 minimizes quantization error in the relevant range. (Standard 4-bit integer quantization wastes bits on values that rarely occur.)
  2. Double quantization: the per-block quantization scales themselves are quantized, saving another ~0.4 bits/param on average.
  3. Paged optimizers: optimizer state pages move between GPU and CPU memory via NVIDIA's Unified Memory, avoiding OOM spikes during gradient checkpointing.
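The block-wise mechanics behind item 1 can be sketched as follows. This is an illustration, not the bitsandbytes implementation: the codebook here is 16 evenly spaced levels used as a stand-in, whereas NF4 would use levels derived from normal-distribution quantiles; quantize_block and dequantize_block are hypothetical helpers, and the block size of 64 is an assumption.

```python
import random

# Stand-in codebook: 16 evenly spaced levels in [-1, 1].
# NF4 would instead use 16 levels derived from normal-distribution quantiles.
LEVELS = [-1.0 + 2.0 * i / 15 for i in range(16)]


def quantize_block(block):
    """Return (absmax scale, list of 4-bit level indices) for one block."""
    scale = max(abs(w) for w in block) or 1.0
    idx = [min(range(16), key=lambda i: abs(LEVELS[i] - w / scale)) for w in block]
    return scale, idx


def dequantize_block(scale, idx):
    """Map level indices back to approximate weights."""
    return [LEVELS[i] * scale for i in idx]


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64)]   # one block of weights
scale, idx = quantize_block(weights)
restored = dequantize_block(scale, idx)
max_err = max(abs(w - q) for w, q in zip(weights, restored))
```

Each block stores 64 four-bit indices plus one scale. Evenly spaced levels spend as many codes on the sparse tails as on the dense center of a bell-shaped distribution, which is the inefficiency NF4 is designed to avoid.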

End result: a 70B model fits on a single A100 80GB for fine-tuning (the paper's headline was 65B on one 48GB GPU), with quality close to 16-bit fine-tuning. Bombshell paper.
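A rough storage estimate shows why the quantized base fits. The sketch assumes the block sizes described in the QLoRA paper (64 weights per first-level block, 256 first-level scales per second-level block); bits_per_param and base_weight_gb are hypothetical helpers, and adapters, optimizer state, and activations are not counted.

```python
def bits_per_param(block=64, double_quant=True, dq_block=256):
    """Storage per weight: 4-bit code plus amortized quantization constants."""
    if double_quant:
        # One 8-bit first-level scale per block, plus one fp32 second-level
        # scale shared by dq_block first-level scales.
        overhead = 8 / block + 32 / (block * dq_block)
    else:
        overhead = 32 / block          # one fp32 scale per block
    return 4 + overhead


def base_weight_gb(n_params, **kwargs):
    """Approximate size of the frozen quantized base in GB (1e9 bytes)."""
    return n_params * bits_per_param(**kwargs) / 8 / 1e9


plain = bits_per_param(double_quant=False)
dq = bits_per_param(double_quant=True)
size_70b = base_weight_gb(70e9)
```

Under these assumptions the 70B base takes well under half of an 80GB card, which leaves the rest for adapters, their optimizer state, and activations.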

In Lab 02 you'll set up QLoRA via:

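import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store base weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat-4 levels
    bnb_4bit_use_double_quant=True,         # also quantize the per-block scales
    bnb_4bit_compute_dtype=torch.bfloat16,  # dequantize to bf16 for matmuls
)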

3.4 Other PEFT methods

  • Prefix tuning / Prompt tuning — train soft "virtual tokens" prepended to inputs. Older, less popular than LoRA now.
  • (IA)³ — scale activations by learned vectors. Tiny but limited capacity.
  • DoRA (Liu et al. 2024) — decomposes weight updates into magnitude + direction; small quality bump over LoRA at same r.

4. Stage 2 — Preference Learning

4.1 The data

Triplets (prompt, chosen_response, rejected_response). Sources:

  • Human annotators ranking pairs (most expensive, highest signal).
  • AI judges (RLAIF) — cheap; quality bounded by judge.
  • Self-rejection sampling — generate multiple, score with a reward model, keep best/worst.
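The third source, scoring several samples and keeping the extremes, can be sketched as below. Everything here is illustrative: make_preference_pair is a hypothetical helper, and toy_score is a deliberately crude stand-in for a real reward model.

```python
def make_preference_pair(prompt, candidates, score_fn):
    """Score sampled responses; return best as chosen, worst as rejected."""
    scored = sorted(candidates, key=lambda y: score_fn(prompt, y))
    return {"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]}


def toy_score(prompt, response):
    """Hypothetical stand-in for a reward model: counts overlapping words."""
    return len(set(prompt.lower().split()) & set(response.lower().split()))


pair = make_preference_pair(
    "Explain attention in transformers",
    [
        "I don't know.",
        "Attention weighs tokens.",
        "Attention in transformers mixes token information.",
    ],
    toy_score,
)
```

The resulting dict has the (prompt, chosen, rejected) shape that preference trainers typically expect; data quality is bounded by whatever plays the role of score_fn.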

4.2 RLHF (PPO) — the original recipe

Three steps:

  1. Train a reward model r_φ(x, y): a small head on top of the SFT model that outputs a scalar. Trained on preferences with the Bradley-Terry loss: $$\mathcal{L}_{RM} = -\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))$$ where y_w is chosen, y_l is rejected.
  2. Use PPO (Proximal Policy Optimization, Schulman et al. 2017) to optimize the SFT model against the reward, with a KL penalty to a frozen reference (the SFT model itself). The objective to maximize is: $$\mathcal{J}_{RLHF} = \mathbb{E}_{y \sim \pi(\cdot|x)}[r_\phi(x, y)] - \beta \, D_{KL}(\pi(\cdot|x) \,\|\, \pi_{\text{ref}}(\cdot|x))$$ The KL prevents the policy from drifting too far and reward-hacking.
  3. Generate rollouts, score with reward model, run PPO updates. Repeat.
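The two scalar formulas in the steps above are small enough to compute by hand. The sketch below uses hypothetical helper names and made-up reward and log-probability values; the KL term is approximated per rollout by the log-ratio of policy to reference, which is a common single-sample estimate rather than the exact divergence.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def bradley_terry_loss(r_chosen, r_rejected):
    """-log sigmoid(r_w - r_l), written as softplus(-(r_w - r_l))."""
    return softplus(-(r_chosen - r_rejected))


def kl_shaped_reward(reward, logp_policy, logp_ref, beta):
    """Reward minus beta times a single-sample KL estimate for one rollout."""
    return reward - beta * (logp_policy - logp_ref)


tie = bradley_terry_loss(1.0, 1.0)          # reward model cannot tell them apart
confident = bradley_terry_loss(3.0, -1.0)   # chosen scored well above rejected
shaped = kl_shaped_reward(reward=2.0, logp_policy=-10.0, logp_ref=-12.0, beta=0.1)
```

A tie costs log 2, and the loss shrinks as the reward model separates chosen from rejected. The shaped reward drops whenever the policy assigns its own rollout a higher log-probability than the reference does.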

PPO is complex: ~7 hyperparams; unstable; needs distributed rollout infrastructure; reward-hacking is real (the model finds adversarial paths to high reward). It also keeps four models in play (policy, frozen reference, reward model, value head), so cost is huge: easily 4× SFT.

4.3 DPO — Direct Preference Optimization (Rafailov et al., 2023)

Insight: the KL-constrained reward objective that PPO optimizes has a closed-form optimal policy, π*(y|x) ∝ π_ref(y|x) · exp(r(x, y)/β). Solving this for r and substituting into the Bradley-Terry loss makes the intractable normalizer cancel, leaving a contrastive loss directly on (chosen, rejected) pairs. No reward model. No rollouts. Just SFT-like training.

Loss:

$$ \mathcal{L}_{DPO} = -\log \sigma\!\left(\beta \log \frac{\pi(y_w | x)}{\pi_{\text{ref}}(y_w | x)} - \beta \log \frac{\pi(y_l | x)}{\pi_{\text{ref}}(y_l | x)}\right) $$

Where π is the trainable policy and π_ref is the frozen SFT model. Intuitively: increase π's probability of chosen relative to ref, decrease for rejected.
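For a single preference pair the loss reduces to a few arithmetic operations on sequence log-probabilities. This is a minimal sketch: dpo_loss is a hypothetical helper, the log-probabilities are made-up numbers, and β = 0.1 is an assumed example value.

```python
import math


def dpo_loss(pi_w, pi_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one pair, given summed sequence log-probabilities."""
    # Implicit reward margin: beta * [(log pi - log pi_ref)_w - (same)_l]
    margin = beta * ((pi_w - ref_w) - (pi_l - ref_l))
    # -log sigmoid(margin), computed stably as log(1 + exp(-margin))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


at_start = dpo_loss(-20.0, -25.0, -20.0, -25.0)   # policy equals reference
improved = dpo_loss(-18.0, -27.0, -20.0, -25.0)   # chosen up, rejected down
```

When the policy equals the reference, the margin is zero and the loss is log 2, which is a handy sanity check for the first training step.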

DPO has largely replaced PPO as the default for new open-model projects in 2024+. Simpler, more stable, often matches or beats PPO. Llama-3 instruct uses DPO alongside SFT and rejection sampling.

4.4 ORPO (Hong et al., 2024)

Combines SFT and preference learning in a single stage. Loss = standard cross-entropy on chosen + a weighted odds-ratio penalty against rejected. No reference model is needed, and the SFT-then-DPO sequence collapses into one training run.
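The combined loss can be sketched for one pair as follows. This follows one reading of the ORPO formulation, in which the likelihood is length-normalized (the exponential of the mean token log-probability) and the penalty is the negative log-sigmoid of the log odds ratio; orpo_loss, log_odds, and the weight lam = 0.1 are assumptions for illustration.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)), where p = exp(avg_logp) is a length-normalized likelihood."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss(avg_logp_chosen, avg_logp_rejected, lam=0.1):
    """Cross-entropy on chosen plus lam times the odds-ratio penalty."""
    nll = -avg_logp_chosen
    ratio = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    # -log sigmoid(ratio), computed stably
    penalty = max(-ratio, 0.0) + math.log1p(math.exp(-abs(ratio)))
    return nll + lam * penalty


same = orpo_loss(-1.0, -1.0)     # chosen and rejected equally likely
better = orpo_loss(-1.0, -2.0)   # rejected already less likely than chosen
```

Only the policy's own log-probabilities appear, which is why no frozen reference model has to be kept in memory.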

4.5 Constitutional AI / RLAIF (Bai et al., 2022 — Anthropic)

Use an LLM to critique and revise outputs against a written "constitution" of principles, generating preference pairs at scale without humans. Anthropic's main alignment recipe.

4.6 References

  • Christiano et al. (2017), Deep RL from Human Preferences — the first RLHF paper.
  • Stiennon et al. (2020), Learning to Summarize with Human Feedback — first compelling RLHF for LLMs.
  • Ouyang et al. (2022), Training Language Models to Follow Instructions with Human Feedback — InstructGPT, the basis of ChatGPT.
  • Schulman et al. (2017), Proximal Policy Optimization Algorithms.
  • Rafailov et al. (2023), Direct Preference Optimization.
  • Hong et al. (2024), ORPO: Monolithic Preference Optimization without Reference Model.
  • Bai et al. (2022), Constitutional AI: Harmlessness from AI Feedback.

5. The lab walkthrough (lab-02-lora-qlora)

5.1 What you'll build

Fine-tune Mistral-7B (or similar 7B base) on Dolly-15k with QLoRA:

  • Load 4-bit quantized base via BitsAndBytesConfig.
  • Configure LoRA with r=16, α=32, applied to attention QKV+O and MLP up/down/gate.
  • Use paged_adamw_8bit optimizer.
  • Train with SFTTrainer for 1 epoch, ~30 minutes on a single A100 40GB.
  • Save the LoRA adapter (~100MB).
  • Inference: load the base + adapter, generate.

5.2 Things to read carefully

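  • The exact target_modules list: this depends on the model architecture. For Llama/Mistral: ["q_proj", "k_proj", "v_proj", "o_proj", "up_proj", "gate_proj", "down_proj"].
  • The prepare_model_for_kbit_training() call: by default it freezes the base weights, upcasts the remaining non-quantized parameters (layer norms, LM head) to fp32 for numerical stability, and turns on gradient checkpointing.
  • The formatting_func and response_template: these control how labels get masked. The response_template marks where the assistant's answer begins, so everything before it is excluded from the loss.
  • The merge step (model.merge_and_unload()): fuses adapter weights into the base for deployment. Merge into a bf16/fp16 copy of the base rather than the 4-bit one to avoid extra quantization error.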
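For orientation, the adapter settings described in 5.1 and the target list above might be expressed with peft's LoraConfig roughly as follows. This is a sketch, not the lab's actual file: the dropout value is an assumption, and the module names apply only to Llama/Mistral-style architectures.

```python
from peft import LoraConfig

# Sketch of the adapter config described above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,            # alpha / r = 2
    lora_dropout=0.05,        # assumption: not specified in the lab text
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "up_proj", "gate_proj", "down_proj",
    ],
)
```

The config is then typically applied to the prepared 4-bit model with peft's get_peft_model, or handed to SFTTrainer through its peft_config argument.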

5.3 Sanity checks

  • Initial loss should be the base model's loss on the format (~2–3).
  • Loss should drop to ~1.0–1.5 by epoch end on Dolly.
  • Generated responses should be grammatical and follow Dolly's tone.

6. When to fine-tune (vs RAG vs prompt)

| Need | Best tool |
| --- | --- |
| Add new factual knowledge | RAG (most cases); fine-tune for very narrow, large, stable domains |
| Change output format / style | Fine-tune (small SFT) |
| Improve general capability | Fine-tune (DPO on preferences) |
| Adapt to a new language | Fine-tune (continued pretraining + SFT) |
| Per-tenant customization | LoRA adapters per tenant; hot-swap at serving |
| Personalization | Usually prompt + retrieval; rarely fine-tune |
| Compliance / safety | Fine-tune (RLHF/DPO with refusal data) |

A common mistake: trying to fine-tune in facts that change weekly. Use RAG.


7. Common interview questions on Phase 6 material

  1. Walk through SFT, RLHF, and DPO. When would you use each?
  2. Derive LoRA's math. Explain r and α.
  3. What's NF4 and why is it better than INT4?
  4. Why does QLoRA let you fine-tune 70B on one GPU?
  5. Compare PPO's KL penalty and DPO's reference model — they're related, how?
  6. What's reward hacking and how do you mitigate it?
  7. When would you fine-tune instead of RAG?
  8. How do you mask labels for SFT so the model doesn't train on the prompt?
  9. Sketch the Bradley-Terry reward model loss.
  10. What's the role of α / r scaling in LoRA at inference?
  11. How would you serve 100 different LoRA adapters in production? (Bridges to Phase 9.)
  12. Why is constitutional AI scalable in a way RLHF isn't?

8. From solid → exceptional

  • Implement LoRA from scratch (no peft): wrap an nn.Linear so its forward adds a low-rank update. Confirm gradient flow only into the adapter.
  • Implement the DPO loss in pure PyTorch (no trl); compute against a tiny preference dataset. Verify against trl reference.
  • Run a side-by-side SFT vs SFT+DPO vs SFT+ORPO on the same base, evaluate with MT-Bench. Report numbers.
  • Implement rejection sampling with reward model: generate 16 responses per prompt, score with a separate RM, keep top-1. Compare to base sampling.
  • Read the Constitutional AI paper and write a one-page summary; sketch how you'd build a small CAI loop on a 1B model.
  • Train multiple LoRA adapters for different tasks; demonstrate hot-swapping at inference (e.g., via peft's set_adapter).

| Day | Activity |
| --- | --- |
| Mon | Read Hu et al. 2021 (LoRA) + Dettmers et al. 2023 (QLoRA) |
| Tue | Read Ouyang et al. 2022 (InstructGPT); skim the PPO algorithm |
| Wed | Read Rafailov et al. 2023 (DPO) carefully; trace the derivation |
| Thu | Lab 02: get QLoRA fine-tune running; save adapter |
| Fri | Inference with adapter; merge; compare base vs fine-tuned outputs |
| Sat | Implement DPO loss from scratch on a toy dataset |
| Sun | Mock interview the 12 questions; whiteboard LoRA |