Lab 02 — QLoRA Fine-Tune of a 7B Model (Solution Walkthrough)

Phase: 6 — Fine-tuning & Instruction Following | Difficulty: ⭐⭐⭐⭐☆ | Time: 3–6 hours (incl. training)

Concept primer: ../HITCHHIKERS-GUIDE.md §LoRA, §QLoRA, §SFT.

Run

pip install -r requirements.txt
huggingface-cli login   # for gated Llama-3
python solution.py

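Hardware: 24 GB GPU (RTX 3090/4090/A5000/A10). For Colab T4 (16 GB), use Qwen/Qwen2-1.5B; the T4 predates Ampere, so it supports neither BF16 nor FlashAttention-2, and the script needs fp16=True and attn_implementation="sdpa" there.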


0. The mission

Fine-tune Llama-3-8B (or Qwen2-7B) on a single 24 GB consumer GPU using QLoRA: a 4-bit base plus LoRA adapters in BF16. The full model needs ~16 GB for weights alone in BF16 (~32 GB in FP32), before gradients and optimizer state; QLoRA shrinks the frozen weights to ~6 GB and the trainable set to ~42M parameters (about 0.5% of the model).
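The weight-memory figures can be checked with quick arithmetic. The sketch below assumes a round 8.0e9 parameters (an illustrative figure, not read from a checkpoint) and counts weights only, with no gradients, activations, or optimizer state.

```python
# Back-of-envelope weight memory for an ~8B-parameter model.
# N_PARAMS is an assumed round figure, not read from a real checkpoint.
N_PARAMS = 8.0e9

BYTES_PER_PARAM = {
    "fp32": 4.0,
    "bf16": 2.0,
    "nf4": 0.5,  # 4 bits per weight, ignoring quantization constants
}


def weight_gb(n_params: float, dtype: str) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes) for a given storage dtype."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gb(N_PARAMS, dtype):.1f} GB")
```

The pure 4-bit figure comes out below the ~6 GB quoted for QLoRA; the gap is plausibly the layers that stay unquantized (embeddings, output head, norms) plus the quantization constants.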

This is the technique that democratized LLM fine-tuning. Most "I fine-tuned a 7B model on my gaming GPU" projects use it or a close variant.


1. The math

1.1 LoRA decomposition

For any linear layer $y = Wx$ with $W \in \mathbb{R}^{d \times k}$, freeze $W$ and add a low-rank update:

$$ y = Wx + BAx, \quad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k) $$

$A$ is initialized to random Gaussian, $B$ to zero — so $BA = 0$ at step 0 (model output is unchanged). With $r = 16$ and $d = k = 4096$, trainable params per layer drop from $16{,}777{,}216$ to $131{,}072$ (a 128× reduction).

A scalar $\alpha / r$ scales the update: $y = Wx + (\alpha / r) BAx$. Convention: $\alpha = 2r$ so the scale is 2.0.
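To make the decomposition concrete, here is a minimal NumPy sketch of the LoRA forward pass with toy dimensions. The names lora_forward and lora_params are illustrative helpers, not part of any library, and the init scale for A is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 64, 48, 4, 8  # toy sizes; the lab uses d = k = 4096, r = 16

W = rng.normal(size=(d, k))              # frozen base weight
A = rng.normal(scale=0.01, size=(r, k))  # Gaussian init (scale is an assumption)
B = np.zeros((d, r))                     # zero init, so BA = 0 at step 0


def lora_forward(x, W, A, B, alpha, r):
    """y = Wx + (alpha / r) * B(Ax); only A and B would receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))


def lora_params(d, k, r):
    """Trainable parameters in one adapted layer: B is d x r, A is r x k."""
    return r * (d + k)


x = rng.normal(size=(k,))
y_base = W @ x
y_lora = lora_forward(x, W, A, B, alpha, r)

print(np.allclose(y_base, y_lora))                 # adapter is a no-op at init
print(lora_params(4096, 4096, 16))                 # params in one adapted 4096 x 4096 layer
print(4096 * 4096 // lora_params(4096, 4096, 16))  # reduction factor vs the full matrix
```

Computing B(Ax) rather than (BA)x avoids ever materializing the full d × k update during training.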

1.2 QLoRA's three tricks

  1. NF4 quantization — a 4-bit data type optimized for normally-distributed weights (which neural-net weights approximately are). Quantization levels are placed at the quantiles of $\mathcal{N}(0, 1)$. Less quantization error than uniform INT4.
  2. Double quantization — quantize the per-block quantization constants themselves. Saves ~0.4 bits/param on top of NF4. Free.
  3. Paged optimizer — use NVIDIA unified memory to swap optimizer states to CPU when GPU memory pressure spikes. Lets you fine-tune without OOM crashes during memory peaks.

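Both the forward and backward pass dequantize the 4-bit weights to BF16 on the fly, so every matmul runs in BF16. The base weights still carry NF4 rounding error, but it is small and the adapters are trained on top of it. The optimizer (AdamW) only updates the LoRA params, so optimizer state is tiny.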
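The first two tricks can be illustrated in plain Python. This is a simplified sketch, not the bitsandbytes implementation: the codebook below puts 16 levels at evenly spaced quantiles of N(0, 1), whereas the real NF4 table is built asymmetrically so that it contains an exact zero. The block sizes in constant_overhead_bits (64 weights per block, FP32 constants, second-level blocks of 256 with 8-bit constants) are the values described in the QLoRA paper; all function names are hypothetical.

```python
from statistics import NormalDist


def quantile_codebook(bits: int = 4) -> list[float]:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    raw = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]


def quantize_block(block: list[float], levels: list[float]):
    """Absmax-scale a block, then store the index of the nearest level."""
    scale = max(abs(v) for v in block) or 1.0
    idx = [
        min(range(len(levels)), key=lambda j: abs(v / scale - levels[j]))
        for v in block
    ]
    return idx, scale


def dequantize_block(idx, scale, levels):
    """Map stored indices back to approximate weights."""
    return [levels[j] * scale for j in idx]


def constant_overhead_bits(block=64, const_bits=32, double_quant=False, block2=256):
    """Bits per parameter spent on storing the quantization constants."""
    if not double_quant:
        return const_bits / block
    # 8-bit first-level constants plus one const_bits constant per block2 of them
    return 8 / block + const_bits / (block * block2)


levels = quantile_codebook()
weights = [0.3, -1.2, 0.05, 2.0]
idx, scale = quantize_block(weights, levels)
restored = dequantize_block(idx, scale, levels)
saving = constant_overhead_bits() - constant_overhead_bits(double_quant=True)
print(restored)
print(f"double quantization saves {saving:.3f} bits/param")
```

Note how the levels crowd together near zero, where most normally distributed weights live, and spread out toward the tails.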


2. Loading the model in 4-bit

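import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed model id for the gated Llama-3 base; swap in Qwen/Qwen2-7B if preferred.
base_model = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_model)

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                       # store the frozen base in 4 bits
    bnb_4bit_quant_type="nf4",               # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,          # also quantize the block constants
    bnb_4bit_compute_dtype=torch.bfloat16,   # dtype used for the matmuls
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb,
    device_map="auto",
    attn_implementation="flash_attention_2",
)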
  • bnb_4bit_compute_dtype=torch.bfloat16 — dequantize to BF16 for the matmul. (FP16 also works but BF16 is more stable.)
  • device_map="auto" — transformers' accelerate-based dispatcher places layers on available GPUs.
  • flash_attention_2 — ~2× faster + much lower memory. Needs the flash-attn package and an Ampere-or-newer GPU; strongly recommended for long-context fine-tuning.
from peft import prepare_model_for_kbit_training

model.config.use_cache = False                  # KV cache is incompatible with grad checkpointing
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
  • Gradient checkpointing — trade compute for memory. Recompute activations during backward instead of storing them. Cost: ~30% slower; benefit: ~5× less activation memory — essential for 8B at 24 GB.
  • prepare_model_for_kbit_training — freezes the base weights, upcasts the remaining 16-bit parameters (e.g. the norm layers) to FP32 for stability, and makes the embedding outputs require grad so gradients can flow back through the frozen base to the adapters.
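Since the attention backend depends on an optional package, one way to keep the loading code portable is to choose the backend at runtime. The helper below is a hypothetical sketch (pick_attn_implementation is not a transformers function); it only checks that the flash_attn package is importable, not that the GPU actually supports it.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"  # PyTorch's built-in scaled-dot-product attention


attn_impl = pick_attn_implementation()
print(attn_impl)
# Possible use (assumption): pass attn_implementation=attn_impl to from_pretrained.
```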

3. Attaching LoRA adapters

from peft import LoraConfig, get_peft_model

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,071,016,448 || trainable%: 0.52

Key choices:

  • r=16, alpha=32 — the most common QLoRA settings. alpha = 2r is convention; some prefer alpha = r (scale 1.0). Both work; 2r applies a larger effective update.
  • All linear layers — attention (q/k/v/o) and MLP (gate/up/down). The QLoRA paper reports that adapting every linear layer, not just attention, is what lets LoRA match full fine-tuning quality.
  • lora_dropout=0.05 — small dropout on the LoRA path only (frozen base unaffected). Helps when fine-tuning on small datasets.
  • bias="none" — don't train biases. Could try "lora_only" or "all" but rarely worth it.
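The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the Llama-3-8B shapes (hidden size 4096, MLP width 14336, 32 layers, and grouped-query attention giving k/v projections of width 1024); for another base model the shapes would need to be read from its config. lora_trainable_params is an illustrative helper, not a PEFT function.

```python
# Assumed Llama-3-8B shapes: hidden 4096, MLP 14336, 32 layers,
# grouped-query attention with 8 KV heads of size 128 (KV width 1024).
HIDDEN, MLP, KV, LAYERS = 4096, 14336, 1024, 32

# (in_features, out_features) of each targeted projection
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV),
    "v_proj": (HIDDEN, KV),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_trainable_params(r: int, targets=TARGETS, layers: int = LAYERS) -> int:
    """Each adapted layer adds r * (in + out) parameters (A is r x in, B is out x r)."""
    return layers * sum(r * (i + o) for i, o in targets.values())


ATTENTION_ONLY = {n: TARGETS[n] for n in ("q_proj", "k_proj", "v_proj", "o_proj")}

print(f"{lora_trainable_params(16):,}")                  # all linear layers, r=16
print(f"{lora_trainable_params(16, ATTENTION_ONLY):,}")  # attention-only, r=16
for rank in (4, 64):
    print(rank, f"{lora_trainable_params(rank):,}")
```

The count scales linearly with r, which is handy when budgeting adapter size for a rank sweep.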

4. Dataset & chat template

from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train").select(range(2000))

def format_example(ex):
    msgs = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": ex["instruction"] + ("\n\n" + ex["input"] if ex["input"] else "")},
        {"role": "assistant", "content": ex["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

ds = ds.map(format_example)
  • Use the model's own chat template (apply_chat_template). Llama-3 uses <|begin_of_text|><|start_header_id|>system<|end_header_id|>.... Qwen uses <|im_start|>system\n...<|im_end|>. Hardcoding the wrong template silently destroys quality.
  • 2000 examples are typically enough to teach instruction-following style to a base model. Teaching domain knowledge usually takes far more (10k+ examples).
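To see what a rendered example looks like without downloading a tokenizer, here is an illustrative ChatML-style renderer following the Qwen format quoted above. It is for inspection only and is not a substitute for apply_chat_template; render_chatml and user_content are hypothetical helpers, and the exact whitespace of a real template may differ.

```python
def render_chatml(messages: list[dict]) -> str:
    """Illustrative ChatML-style rendering; NOT a replacement for apply_chat_template."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


def user_content(instruction: str, extra_input: str = "") -> str:
    """Same joining rule as format_example: append the input only if present."""
    return instruction + ("\n\n" + extra_input if extra_input else "")


msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": user_content("Translate to French.", "Good morning")},
    {"role": "assistant", "content": "Bonjour"},
]
print(render_chatml(msgs))
```

Printing one formatted example from the real dataset before training is a cheap way to catch a wrong or missing template.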

5. SFTTrainer setup

from trl import SFTTrainer, SFTConfig

cfg = SFTConfig(
    output_dir="./qlora-out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,                # effective batch = 16
    num_train_epochs=2,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    optim="paged_adamw_8bit",                     # 👈 QLoRA's paged optimizer
    max_seq_length=1024,                           # named max_length in newer TRL releases
    dataset_text_field="text",                     # column produced by format_example
    packing=True,                                  # concat short examples → fill seq
    logging_steps=20,
    save_steps=200,
    report_to="none",
)
trainer = SFTTrainer(model=model, args=cfg, train_dataset=ds)
trainer.train()

Key choices:

  • learning_rate=2e-4 — ~10× higher than typical full fine-tuning rates. The adapter starts as a zero update (B = 0) and has far fewer parameters than the base, so it needs bigger steps to move the output.
  • optim="paged_adamw_8bit" — the 8-bit AdamW from bitsandbytes with paging. Keeps optimizer state at ~25% of FP32 size and survives memory spikes.
  • packing=True — concatenates short examples to fill max_seq_length. Eliminates padding waste. Critical for instruction datasets where most examples are <500 tokens.
  • bf16=True — BF16 forward/backward. (FP16's narrower exponent range makes QLoRA training prone to overflow.)
  • warmup_ratio=0.03 — first 3% of steps are linear warmup. Smaller than pretraining warmup because we're fine-tuning, not training from scratch.
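The batch-size and schedule settings above translate into concrete step counts. The sketch below mirrors the usual linear-warmup plus cosine-decay shape under stated assumptions (single GPU, no packing); it is not the Trainer's own code, and with packing=True the number of sequences, and therefore steps, will be smaller.

```python
import math


def total_steps(n_examples, per_device_bs, grad_accum, epochs):
    """Optimizer steps for a single-GPU run without packing."""
    batches_per_epoch = math.ceil(n_examples / per_device_bs)
    steps_per_epoch = max(batches_per_epoch // grad_accum, 1)
    return steps_per_epoch * epochs


def lr_at(step, max_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup = math.ceil(max_steps * warmup_ratio)
    if step < warmup:
        return peak_lr * step / max(warmup, 1)
    progress = (step - warmup) / max(max_steps - warmup, 1)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


steps = total_steps(n_examples=2000, per_device_bs=2, grad_accum=8, epochs=2)
print("effective batch:", 2 * 8)
print("optimizer steps:", steps)
for s in (0, 4, 8, steps // 2, steps):
    print(s, f"{lr_at(s, steps):.2e}")
```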

6. Saving and merging

trainer.model.save_pretrained("./qlora-out/adapter")

This saves only the LoRA adapter: about 42M parameters, so roughly 80–170 MB depending on whether it is stored in BF16 or FP32. For deployment, you typically merge:

from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "./qlora-out/adapter").merge_and_unload()
merged.save_pretrained("./merged-bf16")
  • Merging requires a non-quantized base — you can't merge a LoRA adapter into a 4-bit base while preserving quality. Load the base in BF16, merge, save.
  • After merging, the model has the same architecture as the base (no adapter overhead at inference).
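Conceptually, merging folds the scaled low-rank update into the base weight, W' = W + (alpha / r) BA, after which the adapter path is no longer needed. The NumPy sketch below checks that equivalence on toy matrices; merge_lora is an illustrative helper, not the PEFT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r, alpha = 32, 24, 4, 8  # toy sizes

W = rng.normal(size=(d, k))  # base weight (non-quantized)
A = rng.normal(size=(r, k))
B = rng.normal(size=(d, r))  # nonzero, as it would be after training


def merge_lora(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W' = W + (alpha / r) * B @ A."""
    return W + (alpha / r) * (B @ A)


W_merged = merge_lora(W, A, B, alpha, r)

x = rng.normal(size=(k,))
y_adapter = W @ x + (alpha / r) * (B @ (A @ x))  # base + adapter path
y_merged = W_merged @ x                          # single matmul

print(np.allclose(y_adapter, y_merged))
print(W_merged.shape == W.shape)
```

This also shows why a quantized base is a poor merge target: the sum W + (alpha / r) BA would have to be re-quantized, adding rounding error the adapter was never trained against.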

7. Expected output

trainable params: 41,943,040 || all params: 8,071,016,448 || trainable%: 0.52

{'loss': 1.4521, 'learning_rate': 6e-05, 'epoch': 0.04}
{'loss': 1.1234, 'learning_rate': 1.99e-04, 'epoch': 0.20}
...
{'loss': 0.8021, 'learning_rate': 2e-06, 'epoch': 1.99}
{'train_runtime': 5400.0, 'train_samples_per_second': 0.74}

Sanity checks:

  • Loss starts around 1.5–2.0 and ends near 0.8–1.0 for typical SFT data.
  • VRAM usage during training: ~14–18 GB on a 24 GB card. If you OOM, lower per_device_train_batch_size to 1 or max_seq_length to 512.
  • Sample from the merged model afterward and compare to the base — the fine-tune should follow instructions in the assistant turn instead of continuing the prompt.
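The loss check can be automated with a few lines. The sketch below runs on the sample log entries shown above; in a real run, similar dictionaries should be available from trainer.state.log_history. loss_summary is a hypothetical helper.

```python
def loss_summary(logs: list[dict]) -> dict:
    """Summarize the logged training loss, skipping entries without a loss key."""
    losses = [entry["loss"] for entry in logs if "loss" in entry]
    return {
        "n": len(losses),
        "first": losses[0],
        "last": losses[-1],
        "decreased": losses[-1] < losses[0],
    }


# Sample entries copied from the expected output above.
logs = [
    {"loss": 1.4521, "learning_rate": 6e-05, "epoch": 0.04},
    {"loss": 1.1234, "learning_rate": 1.99e-04, "epoch": 0.20},
    {"loss": 0.8021, "learning_rate": 2e-06, "epoch": 1.99},
    {"train_runtime": 5400.0, "train_samples_per_second": 0.74},
]

summary = loss_summary(logs)
print(summary)
```

A loss that fails to decrease, or collapses toward zero on a 2000-example dataset, is a signal to recheck the template and learning rate before spending more GPU time.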

8. Common pitfalls

  1. Wrong chat template — silent quality killer. Always use tokenizer.apply_chat_template, never hand-format.
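  2. Forgetting model.config.use_cache = False with grad checkpointing → transformers warns and turns the cache off for you; set it explicitly to keep the logs clean.
  3. load_in_8bit instead of 4-bit — roughly doubles weight memory (~8 GB+ for 8B), leaving little headroom for activations at this batch size and sequence length on 24 GB.
  4. flash_attention_2 not installed — from_pretrained raises an ImportError rather than falling back. Use attn_implementation="sdpa" instead; eager attention costs noticeably more VRAM and throughput.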
  5. Training a chat model on raw text (no chat template) — you wreck the model's existing instruction-following.
  6. Saving the full model instead of the adapter — wastes 16 GB of disk per checkpoint.
  7. Merging at FP16 precision — the adapter was trained with BF16 compute, and casting to FP16 can clip large values. Merge in BF16.

9. Stretch exercises

  • DPO on top of SFT: take your SFT'd model + a preference dataset (e.g., argilla/distilabel-intel-orca-dpo-pairs) and run trl.DPOTrainer. Measure win-rate vs the SFT-only model.
  • Multi-LoRA serving: train two adapters on different domains; load both into one base; route at inference time.
  • Compare ranks: train at r=4, 16, 64. Plot loss vs trainable params. The 4→16 jump should be large; 16→64 small.
  • Compare full FT vs LoRA at same compute: full fine-tune a 1.5B model vs LoRA on a 7B — which is better at the same wall-clock?
  • Eval with lm-eval-harness on MMLU/GSM8K before and after — by how much does instruction tuning hurt raw-knowledge benchmarks (the alignment tax)?
  • Try DoRA or GaLore as alternatives to LoRA: DoRA is a LoRA variant that separates weight magnitude from direction, while GaLore trains all parameters but projects gradients to low rank to save optimizer memory.

10. What this lab proves about you

You can stand up a production fine-tuning pipeline for a 7B+ model on consumer hardware, justify every hyperparameter (rank, alpha, target modules, optimizer choice, packing), and articulate the QLoRA tricks that make it possible. This is the bar for Phase-6, and one of the most in-demand skills in current LLM engineering job postings.