Lab 02 — QLoRA Fine-Tune of a 7B Model (Solution Walkthrough)

Phase: 6 — Fine-tuning & Instruction Following | Difficulty: ⭐⭐⭐⭐☆ | Time: 3–6 hours (incl. training)

Concept primer: ../HITCHHIKERS-GUIDE.md §LoRA, §QLoRA, §SFT.

Run

pip install -r requirements.txt
huggingface-cli login   # for gated Llama-3
python solution.py

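Hardware: 24 GB GPU (RTX 3090/4090/A5000/A10). For Colab T4 (16 GB), use Qwen/Qwen2-1.5B. Note that the T4 lacks native BF16 and FlashAttention-2 support, so expect to fall back to FP16 compute and attn_implementation="sdpa" there.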

Fine-tune Llama-3-8B (or Qwen2-7B) on a single 24 GB consumer GPU using QLoRA: a 4-bit base model plus LoRA adapters in BF16. The 8B base needs ~16 GB for weights alone in BF16 (~32 GB in FP32); QLoRA cuts the frozen weights to ~6 GB and the trainable parameters to ~42M, about 0.5% of the model.

This is the technique that democratized LLM fine-tuning. Every "I fine-tuned a 7B model on my gaming GPU" project uses it.
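
As a back-of-the-envelope check on weight memory, the sketch below multiplies a parameter count by the bits stored per parameter. The 8.03e9 parameter count is an assumption for an 8B-class model, and the 4-bit line is a lower bound: quantization constants and layers that stay unquantized (embeddings, output head, norms) push the real footprint higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed to store n_params weights at the given precision, in GB (1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8.03e9  # assumed parameter count for an 8B-class model

for label, bits in [("FP32", 32), ("BF16", 16), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(N_PARAMS, bits):5.1f} GB")
```

Under these assumptions the script prints roughly 32.1, 16.1, and 4.0 GB. Activations, optimizer state, and the CUDA context come on top of the weights.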

1. The math

1.1 LoRA decomposition

For any linear layer $y = Wx$ with $W \in \mathbb{R}^{d \times k}$, freeze $W$ and add a low-rank update:

$$ y = Wx + BAx, \quad B \in \mathbb{R}^{d \times r}, \; A \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k) $$

$A$ is initialized to random Gaussian, $B$ to zero — so $BA = 0$ at step 0 (model output is unchanged). With $r = 16$ and $d = k = 4096$, trainable params per layer drop from $16{,}777{,}216$ to $131{,}072$ (a 128× reduction).

A scalar $\alpha / r$ scales the update: $y = Wx + (\alpha / r) BAx$. Convention: $\alpha = 2r$ so the scale is 2.0.
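
The counts and the zero-update property can be checked with a few lines of plain Python. This is a toy sketch: the helper names are illustrative and the small matrices stand in for real layer weights.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return d * r + r * k


# Parameter count for the example in the text.
d, k, r = 4096, 4096, 16
full = d * k
lora = lora_param_count(d, k, r)
print(full, lora, full // lora)  # 16777216 131072 128

# A is Gaussian, B is zero, so the update BA vanishes at step 0 (toy sizes).
rng = random.Random(0)
d_small, k_small, r_small = 4, 3, 2
A = [[rng.gauss(0.0, 1.0) for _ in range(k_small)] for _ in range(r_small)]
B = [[0.0] * r_small for _ in range(d_small)]
delta_w = matmul(B, A)  # shape d_small x k_small
assert all(v == 0.0 for row in delta_w for v in row)

alpha = 2 * r     # the alpha = 2r convention
print(alpha / r)  # 2.0
```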

1.2 QLoRA's three tricks

NF4 quantization — a 4-bit data type optimized for normally-distributed weights (which neural-net weights approximately are). Quantization levels are placed at the quantiles of $\mathcal{N}(0, 1)$. Less quantization error than uniform INT4.
Double quantization — quantize the per-block quantization constants themselves. Saves ~0.4 bits/param on top of NF4. Free.
Paged optimizer — use NVIDIA unified memory to swap optimizer states to CPU when GPU memory pressure spikes. Lets you fine-tune without OOM crashes during memory peaks.
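
The ~0.4 bits/param figure for double quantization can be reproduced with simple arithmetic. The block sizes below are assumptions taken from the QLoRA paper's description, not from this lab's code: one FP32 constant per 64 weights, then those constants stored in 8 bits with one FP32 constant per 256 of them.

```python
def constant_overhead_bits(block_size: int, constant_bits: int) -> float:
    """Bits per weight spent on one scaling constant shared by block_size weights."""
    return constant_bits / block_size


# Plain blockwise quantization: one FP32 constant per 64 weights.
plain = constant_overhead_bits(64, 32)

# Double quantization: 8-bit constants per 64 weights, plus one FP32
# second-level constant per 256 first-level constants (64 * 256 weights).
double = constant_overhead_bits(64, 8) + constant_overhead_bits(64 * 256, 32)

saved = plain - double
print(f"{plain:.3f} {double:.3f} {saved:.3f}")  # 0.500 0.127 0.373
```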

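The 4-bit weights are dequantized to BF16 on the fly in both the forward and the backward pass, so every matmul runs in BF16. Quantization itself is lossy, but the QLoRA paper reports that training the adapters on top recovers 16-bit fine-tuning quality. The optimizer (AdamW) only updates LoRA params, so optimizer state is tiny.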
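
To make the quantize/dequantize round trip concrete, here is a simplified NF4-style sketch in plain Python. It places 16 levels at evenly spaced quantiles of a standard normal and scales each block by its absolute maximum. The real NF4 data type in bitsandbytes builds its levels asymmetrically so that zero is represented exactly, so treat this as an illustration of the idea rather than the library's implementation; all function names are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(n_levels: int = 16) -> list[float]:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    # Midpoints (i + 0.5) / n avoid asking for the 0% or 100% quantile.
    raw = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]


def quantize_block(block, levels):
    """Return (absmax, codes): one float scale plus a 4-bit index per weight."""
    absmax = max(abs(w) for w in block) or 1.0  # guard against an all-zero block
    codes = [
        min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
        for w in block
    ]
    return absmax, codes


def dequantize_block(absmax, codes, levels):
    """Map each code back to its level and undo the absmax scaling."""
    return [levels[c] * absmax for c in codes]


levels = normal_quantile_levels()
weights = [0.02, -0.51, 0.33, 1.20, -0.07, 0.00, -1.10, 0.64]
absmax, codes = quantize_block(weights, levels)
restored = dequantize_block(absmax, codes, levels)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(codes, round(max_err, 3))
```

The levels sit close together near zero, where most weights live, and far apart in the tails. That is the reason a quantile-based grid tends to beat a uniform one on normally distributed weights.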

2. Loading the model in 4-bit

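import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

base_model = "meta-llama/Meta-Llama-3-8B"   # assumed Hub ID; swap in a Qwen2 repo if you prefer

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(base_model)   # used later for the chat template
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb,
    device_map="auto",
    attn_implementation="flash_attention_2",
)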

bnb_4bit_compute_dtype=torch.bfloat16: dequantize to BF16 for the matmul. (FP16 also runs but is more prone to overflow; BF16 is the stable choice.)
device_map="auto": transformers' accelerate-based dispatcher places layers on available GPUs.
flash_attention_2: ~2× faster and much lower memory than eager attention. Strongly recommended for long-context fine-tuning; it needs the flash-attn package and an Ampere-or-newer GPU.

from peft import prepare_model_for_kbit_training

model.config.use_cache = False                  # KV cache is incompatible with grad checkpointing
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)

Gradient checkpointing: trades compute for memory by recomputing activations during the backward pass instead of storing them. Cost: ~30% slower; benefit: ~5× less activation memory, which is essential for 8B at 24 GB.
prepare_model_for_kbit_training: freezes the base weights, upcasts the remaining non-quantized parameters (such as the norm layers) to FP32 for stability, and makes the input-embedding outputs require gradients so that backpropagation reaches the adapters through the frozen base when checkpointing is on.

3. Attaching LoRA adapters

from peft import LoraConfig, get_peft_model

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,071,016,448 || trainable%: 0.52

Key choices:

r=16, alpha=32: the most common QLoRA settings. alpha = 2r is convention; some prefer alpha = r (scale 1.0). Both work; 2r is slightly more aggressive.
All linear layers: attention (q/k/v/o) and MLP (gate/up/down). The QLoRA paper reports that adapting every linear layer, not just the attention projections, is what lets LoRA match full fine-tuning quality.
lora_dropout=0.05: small dropout on the LoRA path only (frozen base unaffected). Helps when fine-tuning on small datasets.
bias="none": don't train biases. You could try "lora_only" or "all", but it is rarely worth it.
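
The trainable-parameter count printed by print_trainable_parameters() can be reproduced by hand: each adapted layer adds r × (in_features + out_features) weights. The shapes below are assumptions taken from the public Llama-3-8B configuration (hidden size 4096, MLP size 14336, 8 KV heads of dimension 128, 32 layers), not from this lab's code.

```python
# Assumed Llama-3-8B shapes; adjust for other base models.
HIDDEN, INTERMEDIATE, KV_DIM, N_LAYERS = 4096, 14336, 1024, 32

SHAPES = {  # module name -> (in_features, out_features)
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_trainable_params(target_modules, r):
    """Sum r * (in + out) over the targeted modules, across all layers."""
    per_layer = sum(r * (SHAPES[m][0] + SHAPES[m][1]) for m in target_modules)
    return per_layer * N_LAYERS


all_linear = lora_trainable_params(list(SHAPES), r=16)
attention_only = lora_trainable_params(["q_proj", "k_proj", "v_proj", "o_proj"], r=16)
print(all_linear)      # 41943040
print(attention_only)  # 13631488
```

Targeting all linear layers roughly triples the adapter size compared with attention-only, which is still a small fraction of the 8B base.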

4. Dataset & chat template

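from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train").select(range(2000))

def format_example(ex):
    # Alpaca rows have an instruction, an optional input, and the target output.
    msgs = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": ex["instruction"] + ("\n\n" + ex["input"] if ex["input"] else "")},
        {"role": "assistant", "content": ex["output"]},
    ]
    # tokenize=False returns the rendered string; the trainer tokenizes later.
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

ds = ds.map(format_example)   # replaces the dataset's own "text" column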

Use the model's own chat template (apply_chat_template). Llama-3 uses <|begin_of_text|><|start_header_id|>system<|end_header_id|>.... Qwen uses <|im_start|>system\n...<|im_end|>. Hardcoding the wrong template silently destroys quality.
2000 examples is enough to teach instruction-following style on a base model. For domain knowledge, you need 10k+.
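
To see why the template matters, the sketch below renders the same conversation in the two formats named above. The renderers are hand-written approximations for illustration only and may differ from the official templates in whitespace and special-token details. In training code, keep using tokenizer.apply_chat_template.

```python
def render_llama3(messages):
    """Approximate Llama-3 layout: header tokens around the role, <|eot_id|> after each turn."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n{m['content']}<|eot_id|>"
    return out


def render_chatml(messages):
    """Approximate ChatML layout (used by Qwen): <|im_start|>role ... <|im_end|>."""
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name one prime number."},
    {"role": "assistant", "content": "7"},
]

llama_text = render_llama3(messages)
chatml_text = render_chatml(messages)
print(llama_text)
print(chatml_text)
```

The two strings share no control tokens at all, so a model trained on one format sees the other as unfamiliar text.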

5. SFTTrainer setup

from trl import SFTTrainer, SFTConfig

cfg = SFTConfig(
    output_dir="./qlora-out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,                # effective batch = 2 × 8 = 16
    num_train_epochs=2,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    optim="paged_adamw_8bit",                     # 👈 QLoRA's paged optimizer
    max_seq_length=1024,                          # named max_length in newer TRL releases
    packing=True,                                  # concat short examples → fill seq
    dataset_text_field="text",                    # column produced by format_example
    logging_steps=20,
    save_steps=200,
    report_to="none",
)
trainer = SFTTrainer(model=model, args=cfg, train_dataset=ds)
trainer.train()
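
The batch-size and schedule settings in the config can be sanity-checked by hand. The sketch below estimates the number of optimizer steps and evaluates a linear-warmup-plus-cosine curve. It assumes one GPU and no packing (packing reduces the number of sequences and therefore steps), and the trainer's own step accounting may differ slightly, so treat the numbers as approximate.

```python
import math


def training_steps(n_examples, per_device_bs, grad_accum, epochs, n_gpus=1):
    """Return (effective batch size, total optimizer steps)."""
    effective_batch = per_device_bs * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    return effective_batch, steps_per_epoch * epochs


def cosine_lr(step, total_steps, warmup_steps, peak_lr):
    """Linear warmup from 0 to peak_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


effective_batch, total_steps = training_steps(2000, 2, 8, 2)
warmup_steps = math.ceil(0.03 * total_steps)
print(effective_batch, total_steps, warmup_steps)  # 16 250 8

for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{cosine_lr(step, total_steps, warmup_steps, 2e-4):.2e}")
```

The learning rate reaches its 2e-4 peak after only 8 steps here and spends the rest of the run decaying.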

Key choices:

learning_rate=2e-4: ~10× higher than full fine-tuning. The LoRA matrices start from scratch (A random, B zero) and are the only weights being updated, so they need bigger steps to learn.
optim="paged_adamw_8bit" — the 8-bit AdamW from bitsandbytes with paging. Keeps optimizer state at ~25% of FP32 size and survives memory spikes.
packing=True — concatenates short examples to fill max_seq_length. Eliminates padding waste. Critical for instruction datasets where most examples are <500 tokens.
bf16=True — BF16 forward/backward. (FP16 with QLoRA is unstable.)
warmup_ratio=0.03 — first 3% of steps are linear warmup. Smaller than pretraining warmup because we're fine-tuning, not training from scratch.
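
The idea behind packing can be shown with token IDs alone. This is a minimal sketch of the concatenate-and-chunk approach, assuming each example is terminated with an EOS token and the leftover tail is dropped. TRL's actual packing strategy varies between releases, and the function name here is hypothetical.

```python
def pack_sequences(tokenized, seq_len, eos_id):
    """Concatenate examples (each followed by EOS) and cut into seq_len chunks."""
    stream = []
    for ids in tokenized:
        stream.extend(ids)
        stream.append(eos_id)
    # Drop the trailing remainder that does not fill a whole chunk.
    n_chunks = len(stream) // seq_len
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_chunks)]


examples = [[1, 2, 3], [4, 5], [6, 7, 8, 9], [10]]
packed = pack_sequences(examples, seq_len=4, eos_id=0)
print(packed)  # [[1, 2, 3, 0], [4, 5, 0, 6], [7, 8, 9, 0]]
```

Every position in a packed sequence holds a real token, whereas padding the same four examples to a common length would leave most positions empty.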

6. Saving and merging

trainer.model.save_pretrained("./qlora-out/adapter")

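This saves only the LoRA adapter, not the base model. With ~42M trainable parameters that is on the order of 100 MB, depending on the saved dtype. For deployment, you typically merge: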

from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "./qlora-out/adapter").merge_and_unload()
merged.save_pretrained("./merged-bf16")

Merging requires a non-quantized base — you can't merge a LoRA adapter into a 4-bit base while preserving quality. Load the base in BF16, merge, save.
After merging, the model has the same architecture as the base (no adapter overhead at inference).
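
Merging works because the adapter path is linear: folding the scaled product BA into W gives the same output as running the two paths separately. A toy check in plain Python, with tiny hand-picked matrices standing in for real weights:

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_scaled(a, b, scale):
    """Element-wise a + scale * b for two equally shaped matrices."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen base weight, d=2, k=3
B = [[0.5], [-1.0]]                      # d x r with r=1
A = [[2.0, 0.0, 1.0]]                    # r x k
x = [[1.0], [2.0], [3.0]]                # input as a column vector
alpha, r = 2, 1
scale = alpha / r

# Adapter form: y = Wx + (alpha / r) * B(Ax)
adapter_out = add_scaled(matmul(W, x), matmul(B, matmul(A, x)), scale)

# Merged form: W' = W + (alpha / r) * BA, then y = W'x
merged_W = add_scaled(W, matmul(B, A), scale)
merged_out = matmul(merged_W, x)

print(adapter_out, merged_out)  # [[10.0], [-11.0]] [[10.0], [-11.0]]
```

In real models the equality holds only up to floating-point rounding, which is one reason to merge into a full-precision base rather than a quantized one.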

7. Expected output

trainable params: 41,943,040 || all params: 8,071,016,448 || trainable%: 0.52

{'loss': 1.4521, 'learning_rate': 6e-05, 'epoch': 0.04}
{'loss': 1.1234, 'learning_rate': 1.99e-04, 'epoch': 0.20}
...
{'loss': 0.8021, 'learning_rate': 2e-06, 'epoch': 1.99}
{'train_runtime': 5400.0, 'train_samples_per_second': 0.74}

Sanity checks:

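Loss starts around 1.5–2.0 and ends near 0.8–1.0 for typical SFT data. Each logged value is the mean over the last logging_steps steps, so the first entry already sits below the step-0 loss.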
VRAM usage during training: ~14–18 GB on a 24 GB card. If you OOM, lower per_device_train_batch_size to 1 or max_seq_length to 512.
Sample from the merged model afterward and compare to the base — the fine-tune should follow instructions in the assistant turn instead of continuing the prompt.

8. Common pitfalls

Wrong chat template — silent quality killer. Always use tokenizer.apply_chat_template, never hand-format.
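Forgetting model.config.use_cache = False with grad checkpointing: recent transformers versions log a warning and disable the KV cache for you. Set it explicitly so the behavior is obvious and the logs stay clean.
load_in_8bit instead of 4-bit: the 8-bit base takes roughly twice the weight memory of the 4-bit one, which leaves much less headroom for activations on a 24 GB card. Expect OOMs at the batch size and sequence length used here.
flash_attention_2 not installed: from_pretrained raises an ImportError instead of falling back silently. Install flash-attn or switch to attn_implementation="sdpa"; plain eager attention roughly doubles VRAM and halves throughput.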
Training a chat model on raw text (no chat template) — you wreck the model's existing instruction-following.
Saving the full model instead of the adapter — wastes 16 GB of disk per checkpoint.
Merging at FP16 precision — quality loss vs BF16. Always merge in BF16.
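
One way to avoid the FlashAttention pitfall is to choose the attention backend at runtime. The helper below is a hypothetical sketch that only checks whether the flash_attn package is importable. It does not verify that the GPU supports FlashAttention-2, and it assumes a transformers version that accepts "sdpa" as an attn_implementation value.

```python
import importlib.util


def pick_attn_implementation() -> str:
    """Return "flash_attention_2" if flash_attn is importable, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_impl = pick_attn_implementation()
print(attn_impl)
```

The result can then be passed as attn_implementation=attn_impl when loading the model.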

9. Stretch exercises

DPO on top of SFT: take your SFT'd model + a preference dataset (e.g., argilla/distilabel-intel-orca-dpo-pairs) and run trl.DPOTrainer. Measure win-rate vs the SFT-only model.
Multi-LoRA serving: train two adapters on different domains; load both into one base; route at inference time.
Compare ranks: train at r=4, 16, 64. Plot loss vs trainable params. The 4→16 jump should be large; 16→64 small.
Compare full FT vs LoRA at same compute: full fine-tune a 1.5B model vs LoRA on a 7B — which is better at the same wall-clock?
Eval with lm-eval-harness on MMLU/GSM8K before and after — by how much does instruction tuning hurt raw-knowledge benchmarks (the alignment tax)?
Try GaLore or DoRA as alternatives to LoRA — newer parameter-efficient methods with slightly different tradeoffs.

10. What this lab proves about you

You can stand up a production fine-tuning pipeline for a 7B+ model on consumer hardware, justify every hyperparameter (rank, alpha, target modules, optimizer choice, packing), and articulate the QLoRA tricks that make it possible. This is the bar for Phase-6 — and it's the most-demanded skill in current LLM engineering job postings.
