Lab 02 — QLoRA Fine-Tune of a 7B Model (Solution Walkthrough)
Phase: 6 — Fine-tuning & Instruction Following | Difficulty: ⭐⭐⭐⭐☆ | Time: 3–6 hours (incl. training)
Concept primer: ../HITCHHIKERS-GUIDE.md §LoRA, §QLoRA, §SFT.
Run
pip install -r requirements.txt
huggingface-cli login # for gated Llama-3
python solution.py
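Hardware: 24 GB GPU (RTX 3090/4090/A5000/A10). For Colab T4 (16 GB), use Qwen/Qwen2-1.5B. The T4 has no native BF16 and does not support FlashAttention-2, so on that card also switch to FP16 compute and attn_implementation="sdpa".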
0. The mission
Fine-tune Llama-3-8B (or Qwen2-7B) on a single 24 GB consumer GPU using QLoRA: a 4-bit base model plus LoRA adapters trained in BF16. The unquantized model needs ~16 GB for weights alone in BF16 (~32 GB in FP32), before any gradients or optimizer state; QLoRA cuts the weights to ~6 GB and the trainable parameters to ~42M (about 0.5% of the model).
This is the technique that democratized LLM fine-tuning. Every "I fine-tuned a 7B model on my gaming GPU" project uses it.
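The weight-memory figures above can be checked with a few lines of arithmetic. This is a rough sketch, not a measurement: the parameter count, the vocabulary size, and the helper name `weight_gb` are assumptions for illustration, and real usage adds quantization constants, activations, and CUDA overhead on top.

```python
def weight_gb(n_params: float, bits_per_param: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes) at a given precision."""
    return n_params * bits_per_param / 8 / 1e9

# Assumed figures for Llama-3-8B: ~8.03B parameters in total, of which
# ~1.05B sit in the input embedding and the LM head (128,256 x 4,096 each).
TOTAL = 8.03e9
EMBED_AND_HEAD = 2 * 128_256 * 4_096

fp32 = weight_gb(TOTAL, 32)
bf16 = weight_gb(TOTAL, 16)
# Assumption: only the transformer-block linear layers are stored in 4-bit;
# the embedding and LM head stay in BF16.
qlora = weight_gb(TOTAL - EMBED_AND_HEAD, 4) + weight_gb(EMBED_AND_HEAD, 16)

print(f"FP32 {fp32:.1f} GB | BF16 {bf16:.1f} GB | 4-bit base {qlora:.1f} GB")
```

Under these assumptions the three figures come out near 32 GB, 16 GB, and 5.6 GB, which is where the "~6 GB" estimate for the quantized base comes from.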
1. The math
1.1 LoRA decomposition
For any linear layer $y = Wx$ with $W \in \mathbb{R}^{d \times k}$, freeze $W$ and add a low-rank update:
$$ y = Wx + BAx, \quad B \in \mathbb{R}^{d \times r}, ; A \in \mathbb{R}^{r \times k}, ; r \ll \min(d, k) $$
$A$ is initialized to random Gaussian, $B$ to zero — so $BA = 0$ at step 0 (model output is unchanged). With $r = 16$ and $d = k = 4096$, trainable params per layer drop from $16{,}777{,}216$ to $131{,}072$ (a 128× reduction).
A scalar $\alpha / r$ scales the update: $y = Wx + (\alpha / r) BAx$. Convention: $\alpha = 2r$ so the scale is 2.0.
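The decomposition can be sketched in plain Python on a toy layer. The helper names (`matmul`, `lora_forward`, `lora_param_counts`) and the tiny dimensions are illustrative assumptions; real implementations use tensor libraries, but the arithmetic is the same.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_forward(W, A, B, x, alpha, r):
    """y = W x + (alpha / r) * B (A x), with x given as a column vector."""
    base = matmul(W, x)
    update = matmul(B, matmul(A, x))
    scale = alpha / r
    return [[b[0] + scale * u[0]] for b, u in zip(base, update)]

def lora_param_counts(d, k, r):
    """(full, lora) trainable parameter counts for one d x k linear layer."""
    return d * k, r * (d + k)

random.seed(0)
d, k, r, alpha = 6, 5, 2, 4
W = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen weight
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # Gaussian init
B = [[0.0] * r for _ in range(d)]                                  # zero init
x = [[random.gauss(0, 1)] for _ in range(k)]

# At step 0 the product BA is zero, so the adapted layer equals the base layer.
assert lora_forward(W, A, B, x, alpha, r) == matmul(W, x)

full, lora = lora_param_counts(4096, 4096, 16)
print(full, lora, full // lora)  # 16777216 131072 128
```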
1.2 QLoRA's three tricks
- NF4 quantization — a 4-bit data type optimized for normally-distributed weights (which neural-net weights approximately are). Quantization levels are placed at the quantiles of $\mathcal{N}(0, 1)$. Less quantization error than uniform INT4.
- Double quantization — quantize the per-block quantization constants themselves. Saves ~0.4 bits/param on top of NF4. Free.
- Paged optimizer — use NVIDIA unified memory to swap optimizer states to CPU when GPU memory pressure spikes. Lets you fine-tune without OOM crashes during memory peaks.
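The first two tricks can be illustrated with a simplified sketch. It is not the bitsandbytes implementation: the real NF4 codebook is built asymmetrically so that zero is represented exactly, whereas the hypothetical `quantile_levels` below just takes 16 evenly spaced quantiles of $\mathcal{N}(0, 1)$. The block sizes in the double-quantization arithmetic (64 weights per constant, 256 constants per second-level block) are assumed from the QLoRA paper's defaults.

```python
from statistics import NormalDist

def quantile_levels(n_levels: int = 16):
    """n_levels values at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    nd = NormalDist()
    # Offset by half a step so no quantile sits at probability 0 or 1.
    qs = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]

def quantize_block(block, levels):
    """Absmax-scale a block into [-1, 1], then store the index of the nearest level."""
    absmax = max(abs(w) for w in block) or 1.0
    codes = [min(range(len(levels)), key=lambda i: abs(levels[i] - w / absmax))
             for w in block]
    return codes, absmax

def dequantize_block(codes, absmax, levels):
    """Map 4-bit codes back to approximate weights."""
    return [levels[c] * absmax for c in codes]

levels = quantile_levels()
block = [0.02, -0.5, 0.31, 0.9, -0.07, 0.0, -0.88, 0.15]  # made-up weights
codes, absmax = quantize_block(block, levels)
restored = dequantize_block(codes, absmax, levels)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(codes, round(max_err, 3))

# Overhead of the per-block constants, in bits per parameter.
plain = 32 / 64                     # one FP32 constant per 64 weights
double = 8 / 64 + 32 / (64 * 256)   # 8-bit constants + FP32 second-level constants
print(round(plain - double, 3))     # 0.373
```

Because the levels are denser near zero, small weights (the majority under a normal distribution) are rounded more finely than they would be on a uniform INT4 grid.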
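Both the forward and the backward pass dequantize the 4-bit weights to BF16 on the fly, so every matmul runs in BF16. Dequantization adds no error beyond the one-time NF4 rounding of the frozen base. The optimizer (AdamW) updates only the LoRA parameters, so optimizer state is tiny.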
2. Loading the model in 4-bit
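import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed Hub id for the gated Llama-3 base; solution.py may set this differently.
base_model = "meta-llama/Meta-Llama-3-8B"  # or "Qwen/Qwen2-7B"
tokenizer = AutoTokenizer.from_pretrained(base_model)

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb,
    device_map="auto",
    attn_implementation="flash_attention_2",
)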
- bnb_4bit_compute_dtype=torch.bfloat16: dequantize to BF16 for the matmul. (FP16 also works, but BF16 is more stable.)
- device_map="auto": transformers' accelerate-based dispatcher places layers on the available GPUs.
- flash_attention_2: ~2× faster with much lower memory; in practice needed for long-context fine-tuning. It requires the separate flash-attn package and an Ampere-or-newer GPU.
from peft import prepare_model_for_kbit_training

model.config.use_cache = False  # KV cache is incompatible with gradient checkpointing
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
- Gradient checkpointing: trade compute for memory. Recompute activations during backward instead of storing them. Cost: ~30% slower; benefit: roughly 5× less activation memory, which is essential for 8B at 24 GB.
- prepare_model_for_kbit_training: freezes the base, upcasts the remaining non-quantized parameters (such as LayerNorm weights) to FP32 for stability, and enables requires_grad on the input-embedding outputs (so gradients flow back through the frozen base).
3. Attaching LoRA adapters
from peft import LoraConfig, get_peft_model

lora = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,071,016,448 || trainable%: 0.52
Key choices:
- r=16, alpha=32: the most common QLoRA settings. alpha = 2r is the convention; some prefer alpha = r (scale 1.0). Both work; 2r is slightly more aggressive.
- All linear layers: attention (q/k/v/o) and MLP (gate/up/down). The QLoRA paper found that adapting all linear layers, not just attention, is needed to match full fine-tuning quality.
- lora_dropout=0.05: small dropout on the LoRA path only (the frozen base is unaffected). Helps when fine-tuning on small datasets.
- bias="none": don't train biases. You could try "lora_only" or "all", but it is rarely worth it.
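The trainable-parameter count printed above can be reproduced by hand. The sketch below assumes the Llama-3-8B layer shapes (hidden size 4096, MLP size 14336, 32 decoder layers, and grouped-query attention that projects keys and values to 1024 dimensions); the function name `lora_trainable_params` is made up for illustration.

```python
# Assumed Llama-3-8B shapes: hidden 4096, MLP 14336, 32 layers,
# 8 KV heads of dim 128 (so k_proj / v_proj map 4096 -> 1024).
HIDDEN, KV_DIM, MLP, LAYERS = 4096, 1024, 14336, 32

# (in_features, out_features) of each targeted projection in one decoder layer.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}

def lora_trainable_params(r, modules=TARGETS, layers=LAYERS):
    """Each adapted layer adds A (r x in) and B (out x r): r * (in + out) params."""
    per_layer = sum(r * (fan_in + fan_out) for fan_in, fan_out in modules.values())
    return per_layer * layers

attention_only = {name: TARGETS[name] for name in ("q_proj", "k_proj", "v_proj", "o_proj")}

print(lora_trainable_params(16))                  # 41943040
print(lora_trainable_params(16, attention_only))  # 13631488
for rank in (4, 16, 64):
    print(rank, lora_trainable_params(rank))
```

The count scales linearly with the rank, and the three MLP projections account for roughly two thirds of the adapter parameters.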
4. Dataset & chat template
from datasets import load_dataset

ds = load_dataset("tatsu-lab/alpaca", split="train").select(range(2000))

def format_example(ex):
    # Alpaca rows have an optional "input" field; append it to the instruction when present.
    user = ex["instruction"] + ("\n\n" + ex["input"] if ex["input"] else "")
    msgs = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": user},
        {"role": "assistant", "content": ex["output"]},
    ]
    return {"text": tokenizer.apply_chat_template(msgs, tokenize=False)}

ds = ds.map(format_example)
- Use the model's own chat template (apply_chat_template). Llama-3 uses <|begin_of_text|><|start_header_id|>system<|end_header_id|>.... Qwen uses <|im_start|>system\n...<|im_end|>. Hardcoding the wrong template silently destroys quality.
- 2000 examples is enough to teach instruction-following style to a base model. Teaching domain knowledge typically takes far more (10k+).
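To see why the template matters, the sketch below renders the same conversation in two layouts. These renderers are approximations written for illustration only (the real templates ship with each tokenizer and handle details such as whitespace trimming), so training code should keep calling apply_chat_template. The function names and the sample messages are hypothetical.

```python
def render_llama3(messages):
    """Approximate Llama-3 chat layout (illustration only)."""
    out = "<|begin_of_text|>"
    for m in messages:
        out += (f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n"
                f"{m['content']}<|eot_id|>")
    return out

def render_chatml(messages):
    """Approximate ChatML layout, the style Qwen uses (illustration only)."""
    return "".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
                   for m in messages)

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Name a prime number."},
    {"role": "assistant", "content": "7"},
]

llama_text = render_llama3(msgs)
chatml_text = render_chatml(msgs)
print(llama_text)
print(chatml_text)

# The two layouts share no special tokens, so text formatted for one model
# looks like ordinary, unstructured text to the other.
assert "<|im_start|>" not in llama_text
assert "<|start_header_id|>" not in chatml_text
```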
5. SFTTrainer setup
from trl import SFTTrainer, SFTConfig

# Argument names follow the TRL release this lab was written against; newer
# releases rename some of them (e.g. max_seq_length), so check your version.
cfg = SFTConfig(
    output_dir="./qlora-out",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch = 2 * 8 = 16
    num_train_epochs=2,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    bf16=True,
    optim="paged_adamw_8bit",       # 👈 QLoRA's paged optimizer
    max_seq_length=1024,
    packing=True,                   # concat short examples → fill seq
    dataset_text_field="text",      # column produced by format_example
    logging_steps=20,
    save_steps=200,
    report_to="none",
)
trainer = SFTTrainer(model=model, args=cfg, train_dataset=ds)
trainer.train()
Key choices:
- learning_rate=2e-4: ~10× higher than typical full fine-tuning. The adapter update starts at zero (B = 0) and only a small low-rank subspace is trained, so larger steps are needed to move the model.
- optim="paged_adamw_8bit": the 8-bit AdamW from bitsandbytes with paging. Keeps optimizer state at ~25% of FP32 size and survives memory spikes.
- packing=True: concatenates short examples to fill max_seq_length. Eliminates padding waste. Critical for instruction datasets where most examples are <500 tokens.
- bf16=True: BF16 forward/backward. (FP16 tends to be less stable with QLoRA.)
- warmup_ratio=0.03: the first 3% of steps are linear warmup. Smaller than pretraining warmup because we're fine-tuning, not training from scratch.
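The idea behind packing can be sketched as concatenate-and-chunk. This is a simplified illustration, not TRL's implementation (which may use a different strategy depending on the version); the function names and the fake token ids are assumptions.

```python
def pack(tokenized_examples, max_seq_length, eos_id):
    """Concatenate examples (each followed by EOS) and cut into full-length chunks."""
    stream = []
    for ids in tokenized_examples:
        stream.extend(ids)
        stream.append(eos_id)
    n_full = len(stream) // max_seq_length  # the trailing partial chunk is dropped
    return [stream[i * max_seq_length:(i + 1) * max_seq_length] for i in range(n_full)]

def padded_utilization(tokenized_examples, max_seq_length):
    """Fraction of real tokens when every example is padded to max_seq_length."""
    real = sum(min(len(ids), max_seq_length) for ids in tokenized_examples)
    return real / (len(tokenized_examples) * max_seq_length)

# Hypothetical tokenized examples of varying length (ids are placeholders).
examples = [[1] * 120, [2] * 300, [3] * 75, [4] * 410, [5] * 96]

chunks = pack(examples, max_seq_length=256, eos_id=0)
print(len(chunks), [len(c) for c in chunks])          # 3 [256, 256, 256]
print(round(padded_utilization(examples, 1024), 3))   # 0.196
```

With padding, under 20% of each 1024-token row in this toy batch would be real tokens; packed rows contain no padding at all, at the cost of examples sometimes being split across chunk boundaries.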
6. Saving and merging
trainer.model.save_pretrained("./qlora-out/adapter")
This saves only the LoRA adapter: ~42M parameters, on the order of 100 MB on disk depending on the saved dtype. For deployment, you typically merge:
from peft import PeftModel
base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)
merged = PeftModel.from_pretrained(base, "./qlora-out/adapter").merge_and_unload()
merged.save_pretrained("./merged-bf16")
- Merging requires a non-quantized base — you can't merge a LoRA adapter into a 4-bit base while preserving quality. Load the base in BF16, merge, save.
- After merging, the model has the same architecture as the base (no adapter overhead at inference).
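What merging does to a single weight matrix can be sketched on a toy example: fold the scaled low-rank product into the base weight, $W' = W + (\alpha / r) BA$, after which the adapter path is no longer needed. The helper names and the small matrices are illustrative assumptions, not PEFT internals.

```python
def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def merge_lora(W, A, B, alpha, r):
    """Return W' = W + (alpha / r) * B A."""
    scale = alpha / r
    BA = matmul(B, A)
    return [[w + scale * u for w, u in zip(w_row, u_row)]
            for w_row, u_row in zip(W, BA)]

W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (d = k = 2)
A = [[0.5, -1.0]]              # r x k, with r = 1
B = [[2.0], [1.0]]             # d x r
alpha, r = 2, 1
x = [[1.0], [1.0]]

merged = merge_lora(W, A, B, alpha, r)
print(merged)  # [[3.0, -2.0], [4.0, 2.0]]

# Output through the adapter path: W x + (alpha / r) * B (A x)
adapter_out = [[b[0] + (alpha / r) * u[0]]
               for b, u in zip(matmul(W, x), matmul(B, matmul(A, x)))]

# The merged matrix alone gives the same result.
assert matmul(merged, x) == adapter_out
```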
7. Expected output
trainable params: 41,943,040 || all params: 8,071,016,448 || trainable%: 0.52
{'loss': 1.4521, 'learning_rate': 6e-05, 'epoch': 0.04}
{'loss': 1.1234, 'learning_rate': 1.99e-04, 'epoch': 0.20}
...
{'loss': 0.8021, 'learning_rate': 2e-06, 'epoch': 1.99}
{'train_runtime': 5400.0, 'train_samples_per_second': 0.74}
Sanity checks:
- Loss starts near 2.0, ends near 0.8–1.0 for typical SFT data.
- VRAM usage during training: ~14–18 GB on a 24 GB card. If you OOM, lower per_device_train_batch_size to 1 or max_seq_length to 512.
- Sample from the merged model afterward and compare to the base: the fine-tune should follow instructions in the assistant turn instead of continuing the prompt.
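The learning_rate column in the log rises during warmup and then decays, which is the shape produced by warmup_ratio=0.03 with a cosine scheduler. A minimal sketch of that shape follows; the function name `lr_at` and the total of 250 optimizer steps are assumptions for illustration (the real step count depends on how many packed sequences the dataset yields).

```python
import math

def lr_at(step, total_steps, peak_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

TOTAL_STEPS = 250  # hypothetical number of optimizer steps
for step in (0, 4, 8, 125, 250):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

A log whose learning rate never climbs to the configured peak, or never decays, usually means the scheduler arguments were not picked up.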
8. Common pitfalls
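- Wrong chat template: silent quality killer. Always use tokenizer.apply_chat_template, never hand-format.
- Forgetting model.config.use_cache = False with gradient checkpointing: transformers logs a warning and disables the cache for you during training. Set it explicitly to keep the logs clean.
- load_in_8bit instead of 4-bit: 8-bit weights roughly double the base-model footprint, leaving too little headroom to train an 8B model in 24 GB (fine for inference only).
- flash_attention_2 not installed: from_pretrained raises an ImportError. Install flash-attn, or set attn_implementation to "sdpa" or "eager" and expect higher VRAM use and lower throughput.
- Training a chat model on raw text (no chat template): you wreck the model's existing instruction-following.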
- Saving the full model instead of the adapter — wastes 16 GB of disk per checkpoint.
- Merging at FP16 precision — quality loss vs BF16. Always merge in BF16.
9. Stretch exercises
- DPO on top of SFT: take your SFT'd model + a preference dataset (e.g., argilla/distilabel-intel-orca-dpo-pairs) and run trl.DPOTrainer. Measure win-rate vs the SFT-only model.
- Multi-LoRA serving: train two adapters on different domains; load both into one base; route at inference time.
- Compare ranks: train at r=4, 16, 64. Plot loss vs trainable params. The 4→16 jump should be large; 16→64 small.
- Compare full FT vs LoRA at same compute: full fine-tune a 1.5B model vs LoRA on a 7B. Which is better at the same wall-clock?
- Eval with lm-eval-harness on MMLU/GSM8K before and after: by how much does instruction tuning hurt raw-knowledge benchmarks (the alignment tax)?
- Try GaLore or DoRA as alternatives to LoRA: newer memory- or parameter-efficient methods with slightly different tradeoffs.
10. What this lab proves about you
You can stand up a production fine-tuning pipeline for a 7B+ model on consumer hardware, justify every hyperparameter (rank, alpha, target modules, optimizer choice, packing), and articulate the QLoRA tricks that make it possible. This is the bar for Phase-6 — and it's the most-demanded skill in current LLM engineering job postings.