Lab 04 — Transfer Learning & Fine-Tuning
Phase 3: PyTorch | Week 9
Transfer learning is one of the most impactful techniques in practical computer vision. Most production models start from pretrained weights, very often ImageNet ones, so you should know each variant and its tradeoffs cold.
Learning Objectives
- Understand feature extraction vs full fine-tuning vs discriminative learning rates
- Fine-tune a pretrained ResNet-50 on a new task
- Implement progressive unfreezing (ULMFiT-style for vision)
- Quantify how much data you need for each transfer strategy
- Handle domain gap between source and target datasets
Theory
Why Transfer Learning Works
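ImageNet-pretrained networks learn a hierarchy of reusable features (stage names below follow the ResNet paper; torchvision exposes conv2_x through conv5_x as layer1 through layer4):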
- Early layers (conv1-conv3): edges, colors, textures — universal
- Mid layers (conv4): parts, patterns — semi-universal
- Late layers (conv5, FC): high-level semantics — task-specific
Reusing early/mid layers provides a strong initialization, especially when target data is limited.
Three Transfer Strategies
| Strategy | When | Trainable Params | Data Needed |
|---|---|---|---|
| Feature extraction | < 1K images, similar domain | Only new head | Very little |
| Partial fine-tuning | 1K-10K images | Last N layers + head | Moderate |
| Full fine-tuning | > 10K images, different domain | All layers | Most |
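The feature-extraction row can be sketched as below. This is a minimal illustration, not the lab's own build_feature_extractor(): the helper name `build_head_only_model` and the `TinyNet` stand-in are hypothetical, and the code assumes the classifier lives in an `fc` attribute, as it does in torchvision ResNets. In practice you would pass a pretrained ResNet-50 instead of the stand-in.

```python
import torch
from torch import nn


def build_head_only_model(backbone: nn.Module, num_classes: int) -> nn.Module:
    """Freeze every existing parameter, then attach a fresh trainable head."""
    for param in backbone.parameters():
        param.requires_grad = False
    # Assumption: the classifier is stored in `backbone.fc` (true for torchvision ResNets).
    # A newly created nn.Linear has requires_grad=True by default.
    backbone.fc = nn.Linear(backbone.fc.in_features, num_classes)
    return backbone


class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained ResNet; only the attribute names match."""

    def __init__(self) -> None:
        super().__init__()
        self.layer1 = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())
        self.layer2 = nn.Sequential(nn.Conv2d(8, 16, 3, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(16, 1000)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.layer2(self.layer1(x))
        return self.fc(torch.flatten(self.pool(x), 1))


model = build_head_only_model(TinyNet(), num_classes=5)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)  # ['fc.weight', 'fc.bias']

# Hand only the trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```

Setting requires_grad to False does not stop BatchNorm layers from updating their running statistics in train mode; calling .eval() on the frozen blocks is a common way to hold those fixed as well.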
Discriminative Learning Rates
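Earlier layers hold more general features and need smaller updates, so each layer group gets its own learning rate:
$$\eta_k = \frac{\eta_{\text{base}}}{\text{decay}^{(L-k)}}$$
Here $k = 1, \dots, L$ indexes the layer groups from earliest to the head ($k = L$), so the head trains at $\eta_{\text{base}}$. A typical decay is about 3. With base LR = 1e-3, the head gets 1e-3, the last block gets 3.3e-4, the block before it gets 1.1e-4, and so on.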
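import torch
import torchvision

# Assumes a torchvision ResNet-50; replace model.fc with a task-specific head before training.
model = torchvision.models.resnet50(weights=torchvision.models.ResNet50_Weights.DEFAULT)
# One param group per stage, with the LR growing toward the head.
# The stem (conv1, bn1) is in no group, so this optimizer never updates it.
param_groups = [
    {'params': model.layer1.parameters(), 'lr': 1e-5},
    {'params': model.layer2.parameters(), 'lr': 3e-5},
    {'params': model.layer3.parameters(), 'lr': 1e-4},
    {'params': model.layer4.parameters(), 'lr': 3e-4},
    {'params': model.fc.parameters(), 'lr': 1e-3},
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=1e-4)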
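The learning rates in the listing above are rounded by hand. A small helper can generate them directly from the formula. This is a sketch: the function name `discriminative_lrs` is hypothetical, and the group names are only labels for printing.

```python
def discriminative_lrs(base_lr: float, decay: float, num_groups: int) -> list[float]:
    """Return one LR per layer group, ordered from the earliest group (k=1) to the head (k=L)."""
    L = num_groups
    return [base_lr / decay ** (L - k) for k in range(1, L + 1)]


lrs = discriminative_lrs(base_lr=1e-3, decay=3.0, num_groups=5)
for name, lr in zip(["layer1", "layer2", "layer3", "layer4", "fc"], lrs):
    print(f"{name}: {lr:.2e}")
# layer1: 1.23e-05
# layer2: 3.70e-05
# layer3: 1.11e-04
# layer4: 3.33e-04
# fc: 1.00e-03
```

The hand-written values (1e-5 through 1e-3) are close to, but not exactly, what a decay of 3 produces. Either set is a reasonable starting point.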
Progressive Unfreezing
- Freeze all except head → train 1-2 epochs
- Unfreeze last block → train 1-2 epochs
- Unfreeze more blocks → train with lower LR
- Unfreeze all → fine-tune at very low LR
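This limits catastrophic forgetting of pretrained knowledge: the large, noisy gradients from a randomly initialized head never reach the backbone, because the backbone stays frozen until the head has settled.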
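The unfreezing steps above can be sketched as a stage function plus a schedule. Everything named here is an assumption for illustration: `TinyNet` is a hypothetical stand-in that only mirrors torchvision ResNet attribute names, and the (stage, LR) pairs are placeholder values, not tuned settings.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in that mirrors torchvision ResNet attribute names."""

    def __init__(self, num_classes: int = 5) -> None:
        super().__init__()
        self.layer1 = nn.Linear(8, 8)
        self.layer2 = nn.Linear(8, 8)
        self.layer3 = nn.Linear(8, 8)
        self.layer4 = nn.Linear(8, 8)
        self.fc = nn.Linear(8, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in (self.layer1, self.layer2, self.layer3, self.layer4):
            x = torch.relu(block(x))
        return self.fc(x)


# Head first, then backbone blocks from last to first.
UNFREEZE_ORDER = ["fc", "layer4", "layer3", "layer2", "layer1"]


def set_stage(model: nn.Module, stage: int) -> list[str]:
    """Freeze everything, then unfreeze the first `stage + 1` entries of UNFREEZE_ORDER."""
    for param in model.parameters():
        param.requires_grad = False
    active = UNFREEZE_ORDER[: stage + 1]
    for name in active:
        for param in getattr(model, name).parameters():
            param.requires_grad = True
    return active


model = TinyNet()
# Placeholder schedule: the LR drops as more of the backbone becomes trainable.
schedule = [(0, 1e-3), (1, 3e-4), (2, 1e-4), (4, 1e-5)]
for stage, lr in schedule:
    active = set_stage(model, stage)
    optimizer = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=lr
    )
    print(f"stage {stage}: training {active} at lr={lr}")
    # ... train for 1-2 epochs with this optimizer ...
```

Rebuilding the optimizer at each stage is the simplest option but discards AdamW's moment estimates; adding new blocks with optimizer.add_param_group() is an alternative that keeps them.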
Domain Gap
Similar domain (ImageNet → other natural images): All strategies work.
Different domain (ImageNet → medical X-rays): Early layers still useful; fine-tune more layers.
Very different domain (ImageNet → satellite imagery): May need to fine-tune from layer1 with low LR.
Catastrophic Forgetting
Fine-tuning on a small target dataset can overwrite pretrained features, so the model "forgets" what it learned during pretraining. Mitigations:
- Low LR for pretrained layers
- L2 regularization toward the original weights (L2-SP), or its Fisher-weighted variant, Elastic Weight Consolidation
- Progressive unfreezing
- Mix in pretraining data during fine-tuning
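As one illustration of the regularization option, the sketch below adds a plain squared-distance penalty that pulls the weights back toward their starting values. This is the unweighted form; Elastic Weight Consolidation additionally scales each squared difference by an estimate of the Fisher information, which is not shown. The helper name `l2_sp_penalty`, the small stand-in model, and the strength `alpha` are all assumptions.

```python
import torch
from torch import nn


def l2_sp_penalty(model: nn.Module, reference: dict[str, torch.Tensor]) -> torch.Tensor:
    """Sum of squared distances between current weights and their saved starting values."""
    penalty = torch.zeros(())
    for name, param in model.named_parameters():
        if name in reference and param.requires_grad:
            penalty = penalty + (param - reference[name]).pow(2).sum()
    return penalty


# Stand-in for a pretrained network (hypothetical; any nn.Module works).
model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2))

# Snapshot the starting weights before the first fine-tuning step.
reference = {name: p.detach().clone() for name, p in model.named_parameters()}

x = torch.randn(8, 4)
y = torch.randint(0, 2, (8,))
alpha = 1e-2  # hypothetical penalty strength; tune per task

task_loss = nn.functional.cross_entropy(model(x), y)
loss = task_loss + alpha * l2_sp_penalty(model, reference)
loss.backward()
```

The penalty is exactly zero before any update and grows as the weights drift, so it acts like weight decay centered on the pretrained solution rather than on zero.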
What the Lab Covers
| Function | Concept |
|---|---|
| build_feature_extractor() | Freeze backbone, replace head |
| partial_finetune() | Unfreeze last N layers progressively |
| discriminative_lr_optimizer() | Per-layer LRs |
| compare_transfer_vs_scratch() | Convergence curves: pretrained vs random init |
| lr_finder() | Find optimal LR range |
| domain_gap_experiment() | Accuracy vs dataset size curves |
Interview Questions
Q: When should you NOT use transfer learning?
A: When your domain is very different from the pretraining data AND you have a lot of data. For example, a medical segmentation model trained from scratch on 100K annotated scans can outperform ImageNet transfer. Transfer also helps less when the input modality differs (e.g., depth maps or multi-spectral images).
Q: How do you fine-tune efficiently on a single GPU?
A: (1) Freeze the backbone and train the head first. (2) Use discriminative LRs. (3) Use mixed precision. (4) Use gradient checkpointing for large backbones. (5) Accumulate gradients if the batch size that fits in memory is too small.
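Gradient accumulation, item (5) in the answer above, can be sketched as follows. The linear model, tensor shapes, and `accum_steps` value are arbitrary stand-ins; the point is that summing scaled micro-batch gradients reproduces the gradient of one large batch when the micro-batches are equal in size.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 3)  # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()  # mean reduction by default

inputs = torch.randn(16, 10)
targets = torch.randint(0, 3, (16,))
accum_steps = 4  # 4 micro-batches of 4 samples emulate one batch of 16

optimizer.zero_grad()
for micro_x, micro_y in zip(inputs.chunk(accum_steps), targets.chunk(accum_steps)):
    # Scale each micro-batch loss so the summed gradients match the full-batch mean loss.
    loss = loss_fn(model(micro_x), micro_y) / accum_steps
    loss.backward()  # gradients add up in .grad across micro-batches

accumulated_grad = model.weight.grad.clone()
optimizer.step()       # one parameter update per accum_steps micro-batches
optimizer.zero_grad()
```

This equivalence holds for layers whose output does not depend on the rest of the batch. BatchNorm in train mode still computes its statistics per micro-batch, so results there can differ from a true large batch.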
Q: What is the difference between model.eval() and torch.no_grad()?
A: model.eval() changes layer behavior: BatchNorm uses running statistics instead of batch statistics, and Dropout is disabled. torch.no_grad() turns off gradient tracking, which saves memory and speeds up the forward pass. The two are independent, so use both for evaluation and inference; torch.no_grad() alone leaves BatchNorm and Dropout in training mode.
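A short demonstration makes the independence of the two switches concrete. The two-layer model is an arbitrary stand-in chosen because Dropout makes the train/eval difference visible.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

# no_grad alone: no autograd graph is recorded, but Dropout is still active.
model.train()
with torch.no_grad():
    out_train_mode = model(x)
print(model.training, out_train_mode.requires_grad)  # True False

# eval alone: Dropout is disabled, but autograd still records the graph.
model.eval()
out_eval_mode = model(x)
print(model.training, out_eval_mode.requires_grad)  # False True

# Evaluation and inference normally combine both.
with torch.no_grad():
    out_a = model(x)
    out_b = model(x)
print(torch.equal(out_a, out_b))  # True: outputs repeat once Dropout is off
```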
Q: You have 500 images for a new classification task. What's your approach?
A: Start with ResNet-50 pretrained on ImageNet. Replace the final FC layer with a new head, freeze everything else, and train for 5-10 epochs at LR=1e-3. Then unfreeze layer4 and fine-tune at LR=1e-4 for 5 more epochs. Use aggressive augmentation (RandomHorizontalFlip, ColorJitter, RandomResizedCrop). As a rough expectation, accuracy lands around 85-95%, depending on how similar the task is to ImageNet.
Run
pip install -r requirements.txt
python solution.py
# Outputs saved to outputs/