Lab 04 — Transfer Learning & Fine-Tuning

Phase 3: PyTorch | Week 9

Transfer learning is arguably the single most impactful technique in practical CV. Most production vision models start from pretrained weights, typically ImageNet. Know every variant and tradeoff cold.


Learning Objectives

  • Understand feature extraction vs full fine-tuning vs discriminative learning rates
  • Fine-tune a pretrained ResNet-50 on a new task
  • Implement progressive unfreezing (ULMFiT-style for vision)
  • Quantify how much data you need for each transfer strategy
  • Handle domain gap between source and target datasets

Theory

Why Transfer Learning Works

ImageNet-pretrained networks learn a hierarchy of reusable features:

  • Early layers (conv1-conv3): edges, colors, textures — universal
  • Mid layers (conv4): parts, patterns — semi-universal
  • Late layers (conv5, FC): high-level semantics — task-specific

Reusing early/mid layers provides a strong initialization, especially when target data is limited.

Three Transfer Strategies

| Strategy | When | Trainable params | Data needed |
| --- | --- | --- | --- |
| Feature extraction | < 1K images, similar domain | Only new head | Very little |
| Partial fine-tuning | 1K-10K images | Last N layers + head | Moderate |
| Full fine-tuning | > 10K images, different domain | All layers | More |

Discriminative Learning Rates

Different layers should have different learning rates — earlier layers need less updating:

$$\eta_k = \frac{\eta_{\text{base}}}{\text{decay}^{(L-k)}}$$

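Here k indexes layer groups from the earliest (k = 1) to the head (k = L), so the head trains at the base LR and each earlier group is divided by one more factor of decay. A typical decay is 3. If base LR = 1e-3: the head gets 1e-3, the last block gets 3.3e-4, the block before it gets 1.1e-4, and so on.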

import torch

# Assumes `model` is a torchvision ResNet. LRs step by roughly 3x per block.
# The stem (conv1, bn1) is not listed, so this optimizer never updates it.
param_groups = [
    {'params': model.layer1.parameters(), 'lr': 1e-5},
    {'params': model.layer2.parameters(), 'lr': 3e-5},
    {'params': model.layer3.parameters(), 'lr': 1e-4},
    {'params': model.layer4.parameters(), 'lr': 3e-4},
    {'params': model.fc.parameters(),     'lr': 1e-3},
]
optimizer = torch.optim.AdamW(param_groups, weight_decay=1e-4)
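Rather than hard-coding each rate, the per-group LRs can be derived from the formula above. This is a sketch under the assumption that groups are passed in order from earliest to latest with the head last; `discriminative_param_groups` is a hypothetical helper, not the lab's `discriminative_lr_optimizer()`. A tiny stack of linear layers stands in for a real backbone.

```python
import torch
import torch.nn as nn


def discriminative_param_groups(groups, base_lr=1e-3, decay=3.0):
    """Build optimizer param groups with lr_k = base_lr / decay**(L - k).

    groups: modules ordered from earliest to latest; the last one is the head.
    """
    last = len(groups) - 1  # index of the head (plays the role of L)
    param_groups = []
    for k, module in enumerate(groups):
        params = [p for p in module.parameters() if p.requires_grad]
        if not params:
            continue  # skip fully frozen groups
        param_groups.append({"params": params, "lr": base_lr / decay ** (last - k)})
    return param_groups


# Stand-in for a backbone: three "blocks" followed by a head.
blocks = [nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 2)]
groups = discriminative_param_groups(blocks, base_lr=1e-3, decay=3.0)
optimizer = torch.optim.AdamW(groups, weight_decay=1e-4)
```

With a torchvision ResNet, the list would be `[model.layer1, model.layer2, model.layer3, model.layer4, model.fc]`.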

Progressive Unfreezing

  1. Freeze all except head → train 1-2 epochs
  2. Unfreeze last block → train 1-2 epochs
  3. Unfreeze more blocks → train with lower LR
  4. Unfreeze all → fine-tune at very low LR

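This reduces the risk of catastrophic forgetting: a randomly initialized head produces large, noisy gradients at first, and keeping the backbone frozen until the head has settled protects the pretrained weights.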
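The unfreezing schedule above can be sketched as a loop over stages. Everything here is illustrative: `set_trainable`, `progressive_unfreeze`, and the `train_one_stage` callback are hypothetical names, and the toy model and the recording callback stand in for a real backbone and training loop.

```python
import torch
import torch.nn as nn


def set_trainable(module, flag):
    """Enable or disable gradients for every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = flag


def progressive_unfreeze(model, stages, train_one_stage):
    """stages: list of (modules_to_unfreeze, lr, epochs), head first."""
    set_trainable(model, False)  # start fully frozen
    for modules, lr, epochs in stages:
        for m in modules:
            set_trainable(m, True)
        # Rebuild the optimizer so newly unfrozen parameters are included.
        params = [p for p in model.parameters() if p.requires_grad]
        optimizer = torch.optim.AdamW(params, lr=lr)
        train_one_stage(model, optimizer, epochs)


# Toy model: indices 0 and 2 act as "blocks", index 4 as the head.
model = nn.Sequential(
    nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 2)
)
log = []


def record(model, optimizer, epochs):
    # Placeholder for a real training loop: log how many tensors are trainable.
    log.append(sum(p.requires_grad for p in model.parameters()))


stages = [([model[4]], 1e-3, 2), ([model[2]], 3e-4, 2), ([model[0]], 1e-4, 2)]
progressive_unfreeze(model, stages, record)
```

For a torchvision ResNet, the stages might be `[model.fc]`, then `[model.layer4]`, then `[model.layer3, model.layer2]`, then `[model.layer1, model.conv1, model.bn1]`, each with a lower LR than the one before.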

Domain Gap

Similar domain (ImageNet → other natural images): All strategies work.
Different domain (ImageNet → medical X-rays): Early layers still useful; fine-tune more layers.
Very different domain (ImageNet → satellite imagery): May need to fine-tune from layer1 with low LR.

Catastrophic Forgetting

Fine-tuning on a small target dataset can overwrite pretrained features: large updates pull the weights away from the pretrained solution, and the model "forgets" what it learned. Mitigations:

  • Low LR for pretrained layers
  • L2 regularization toward the original weights (L2-SP); Elastic Weight Consolidation is a related variant that weights the penalty per parameter by Fisher information
  • Progressive unfreezing
  • Mix in pretraining data during fine-tuning
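As an illustration of regularizing toward the original weights, the sketch below adds a plain squared-distance penalty between the current and pretrained parameters (often called L2-SP). It is not EWC, which would additionally weight each term by a Fisher information estimate. The helper names `snapshot` and `l2_sp_penalty`, the `strength` value, and the `fc.` prefix used to skip the new head are all assumptions.

```python
import torch
import torch.nn as nn


def snapshot(model, skip_prefixes=("fc.",)):
    """Detached copy of the current weights, skipping the (new) head."""
    return {
        name: p.detach().clone()
        for name, p in model.named_parameters()
        if not name.startswith(skip_prefixes)
    }


def l2_sp_penalty(model, reference, strength=1e-2):
    """Return strength/2 * sum ||w - w0||^2 over parameters in `reference`."""
    penalty = 0.0
    for name, p in model.named_parameters():
        if name in reference and p.requires_grad:
            penalty = penalty + (p - reference[name]).pow(2).sum()
    return 0.5 * strength * penalty


# Toy usage: "backbone" is regularized, "fc" (the new head) is not.
model = nn.Sequential()
model.add_module("backbone", nn.Linear(3, 2))
model.add_module("fc", nn.Linear(2, 2))
reference = snapshot(model)  # taken right after loading pretrained weights

x, y = torch.randn(4, 3), torch.randint(0, 2, (4,))
loss = nn.functional.cross_entropy(model(x), y) + l2_sp_penalty(model, reference)
loss.backward()
```

The penalty is zero at the start of fine-tuning and grows as the backbone drifts, so it acts like a spring pulling the weights back toward the pretrained solution.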

What the Lab Covers

| Function | Concept |
| --- | --- |
| build_feature_extractor() | Freeze backbone, replace head |
| partial_finetune() | Unfreeze last N layers progressively |
| discriminative_lr_optimizer() | Per-layer LRs |
| compare_transfer_vs_scratch() | Convergence curves: pretrained vs random init |
| lr_finder() | Find optimal LR range |
| domain_gap_experiment() | Accuracy vs dataset size curves |

Interview Questions

Q: When should you NOT use transfer learning? A: When your domain is very different from pretraining data AND you have a lot of data. E.g., training a medical segmentation model from scratch with 100K annotated scans can outperform ImageNet transfer. Also, if input modality differs (e.g., depth maps, multi-spectral images).

Q: How do you fine-tune efficiently on a single GPU? A: (1) Freeze backbone, train head first. (2) Use discriminative LRs. (3) Mixed precision. (4) Gradient checkpointing for large backbones. (5) Accumulate gradients if batch size is too small.
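Point (5) in the answer above can be sketched as follows. This covers gradient accumulation only (mixed precision and checkpointing are left out to keep it short), and `train_with_accumulation` is a hypothetical helper. It assumes equally sized micro-batches and a loss that averages over the batch.

```python
import torch
import torch.nn as nn


def train_with_accumulation(model, batches, optimizer, loss_fn, accum_steps=4):
    """One pass over `batches`, stepping every `accum_steps` micro-batches.

    Returns the number of optimizer steps taken. Leftover micro-batches at
    the end (fewer than accum_steps) are not applied in this sketch.
    """
    model.train()
    optimizer.zero_grad()
    steps = 0
    for i, (x, y) in enumerate(batches, start=1):
        # Scale so the summed gradient equals the mean over the large batch.
        loss = loss_fn(model(x), y) / accum_steps
        loss.backward()  # gradients accumulate in .grad across micro-batches
        if i % accum_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
            steps += 1
    return steps


# Toy usage: 8 micro-batches of size 2 behave like 2 steps at batch size 8.
torch.manual_seed(0)
model = nn.Linear(3, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
batches = [(torch.randn(2, 3), torch.randn(2, 1)) for _ in range(8)]
n_steps = train_with_accumulation(model, batches, optimizer, nn.MSELoss())
```

One caveat: BatchNorm statistics are still computed per micro-batch, so accumulation does not fully reproduce large-batch behavior for BatchNorm-heavy backbones.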

Q: What is the difference between model.eval() and torch.no_grad()? A: model.eval() changes layer behavior: BatchNorm uses running statistics instead of batch statistics, and Dropout is disabled. torch.no_grad() turns off gradient tracking, which saves memory and speeds up the forward pass. The two are independent: eval() does not stop autograd from recording, and no_grad() does not change how BatchNorm or Dropout behave. Use both for validation and for inference.
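A small demonstration that the two switches are independent, using a toy model with Dropout. The variable names are illustrative.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

# model.eval() alone: Dropout is disabled, but autograd still records a graph.
model.eval()
out_eval = model(x)
eval_tracks_grad = out_eval.requires_grad        # True

# torch.no_grad() alone: no graph is recorded, but Dropout is still active
# because the model is in train mode.
model.train()
with torch.no_grad():
    out_nograd = model(x)
nograd_tracks_grad = out_nograd.requires_grad    # False
dropout_still_active = model[1].training         # True

# Validation and inference normally want both.
model.eval()
with torch.no_grad():
    out_both = model(x)
```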

Q: You have 500 images for a new classification task. What's your approach? A: Start with ResNet-50 pretrained on ImageNet. Replace the final FC layer with a new head for your classes and freeze everything else. Train the head for 5-10 epochs at LR=1e-3. Then unfreeze layer4 and fine-tune at LR=1e-4 for 5 more epochs. Use aggressive augmentation (RandomHorizontalFlip, ColorJitter, RandomResizedCrop). As a rough guide, expect 85-95% accuracy, depending on how similar the task is to ImageNet.


Run

pip install -r requirements.txt
python solution.py
# Outputs saved to outputs/