Lab 03 — ResNet from Scratch
Phase 3: PyTorch | Week 8-9
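ResNet made very deep networks trainable: skip connections give gradients a direct path through the network, easing the optimization problems (vanishing gradients among them) that had limited depth for years. Understanding skip connections is non-negotiable for any CV engineer interview.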
Learning Objectives
- Demonstrate the vanishing gradient problem experimentally
- Implement BasicBlock and BottleneckBlock from the original paper
- Build ResNet-18 and ResNet-50 from scratch
- Understand BatchNorm's role in deep network training
- Compare training dynamics: plain network vs ResNet
Theory
Vanishing Gradient Problem
For a network with $L$ layers, the gradient of the loss w.r.t. weights at layer $k$:
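$$\frac{\partial \mathcal{L}}{\partial W_k} = \frac{\partial \mathcal{L}}{\partial a_L} \cdot \left(\prod_{i=k+1}^{L} \frac{\partial a_i}{\partial a_{i-1}}\right) \cdot \frac{\partial a_k}{\partial W_k}$$
If each factor is $\frac{\partial a_i}{\partial a_{i-1}} = \sigma'(z_i) W_i$ and $|\sigma'| < 1$ (the sigmoid derivative never exceeds $1/4$ and approaches 0 when the unit saturates), the product shrinks exponentially with depth unless the weights are large enough to compensate, and the gradients vanish.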
ReLU helps ($\sigma'(z) = 1$ for $z > 0$), but multiplying many weight matrices still causes issues.
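To make the shrinking product concrete, here is a minimal sketch in plain Python. It assumes a toy scalar chain $a_i = \sigma(w \cdot a_{i-1})$ with one shared weight; the name `chain_gradient` and the chosen values are illustrative and not taken from the lab's solution.py.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, w=1.0, a0=0.5):
    """Return d a_depth / d a_0 for the toy chain a_i = sigmoid(w * a_{i-1})."""
    a, grad = a0, 1.0
    for _ in range(depth):
        z = w * a
        s = sigmoid(z)
        grad *= s * (1.0 - s) * w   # local factor sigma'(z) * w, at most 0.25 * |w|
        a = s
    return grad

for depth in (1, 5, 10, 20):
    print(depth, chain_gradient(depth))
```

With `w = 1.0` every factor is at most 0.25, so the gradient after 20 layers is bounded by $0.25^{20} \approx 10^{-12}$.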
Residual Block — The Key Idea
Instead of learning $H(x)$ directly, learn the residual:
$$H(x) = \mathcal{F}(x) + x$$
$$\mathcal{F}(x) = H(x) - x$$
If the optimal solution is close to identity, $\mathcal{F}(x) \approx 0$ — much easier to learn than $H(x) \approx x$.
Gradient flow: $$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial H} \cdot \left(\frac{\partial \mathcal{F}}{\partial x} + 1\right)$$
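The $+1$ is the identity path: even if $\frac{\partial \mathcal{F}}{\partial x} \approx 0$, the upstream gradient $\frac{\partial \mathcal{L}}{\partial H}$ still reaches $x$ unattenuated. This is why the skip connection is described as a gradient highway.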
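The effect of the $+1$ term can be sketched with a scalar toy model. The code below assumes a hypothetical branch $\mathcal{F}(x) = w \cdot \sigma(x)$ with a small weight, and compares the input gradient of a plain stack $x_i = \mathcal{F}(x_{i-1})$ with a residual stack $x_i = x_{i-1} + \mathcal{F}(x_{i-1})$. All names and values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def branch(x, w):
    """Toy residual branch F(x) = w * sigmoid(x) and its derivative dF/dx."""
    s = sigmoid(x)
    return w * s, w * s * (1.0 - s)

def input_gradient(depth, residual, w=0.1, x0=0.5):
    """d x_depth / d x_0 for a plain stack or a residual stack of toy branches."""
    x, grad = x0, 1.0
    for _ in range(depth):
        f, df = branch(x, w)
        if residual:
            x, grad = x + f, grad * (1.0 + df)   # skip path contributes the +1
        else:
            x, grad = f, grad * df               # only the branch derivative survives
    return grad

plain_grad = input_gradient(20, residual=False)
res_grad = input_gradient(20, residual=True)
print(f"plain: {plain_grad:.3e}  residual: {res_grad:.3e}")
```

In this toy setting the branch derivative is positive, so every residual factor is slightly above 1. In a real network $\frac{\partial \mathcal{F}}{\partial x}$ can be negative, and what the skip connection guarantees is the unattenuated identity term, not a lower bound on the total.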
BasicBlock (ResNet-18/34)
x → Conv3×3 → BN → ReLU → Conv3×3 → BN → (+x) → ReLU
When channels change: use 1×1 conv projection to match dimensions.
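import torch
from torch import nn

class BasicBlock(nn.Module):
    expansion = 1  # output channels = out_ch * expansion
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()  # must run before submodules are assigned
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        # 1x1 projection on the skip path when the shape changes, identity otherwise
        self.downsample = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
            nn.BatchNorm2d(out_ch)
        ) if stride != 1 or in_ch != out_ch else nn.Identity()

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + self.downsample(x))  # add the skip, then the final ReLU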
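The addition in a residual block only works if the main path and the `downsample` path produce the same spatial size. The sketch below checks this with the standard convolution output-size formula, using the kernel, stride, and padding values from the block above; the helper name `conv_out_size` and the sample input sizes are illustrative assumptions.

```python
def conv_out_size(size, kernel, stride, padding):
    """Spatial output size of a convolution: floor((size + 2*padding - kernel) / stride) + 1."""
    return (size + 2 * padding - kernel) // stride + 1

for size in (56, 28, 7):
    # Main path: conv1 (3x3, stride 2, padding 1) followed by conv2 (3x3, stride 1, padding 1).
    main = conv_out_size(conv_out_size(size, 3, 2, 1), 3, 1, 1)
    # Skip path: 1x1 projection with stride 2 and no padding.
    skip = conv_out_size(size, 1, 2, 0)
    print(size, main, skip)
    assert main == skip
```

Both paths halve the resolution (rounding up for odd sizes), so the tensors can be added element-wise.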
BottleneckBlock (ResNet-50/101/152)
x → Conv1×1 (reduce) → BN → ReLU
→ Conv3×3 → BN → ReLU
→ Conv1×1 (expand) → BN → (+x) → ReLU
Why bottleneck? It reduces the channel count before the expensive 3×3 conv, then restores it. For a 256-channel input, counting conv weights (FLOPs scale the same way, multiplied by the output's spatial size):
- BasicBlock: $256 \times 256 \times 3 \times 3 \times 2 \approx 1.2$M weights
- Bottleneck: $256{\times}64{\times}1^2 + 64{\times}64{\times}3^2 + 64{\times}256{\times}1^2 \approx 70$K weights
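The two totals in the list follow from one rule: a bias-free convolution has `in_ch * out_ch * kernel^2` weights, which is also its multiply-accumulate count per output position. A short sketch of that arithmetic, with `conv_weights` as a hypothetical helper name:

```python
def conv_weights(in_ch, out_ch, kernel):
    """Weights in a bias-free 2D convolution: in_ch * out_ch * kernel^2."""
    return in_ch * out_ch * kernel * kernel

# BasicBlock: two 3x3 convs at 256 channels.
basic = 2 * conv_weights(256, 256, 3)

# Bottleneck: 1x1 reduce (256 -> 64), 3x3 at 64 channels, 1x1 expand (64 -> 256).
bottleneck = (conv_weights(256, 64, 1)
              + conv_weights(64, 64, 3)
              + conv_weights(64, 256, 1))

print(basic, bottleneck, basic / bottleneck)
```

The exact totals are 1,179,648 and 69,632, a reduction of roughly 17×. BatchNorm parameters are not included in either count.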
BatchNorm in Deep Networks
$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \quad \rightarrow \quad y = \gamma \hat{x} + \beta$$
Benefits: (1) Originally motivated as reducing internal covariate shift, although that explanation is debated. (2) Acts as a mild regularizer, because batch statistics add noise. (3) Allows higher learning rates. (4) Makes the optimization landscape smoother.
Placement: BN after Conv, before ReLU (He et al. original). The pre-activation ordering (BN → ReLU → Conv) is sometimes better for very deep networks.
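The normalization formula in this section can be checked by hand on a single channel. The sketch below is a plain-Python stand-in for training-mode BatchNorm over one batch of scalars; the function name `batch_norm` and the sample values are illustrative assumptions.

```python
import math

def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's batch of values, then scale by gamma and shift by beta."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)   # biased batch variance
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]

batch = [2.0, 4.0, 6.0, 8.0]
out = batch_norm(batch, gamma=2.0, beta=1.0)

out_mean = sum(out) / len(out)
out_std = math.sqrt(sum((y - out_mean) ** 2 for y in out) / len(out))
print(out_mean, out_std)   # close to beta and gamma respectively
```

Whatever the input scale, the output has mean $\beta$ and standard deviation close to $\gamma$. At inference time `nn.BatchNorm2d` uses running statistics instead of batch statistics, which this sketch does not model.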
ResNet Architecture Summary
| Model | Block type | Params | Top-1 (ImageNet) |
|---|---|---|---|
| ResNet-18 | Basic | 11.7M | 69.8% |
| ResNet-34 | Basic | 21.8M | 73.3% |
| ResNet-50 | Bottleneck | 25.6M | 76.1% |
| ResNet-101 | Bottleneck | 44.5M | 77.4% |
| ResNet-152 | Bottleneck | 60.2M | 78.3% |
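The number in each model name counts weighted layers: the stem conv, every conv inside the blocks, and the final fully connected layer. The sketch below reproduces those numbers from the per-stage block counts in He et al.; the stage layouts are not given in this lab's text, and the names `CONFIGS` and `depth` are illustrative.

```python
# (convs per block, blocks in each of the four stages); layouts from He et al.
CONFIGS = {
    "ResNet-18":  (2, [2, 2, 2, 2]),    # BasicBlock: two 3x3 convs
    "ResNet-34":  (2, [3, 4, 6, 3]),
    "ResNet-50":  (3, [3, 4, 6, 3]),    # Bottleneck: 1x1, 3x3, 1x1
    "ResNet-101": (3, [3, 4, 23, 3]),
    "ResNet-152": (3, [3, 8, 36, 3]),
}

def depth(convs_per_block, blocks_per_stage):
    """Weighted layers: stem conv + block convs + final fully connected layer."""
    return 1 + convs_per_block * sum(blocks_per_stage) + 1

depths = {name: depth(*cfg) for name, cfg in CONFIGS.items()}
print(depths)
```

The 1×1 projection convs on the skip paths are not counted in the name.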
What the Lab Covers
| Function | Concept |
|---|---|
| vanishing_gradient_demo() | Gradient norms per layer, plain vs ResNet |
| BasicBlock | Exact paper implementation with downsample |
| BottleneckBlock | Channel reduction pipeline |
| ResNet18() / ResNet50() | Full architecture from scratch |
| batchnorm_effect_demo() | Training stability with/without BN |
| layer_activation_stats() | Mean/std of activations across depth |
Interview Questions
Q: Why doesn't a deeper plain network always perform better?
A: Optimization difficulty, not expressiveness. A 56-layer plain net has higher training error than a 20-layer one (He et al. 2015). Skip connections provide direct gradient paths, enabling effective training.
Q: ResNet-50 is far deeper than VGG-16, yet it is cheaper to run and more accurate. Why?
A: Bottleneck blocks are computationally efficient: the 1×1 convolutions reduce and then restore the channel count around each 3×3 conv. This allows 50 layers with fewer FLOPs than VGG-16's 16 layers (roughly 3.8B vs 15.5B FLOPs), and the skip connections keep those 50 layers trainable.
Q: What is the role of bias=False when using BatchNorm?
A: BatchNorm has its own learned bias $\beta$. The conv bias is redundant and would be subtracted out by BN's mean normalization — so we omit it to save parameters.
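The redundancy is easy to verify numerically: adding a constant to every value in a batch shifts the batch mean by the same constant, so the normalized output is unchanged. A minimal plain-Python sketch, in which the `normalize` helper and the sample numbers are hypothetical:

```python
import math

def normalize(xs, eps=1e-5):
    """Batch-normalize a list of values (gamma = 1, beta = 0)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return [(x - mu) / math.sqrt(var + eps) for x in xs]

conv_out = [0.3, -1.2, 2.5, 0.9]          # hypothetical pre-BN conv outputs for one channel
with_bias = [v + 4.0 for v in conv_out]   # the same outputs with a conv bias of 4.0 added

max_diff = max(abs(a - b) for a, b in zip(normalize(conv_out), normalize(with_bias)))
print(max_diff)   # zero up to floating-point rounding
```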
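Q: How does torch.compile() speed up ResNet training?
A: It captures the model as a graph and generates fused kernels for chains of pointwise and reduction ops (for example, BatchNorm's scale-and-shift together with the following ReLU), which removes memory round trips between operations and cuts Python overhead. Convolutions are still typically dispatched to vendor libraries such as cuDNN. A 10-30% speedup on an A100 is a plausible ballpark, but the gain depends on model, batch size, and hardware, so measure it.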
Run
pip install -r requirements.txt
python solution.py
# Outputs saved to outputs/