Lab 03 — ResNet from Scratch

Phase 3: PyTorch | Week 8-9

ResNet made very deep networks trainable: its skip connections keep gradients flowing through dozens of layers, removing an obstacle that had limited network depth for years. Understanding skip connections is non-negotiable for any CV engineer interview.


Learning Objectives

  • Demonstrate the vanishing gradient problem experimentally
  • Implement BasicBlock and BottleneckBlock from the original paper
  • Build ResNet-18 and ResNet-50 from scratch
  • Understand BatchNorm's role in deep network training
  • Compare training dynamics: plain network vs ResNet

Theory

Vanishing Gradient Problem

For a network with $L$ layers and activations $a_i$, the gradient of the loss w.r.t. the weights at layer $k$ is:

$$\frac{\partial \mathcal{L}}{\partial W_k} = \frac{\partial \mathcal{L}}{\partial a_L} \cdot \left(\prod_{i=k+1}^{L} \frac{\partial a_i}{\partial a_{i-1}}\right) \cdot \frac{\partial a_k}{\partial W_k}$$

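With $a_i = \sigma(z_i)$ and $z_i = W_i a_{i-1}$, each factor is $\frac{\partial a_i}{\partial a_{i-1}} = \mathrm{diag}(\sigma'(z_i))\, W_i$. The sigmoid has $\sigma'(z) \le 0.25$ everywhere and $\sigma'(z) \to 0$ when it saturates, so unless the weights are large the product shrinks exponentially with depth and the gradients vanish.

ReLU helps ($\sigma'(z) = 1$ for $z > 0$), but the repeated product of weight matrices can still shrink or blow up the gradient in a deep network.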
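
To get a feel for how fast the product shrinks, here is a minimal scalar sketch. It assumes one unit per layer and a fixed weight `w`, which is a deliberate simplification; `chain_gradient` is a hypothetical helper, not part of the lab code. It multiplies the per-layer factors $\sigma'(z_i) \cdot w$ along a sigmoid chain.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def chain_gradient(depth, w=1.0, a0=0.5):
    """Return d a_depth / d a_0 for the scalar chain a_i = sigmoid(w * a_{i-1})."""
    a, grad = a0, 1.0
    for _ in range(depth):
        a_next = sigmoid(w * a)
        # sigma'(z) = sigma(z) * (1 - sigma(z)), which never exceeds 0.25
        grad *= a_next * (1.0 - a_next) * w
        a = a_next
    return grad

for depth in (1, 5, 10, 20):
    print(f"depth={depth:2d}  gradient={chain_gradient(depth):.3e}")
```

With `w = 1` every factor is at most 0.25, so the gradient after 20 layers is bounded by $0.25^{20}$, which is below $10^{-12}$.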

Residual Block — The Key Idea

Instead of learning $H(x)$ directly, learn the residual:

$$H(x) = \mathcal{F}(x) + x$$

$$\mathcal{F}(x) = H(x) - x$$

If the optimal solution is close to identity, $\mathcal{F}(x) \approx 0$ — much easier to learn than $H(x) \approx x$.

Gradient flow: $$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial H} \cdot \left(\frac{\partial \mathcal{F}}{\partial x} + 1\right)$$

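The $+1$ (the identity matrix when $x$ is a vector) gives the gradient a direct path: even if $\frac{\partial \mathcal{F}}{\partial x} \approx 0$, we still get $\frac{\partial \mathcal{L}}{\partial x} \approx \frac{\partial \mathcal{L}}{\partial H}$, so the upstream gradient passes through unattenuated. Stacked skip connections act as a gradient highway.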
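
The effect of the identity path can be seen by comparing the first-layer gradient of a deep plain stack with that of the same stack plus skip connections. This is a rough sketch using fully connected sigmoid layers rather than convolutions; `Stack` and `first_layer_grad_norm` are hypothetical names, and the depth, width, and seed are arbitrary choices.

```python
import torch
import torch.nn as nn

class Stack(nn.Module):
    """Deep stack of Linear + sigmoid layers, optionally with skip connections."""
    def __init__(self, depth, width, residual):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(width, width) for _ in range(depth))
        self.residual = residual

    def forward(self, x):
        for layer in self.layers:
            out = torch.sigmoid(layer(x))
            x = x + out if self.residual else out   # H(x) = F(x) + x  vs  H(x) = F(x)
        return x

def first_layer_grad_norm(residual, depth=30, width=32, seed=0):
    torch.manual_seed(seed)                 # same weights and input for both variants
    model = Stack(depth, width, residual)
    x = torch.randn(16, width)
    model(x).sum().backward()
    return model.layers[0].weight.grad.norm().item()

plain = first_layer_grad_norm(residual=False)
res = first_layer_grad_norm(residual=True)
print(f"plain: {plain:.3e}   residual: {res:.3e}")
```

In the plain stack the gradient reaching the first layer is many orders of magnitude smaller than in the residual one, which is the behaviour the gradient-flow equation predicts.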

BasicBlock (ResNet-18/34)

x → Conv3×3 → BN → ReLU → Conv3×3 → BN → (+x) → ReLU

When the channel count or the spatial size changes (stride ≠ 1), use a 1×1 conv projection on the shortcut to match dimensions.

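import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    expansion = 1  # output channels = out_ch * expansion

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()  # required before registering submodules
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_ch)
        # 1x1 projection on the shortcut when spatial size or channel count changes
        self.downsample = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
            nn.BatchNorm2d(out_ch)
        ) if stride != 1 or in_ch != out_ch else nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.downsample(x))  # add shortcut, then final ReLU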

BottleneckBlock (ResNet-50/101/152)

x → Conv1×1 (reduce) → BN → ReLU
  → Conv3×3           → BN → ReLU
  → Conv1×1 (expand)  → BN → (+x) → ReLU
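
The diagram above could be implemented along the following lines. This is a minimal sketch mirroring the BasicBlock pattern, not necessarily the lab's solution. Assumptions: `expansion = 4` as in the paper, a `mid_ch` argument naming the reduced width, and the stride placed on the 3×3 conv (the torchvision convention; the original paper strides the first 1×1 conv).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckBlock(nn.Module):
    expansion = 4  # the block outputs mid_ch * 4 channels

    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.conv1 = nn.Conv2d(in_ch, mid_ch, 1, bias=False)              # 1x1 reduce
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, 3, stride, 1, bias=False)  # 3x3 at reduced width
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, 1, bias=False)             # 1x1 expand
        self.bn3 = nn.BatchNorm2d(out_ch)
        # Project the shortcut only when its shape differs from the main path
        self.downsample = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                          nn.BatchNorm2d(out_ch))
            if stride != 1 or in_ch != out_ch else nn.Identity()
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + self.downsample(x))

block = BottleneckBlock(in_ch=256, mid_ch=64)
y = block(torch.randn(2, 256, 8, 8))
print(tuple(y.shape))
```

With `in_ch=256` and `mid_ch=64` the shortcut is a plain identity, since input and output shapes already match; passing `stride=2` or a different `in_ch` switches it to the 1×1 projection.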

Why bottleneck? It reduces the channel count before the expensive 3×3 conv, then restores it. For a 256-channel input, counting conv weights (equivalently, multiply-accumulates per output position):

  • BasicBlock (two 3×3 convs): $256 \times 256 \times 3 \times 3 \times 2 \approx 1.18$M weights
  • Bottleneck: $256{\times}64{\times}1^2 + 64{\times}64{\times}3^2 + 64{\times}256{\times}1^2 \approx 70$K weights
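
The arithmetic above can be checked in a few lines. This sketch counts convolution weights only (no BatchNorm parameters, no bias); `conv_weights` is a hypothetical helper introduced for illustration.

```python
def conv_weights(in_ch, out_ch, k):
    """Number of weights in a k x k convolution without bias."""
    return in_ch * out_ch * k * k

# BasicBlock at 256 channels: two 3x3 convolutions
basic = 2 * conv_weights(256, 256, 3)

# Bottleneck at 256 channels: 1x1 reduce to 64, 3x3 at 64, 1x1 expand to 256
bottleneck = (conv_weights(256, 64, 1)
              + conv_weights(64, 64, 3)
              + conv_weights(64, 256, 1))

print(f"basic={basic:,}  bottleneck={bottleneck:,}  ratio={basic / bottleneck:.1f}x")
```

The bottleneck design needs roughly 17 times fewer weights for the same input and output width.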

BatchNorm in Deep Networks

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \quad \rightarrow \quad y = \gamma \hat{x} + \beta$$

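Benefits: (1) Originally motivated as reducing internal covariate shift, although later analyses question this explanation and point to (4) instead. (2) Acts as a mild regularizer, because batch statistics add noise. (3) Allows higher learning rates. (4) Makes the optimization landscape smoother.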

Placement: BN after Conv, before ReLU (He et al. original). The pre-activation variant (BN → ReLU → Conv) is sometimes better for very deep networks.
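
As a sanity check on the formula, the normalization can be computed by hand and compared against `nn.BatchNorm2d` in training mode. A small sketch with an arbitrary input shape; statistics are taken per channel over the batch and spatial dimensions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 4, 5, 5)     # (batch, channels, height, width)
bn = nn.BatchNorm2d(4)          # gamma is initialised to 1, beta to 0
bn.train()                      # use batch statistics, not running averages

# Per-channel mean and (biased) variance over batch and spatial dims
mu = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)

x_hat = (x - mu) / torch.sqrt(var + bn.eps)
y_manual = bn.weight.view(1, -1, 1, 1) * x_hat + bn.bias.view(1, -1, 1, 1)

max_diff = (bn(x) - y_manual).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

Note that the normalization uses the biased variance estimate; in `eval()` mode the layer switches to its running averages and the two results would no longer match.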


ResNet Architecture Summary

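| Model      | Blocks     | Params | Top-1 (ImageNet) |
|------------|------------|--------|------------------|
| ResNet-18  | Basic      | 11.7M  | 69.8%            |
| ResNet-34  | Basic      | 21.8M  | 73.3%            |
| ResNet-50  | Bottleneck | 25.6M  | 76.1%            |
| ResNet-101 | Bottleneck | 44.5M  | 77.4%            |
| ResNet-152 | Bottleneck | 60.2M  | 78.3%            |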
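
The ResNet-18 row can be reproduced with a compact from-scratch model. This is a sketch under assumptions, not the lab's reference solution: it uses the standard ImageNet layout (7×7 stem, four stages of two BasicBlocks, 1000 classes), and the attribute names `stem`, `stages`, and `fc` are chosen here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.downsample = (
            nn.Sequential(nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                          nn.BatchNorm2d(out_ch))
            if stride != 1 or in_ch != out_ch else nn.Identity()
        )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.downsample(x))

class ResNet18(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        # Stem: 7x7 stride-2 conv, then 3x3 stride-2 max pool
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, 2, 3, bias=False), nn.BatchNorm2d(64),
            nn.ReLU(inplace=True), nn.MaxPool2d(3, 2, 1),
        )
        stages, in_ch = [], 64
        # Four stages of two blocks each; stages 2 to 4 halve the resolution
        for out_ch, stride in [(64, 1), (128, 2), (256, 2), (512, 2)]:
            stages += [BasicBlock(in_ch, out_ch, stride), BasicBlock(out_ch, out_ch)]
            in_ch = out_ch
        self.stages = nn.Sequential(*stages)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.stages(self.stem(x))
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)   # global average pool
        return self.fc(x)

model = ResNet18()
n_params = sum(p.numel() for p in model.parameters())
logits = model(torch.randn(2, 3, 64, 64))
print(f"{n_params / 1e6:.1f}M parameters, output shape {tuple(logits.shape)}")
```

Counting parameters this way is a quick check that a from-scratch implementation matches the published architecture before any training is done.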

What the Lab Covers

| Function                  | Concept                                    |
|---------------------------|--------------------------------------------|
| vanishing_gradient_demo() | Gradient norms per layer, plain vs ResNet  |
| BasicBlock                | Exact paper implementation with downsample |
| BottleneckBlock           | Channel reduction pipeline                 |
| ResNet18() / ResNet50()   | Full architecture from scratch             |
| batchnorm_effect_demo()   | Training stability with/without BN         |
| layer_activation_stats()  | Mean/std of activations across depth       |

Interview Questions

Q: Why doesn't a deeper plain network always perform better? A: Optimization difficulty, not expressiveness. A 56-layer plain net has higher training error than a 20-layer one (He et al. 2015). Skip connections provide direct gradient paths, enabling effective training.

Q: ResNet-50 is far deeper than VGG-16, yet it is cheaper to run and more accurate. Why? A: Bottleneck blocks are computationally efficient: the 1×1 convolutions reduce and then restore the channel count around each 3×3 conv. This allows 50 layers with fewer FLOPs than VGG-16's 16 layers (3.8B vs 15.5B FLOPs), and the skip connections keep the deeper network trainable.

Q: What is the role of bias=False when using BatchNorm? A: BatchNorm has its own learned bias $\beta$. The conv bias is redundant and would be subtracted out by BN's mean normalization — so we omit it to save parameters.
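
The redundancy of the conv bias is easy to verify numerically. In this sketch (arbitrary shapes, BN in training mode) the same convolution is run with and without its bias, and the outputs after BatchNorm are compared.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(8, 3, 16, 16)
conv = nn.Conv2d(3, 6, 3, padding=1, bias=True)
nn.init.normal_(conv.bias, std=1.0)    # make the bias clearly non-zero
bn = nn.BatchNorm2d(6).train()         # normalises with per-channel batch statistics

with torch.no_grad():
    with_bias = bn(conv(x))
    # Same weights, but the bias is dropped
    without_bias = bn(F.conv2d(x, conv.weight, bias=None, padding=1))

max_diff = (with_bias - without_bias).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
```

A per-channel constant shifts the batch mean by exactly that constant, so subtracting $\mu_B$ removes it and the two outputs agree up to floating-point error.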

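Q: How does torch.compile() speed up ResNet training? A: It captures the model as a graph and generates fused kernels (e.g., merging the normalization and ReLU steps that follow a convolution), eliminating memory round trips between operations. The gain depends on the model, batch size, and hardware; 10-30% on an A100 is a typical figure.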


Run

pip install -r requirements.txt
python solution.py
# Outputs saved to outputs/