Lab 03 — ResNet from Scratch

Phase 3: PyTorch | Week 8-9

ResNet made networks with tens to hundreds of layers trainable by easing the gradient-flow and optimization problems that had stalled deeper models for years. Understanding skip connections is non-negotiable for any CV engineer interview.

Learning Objectives

Demonstrate the vanishing gradient problem experimentally
Implement BasicBlock and BottleneckBlock from the original paper
Build ResNet-18 and ResNet-50 from scratch
Understand BatchNorm's role in deep network training
Compare training dynamics: plain network vs ResNet

Theory

Vanishing Gradient Problem

For a network with $L$ layers and activations $a_i$, the gradient of the loss w.r.t. the weights at layer $k$ is:

$$\frac{\partial \mathcal{L}}{\partial W_k} = \frac{\partial \mathcal{L}}{\partial a_L} \cdot \left(\prod_{i=k+1}^{L} \frac{\partial a_i}{\partial a_{i-1}}\right) \cdot \frac{\partial a_k}{\partial W_k}$$

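Each factor is $\frac{\partial a_i}{\partial a_{i-1}} = \sigma'(z_i) W_i$. For sigmoid, $\sigma'(z) \le \frac{1}{4}$ everywhere and approaches 0 when the unit saturates, so unless the weights are large the product shrinks exponentially with depth → gradients vanish.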

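ReLU helps ($\sigma'(z) = 1$ for $z > 0$), but the product still contains every $W_i$, so it can shrink or explode depending on the scale of the weights, and units with $z \le 0$ pass no gradient at all.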
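To put numbers on the exponential shrinkage, the sketch below evaluates only the activation part of the product, assuming every pre-activation sits at $z = 0$, where the sigmoid derivative is at its maximum. The helper names are illustrative and the weight matrices are ignored, so this is a best-case bound, not a simulation of a real network.

```python
import math


def sigmoid_prime(z: float) -> float:
    """Derivative of the logistic sigmoid: s(z) * (1 - s(z))."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def best_case_factor(depth: int) -> float:
    """Product of sigma'(0) over `depth` layers, ignoring the W_i terms."""
    return sigmoid_prime(0.0) ** depth


for depth in (5, 10, 20, 50):
    print(f"depth={depth:3d}  activation factor={best_case_factor(depth):.3e}")
```

Even in this best case the factor at depth 20 is below $10^{-12}$, which is why saturating activations alone are enough to starve early layers of gradient.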

Residual Block — The Key Idea

Instead of learning $H(x)$ directly, learn the residual:

$$H(x) = \mathcal{F}(x) + x$$

$$\mathcal{F}(x) = H(x) - x$$

If the optimal mapping is close to the identity, the block only needs to push $\mathcal{F}(x)$ toward 0, which is much easier than fitting $H(x) \approx x$ through a stack of nonlinear layers.

Gradient flow: $$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial H} \cdot \left(\frac{\partial \mathcal{F}}{\partial x} + 1\right)$$

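The $+1$ (the identity matrix $I$ in the multi-dimensional case) comes from the skip connection: even if $\frac{\partial \mathcal{F}}{\partial x} \approx 0$, the upstream gradient $\frac{\partial \mathcal{L}}{\partial H}$ still reaches $x$ unattenuated. This is the gradient highway.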
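The effect of the skip term can be checked with autograd. The sketch below is a toy setup chosen for illustration, not the lab's `vanishing_gradient_demo()`: it stacks 50 small `Linear` + `tanh` layers (with PyTorch's default initialization) and compares the gradient norm that reaches the input with and without the added identity path.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
DEPTH, WIDTH = 50, 16  # arbitrary toy sizes


def input_grad_norm(residual: bool) -> float:
    """Gradient norm at the input of a deep tanh stack, with or without skips."""
    layers = [nn.Linear(WIDTH, WIDTH) for _ in range(DEPTH)]
    x = torch.randn(4, WIDTH, requires_grad=True)
    h = x
    for layer in layers:
        f = torch.tanh(layer(h))      # F(h)
        h = h + f if residual else f  # H(h) = F(h) + h  vs  H(h) = F(h)
    h.sum().backward()
    return x.grad.norm().item()


grad_plain = input_grad_norm(residual=False)
grad_res = input_grad_norm(residual=True)
print(f"plain: {grad_plain:.3e}   residual: {grad_res:.3e}")
```

In the plain stack the input gradient is a product of 50 small Jacobians and collapses toward zero; with skips each Jacobian is $I + \frac{\partial \mathcal{F}}{\partial x}$, so the signal survives.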

BasicBlock (ResNet-18/34)

x → Conv3×3 → BN → ReLU → Conv3×3 → BN → (+x) → ReLU

When channels change: use 1×1 conv projection to match dimensions.

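import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    expansion = 1  # output channels = out_ch * expansion

    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()  # must run before submodules are assigned
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1   = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2   = nn.BatchNorm2d(out_ch)
        # 1x1 projection when the shortcut's shape differs from the main path
        self.downsample = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
            nn.BatchNorm2d(out_ch)
        ) if stride != 1 or in_ch != out_ch else nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.downsample(x))  # add shortcut, then ReLU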
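The projection has to match the main path in spatial size as well as in channel count, which is why it carries the same stride as the first 3×3 conv. A quick check using the standard convolution output-size formula (dilation 1); the helper name and the 56×56 example size are illustrative:

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a convolution with dilation 1."""
    return (size + 2 * padding - kernel) // stride + 1


size, stride = 56, 2

# Main path: Conv3x3(stride, pad=1) followed by Conv3x3(stride=1, pad=1)
main = conv_out_size(conv_out_size(size, 3, stride, 1), 3, 1, 1)

# Shortcut: Conv1x1(stride, pad=0)
shortcut = conv_out_size(size, 1, stride, 0)

print(main, shortcut)  # 28 28
```

Both paths land on the same size, so the element-wise addition is well defined.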

BottleneckBlock (ResNet-50/101/152)

x → Conv1×1 (reduce) → BN → ReLU
  → Conv3×3           → BN → ReLU
  → Conv1×1 (expand)  → BN → (+x) → ReLU
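A minimal sketch of the pipeline above, mirroring the structure of `BasicBlock`. Two choices here are assumptions rather than something this lab specifies: the constructor takes the reduced width `mid_ch` and expands by 4, and the stride sits on the 3×3 conv (the original paper puts it on the first 1×1 conv, while many later implementations move it).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BottleneckBlock(nn.Module):
    expansion = 4  # output channels = mid_ch * expansion

    def __init__(self, in_ch, mid_ch, stride=1):
        super().__init__()
        out_ch = mid_ch * self.expansion
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)   # reduce
        self.bn1 = nn.BatchNorm2d(mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, stride=stride,
                               padding=1, bias=False)                      # 3x3 at low width
        self.bn2 = nn.BatchNorm2d(mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False)  # expand
        self.bn3 = nn.BatchNorm2d(out_ch)
        if stride != 1 or in_ch != out_ch:
            # Projection shortcut when the shapes differ
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        else:
            self.downsample = nn.Identity()

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return F.relu(out + self.downsample(x))


block = BottleneckBlock(in_ch=256, mid_ch=64)
y = block(torch.randn(2, 256, 8, 8))
print(y.shape)  # torch.Size([2, 256, 8, 8])
```

With `in_ch=256, mid_ch=64` the shortcut is a plain identity, since input and output shapes already agree.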

Why bottleneck? It reduces the channel count before the expensive 3×3 conv, then restores it. For a 256-channel input, counting conv weights (equivalently, multiply-accumulates per output position):

BasicBlock: $256 \times 256 \times 3 \times 3 \times 2 \approx 1.2$M weights
Bottleneck: $256{\times}64{\times}1^2 + 64{\times}64{\times}3^2 + 64{\times}256{\times}1^2 \approx 70$K weights
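The two counts above can be reproduced with a few lines of arithmetic. Each term $C_{in} \times C_{out} \times k^2$ is the number of weights in one bias-free conv layer; the helper name is illustrative.

```python
def conv_weights(c_in: int, c_out: int, k: int) -> int:
    """Number of weights in a k x k convolution without bias."""
    return c_in * c_out * k * k


# Two 3x3 convs at full width (256 channels)
basic = 2 * conv_weights(256, 256, 3)

# 1x1 reduce -> 3x3 at 64 channels -> 1x1 expand
bottleneck = (
    conv_weights(256, 64, 1)
    + conv_weights(64, 64, 3)
    + conv_weights(64, 256, 1)
)

print(f"basic:      {basic:,}")       # 1,179,648
print(f"bottleneck: {bottleneck:,}")  # 69,632
print(f"ratio:      {basic / bottleneck:.1f}x")
```

At this width the bottleneck design is roughly 17 times cheaper per block, which is what makes 50+ layer networks affordable.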

BatchNorm in Deep Networks

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} \quad \rightarrow \quad y = \gamma \hat{x} + \beta$$
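As a worked instance of the formula, the sketch below normalizes a single channel over a tiny batch in plain Python. It assumes training-mode behavior, where $\mu_B$ and $\sigma_B^2$ are the mean and biased variance of the current batch; the function name and example values are illustrative.

```python
import math


def batch_norm(xs, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel over the batch, then scale by gamma and shift by beta."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)  # biased (population) variance
    return [gamma * (x - mu) / math.sqrt(var + eps) + beta for x in xs]


batch = [2.0, 4.0, 6.0, 8.0]
y = batch_norm(batch)
y_mean = sum(y) / len(y)
y_var = sum((v - y_mean) ** 2 for v in y) / len(y)
print(f"mean={y_mean:.6f}  var={y_var:.6f}")

# gamma and beta let the network undo or rescale the normalization
shifted = batch_norm(batch, gamma=2.0, beta=3.0)
print(f"mean with beta=3: {sum(shifted) / len(shifted):.6f}")
```

The normalized output has mean 0 and variance just under 1 (because of $\epsilon$), and $\beta$ sets the output mean directly.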

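Benefits: (1) Originally motivated as reducing internal covariate shift, although later work disputes that this is the main mechanism. (2) Acts as a mild regularizer, because batch statistics add noise. (3) Allows higher learning rates. (4) Makes the optimization landscape smoother.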

Placement: BN after Conv, before ReLU (He et al. original). The pre-activation variant (BN → ReLU → Conv, with nothing applied after the addition) is sometimes better for very deep networks.

ResNet Architecture Summary

Model	Blocks	Params	Top-1 (ImageNet)
ResNet-18	Basic	11.7M	69.8%
ResNet-34	Basic	21.8M	73.3%
ResNet-50	Bottleneck	25.6M	76.1%
ResNet-101	Bottleneck	44.5M	77.4%
ResNet-152	Bottleneck	60.2M	78.3%
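The ResNet-18 row can be sanity-checked by assembling the network and counting parameters. The sketch below assumes the standard ImageNet layout from the paper (7×7 stem, four stages of two BasicBlocks at 64/128/256/512 channels, global average pooling, 1000-way classifier). The class name `ResNet18Sketch` is illustrative and is not the lab's `ResNet18()`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, 1, 1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.downsample = nn.Identity()
        if stride != 1 or in_ch != out_ch:
            self.downsample = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + self.downsample(x))


class ResNet18Sketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 64, 7, stride=2, padding=3, bias=False),
            nn.BatchNorm2d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
        )
        stages, in_ch = [], 64
        for i, out_ch in enumerate((64, 128, 256, 512)):
            stride = 1 if i == 0 else 2  # every stage after the first halves H and W
            stages += [BasicBlock(in_ch, out_ch, stride), BasicBlock(out_ch, out_ch)]
            in_ch = out_ch
        self.stages = nn.Sequential(*stages)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(512, num_classes)

    def forward(self, x):
        x = self.pool(self.stages(self.stem(x)))
        return self.fc(torch.flatten(x, 1))


model = ResNet18Sketch()
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # 11.7M parameters

logits = model(torch.randn(2, 3, 64, 64))  # small input keeps the check fast
print(logits.shape)  # torch.Size([2, 1000])
```

Because of the adaptive average pool, the same model accepts inputs smaller than 224×224, which is convenient for quick shape tests.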

What the Lab Covers

Function	Concept
`vanishing_gradient_demo()`	Gradient norms per layer, plain vs ResNet
`BasicBlock`	Exact paper implementation with downsample
`BottleneckBlock`	Channel reduction pipeline
`ResNet18()` / `ResNet50()`	Full architecture from scratch
`batchnorm_effect_demo()`	Training stability with/without BN
`layer_activation_stats()`	Mean/std of activations across depth

Q: Why doesn't a deeper plain network always perform better? A: Optimization difficulty, not expressiveness. A 56-layer plain net has higher training error than a 20-layer one (He et al. 2015). Skip connections provide direct gradient paths, enabling effective training.

Q: ResNet-50 is much deeper than VGG-16, yet it is cheaper to run and more accurate. Why? A: Bottleneck blocks are computationally efficient: the 1×1 convolutions reduce and then restore the channel count around each 3×3 conv. This allows 50 layers with fewer FLOPs than VGG-16's 16 layers (3.8B vs 15.5B FLOPs), and the skip connections keep the extra depth trainable.

Q: What is the role of bias=False when using BatchNorm? A: BatchNorm has its own learned bias $\beta$. The conv bias is redundant and would be subtracted out by BN's mean normalization — so we omit it to save parameters.
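The cancellation can be verified directly. The sketch below builds two convs with identical weights, one with a deliberately large bias and one without, and passes both through the same training-mode BatchNorm; tensor sizes and the bias value are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(8, 3, 16, 16)

conv_with_bias = nn.Conv2d(3, 4, 3, padding=1, bias=True)
conv_no_bias = nn.Conv2d(3, 4, 3, padding=1, bias=False)
with torch.no_grad():
    conv_no_bias.weight.copy_(conv_with_bias.weight)  # same filters
    conv_with_bias.bias.fill_(5.0)                    # deliberately large bias

bn = nn.BatchNorm2d(4)  # training mode: normalizes with batch statistics

with torch.no_grad():
    out_with_bias = bn(conv_with_bias(x))
    out_no_bias = bn(conv_no_bias(x))

max_diff = (out_with_bias - out_no_bias).abs().max().item()
print(f"max abs difference after BN: {max_diff:.2e}")
```

The per-channel constant added by the bias shifts the batch mean by the same amount, so subtracting $\mu_B$ removes it and the two outputs agree up to floating-point error.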

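Q: How does torch.compile() speed up ResNet training? A: It captures the model as a graph and generates fused kernels, mainly for elementwise and reduction ops (for example, the BN normalization followed by ReLU), which removes memory roundtrips between operations and cuts Python overhead. Convolutions themselves usually remain vendor library calls. Speedups of roughly 10-30% on an A100 are a reasonable expectation, but the gain depends on model, batch size, and hardware.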

Run

pip install -r requirements.txt
python solution.py
# Outputs saved to outputs/
