Lab 02: Faster R-CNN — Two-Stage Object Detection
Motivation: Why Two-Stage?
Single-stage detectors (e.g., YOLO) are fast but tend to lose accuracy on small or densely packed objects. Two-stage detectors split detection into two questions:
- Stage 1 — Region Proposal Network (RPN): "Where could objects be?"
- Stage 2 — RoI Head: "What exactly is this object and where precisely?"
This separation allows specialized optimization of localization vs classification.
Architecture Deep Dive
Input Image (H × W × 3)
│
Backbone (e.g., ResNet-50-FPN)
├─ C1: /2 (stride 2)
├─ C2: /4 (stride 4)
├─ C3: /8 (stride 8)
├─ C4: /16 (stride 16)
└─ C5: /32 (stride 32)
│
FPN (Feature Pyramid Network)
• Top-down pathway + lateral connections
• Produces: P2, P3, P4, P5, P6
• Each Pi resolves objects at a different scale
│
RPN (Region Proposal Network)
• Slides 3×3 conv over each feature map level
• At each location: 3 aspect ratios × 1 scale per FPN level = 3 anchors (the original single-level RPN used 3 ratios × 3 scales = 9)
• Outputs per anchor: objectness score (fg/bg) + bbox delta
│
RoI Align
• Project each proposal back to feature map coordinates
• Sample a 2×2 grid of bilinearly interpolated points in each output bin
• Output: fixed 7×7 feature map per proposal
│
Box Head (FC layers)
├─ Classifier: Softmax over (C+1) classes (background = class 0)
└─ Regressor: 4×C box deltas (class-specific regression)
Region Proposal Network (RPN)
Anchor Generation
For each spatial location $(i, j)$ on a feature map of stride $s$:
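- Center: $(i \cdot s + s/2,\; j \cdot s + s/2)$
- Scales (anchor areas): $\{32^2, 64^2, 128^2, 256^2, 512^2\}$ pixels²; with FPN, one scale is assigned to each level P2–P6
- Aspect ratios: $\{1:2,\; 1:1,\; 2:1\}$
Total anchors: $H/s \times W/s \times A$ per feature map level, where $A$ is the number of anchors per location ($A = 3$ with one scale per FPN level, $A = 9$ in the original single-level RPN with 3 scales × 3 ratios).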
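The anchor layout above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function name `generate_anchors` and its parameters are hypothetical, and it assumes that the ratios are height/width and that every anchor keeps an area of `size**2`.

```python
import math

def generate_anchors(feat_h, feat_w, stride, sizes, ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) tuples for one feature map level.

    Assumptions: i indexes x and j indexes y, ratios are height/width,
    and each anchor keeps area size**2.
    """
    anchors = []
    for j in range(feat_h):
        for i in range(feat_w):
            # Anchor center in image pixels, as in the Center formula
            cx = i * stride + stride / 2
            cy = j * stride + stride / 2
            for size in sizes:
                for r in ratios:
                    w = size / math.sqrt(r)
                    h = size * math.sqrt(r)
                    anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors

# A 2x3 feature map at stride 16 with a single 32-pixel scale
anchors = generate_anchors(2, 3, 16, sizes=(32,))
print(len(anchors))  # feat_h * feat_w * len(sizes) * len(ratios)
```

Passing one size per call mirrors the one-scale-per-level FPN setup; passing three sizes gives the nine-anchor layout.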
RPN Loss
$$\mathcal{L}_{RPN} = \frac{1}{N_{cls}} \sum_i \mathcal{L}_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* \mathcal{L}_{reg}(t_i, t_i^*)$$
- $p_i$: predicted objectness probability for anchor $i$
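- $p_i^*$: ground-truth label: 1 if the anchor's IoU with a GT box exceeds 0.7 (or it is the highest-IoU anchor for some GT box), 0 if its IoU is below 0.3 with every GT box; anchors in between are ignored. Because $p_i^*$ multiplies the regression term, only positive anchors contribute to it
- $t_i$, $t_i^*$: predicted and ground-truth box parameterizations
- $N_{cls}$, $N_{reg}$: normalization constants; $\lambda$: weight balancing the two terms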
- $\mathcal{L}_{reg}$: Smooth L1 loss (robust to outliers)
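The IoU thresholds that define $p_i^*$ can be illustrated with a small sketch. The helper names `iou` and `label_anchors` are hypothetical, and marking in-between anchors as -1 (ignored) is an assumption that follows common practice. The sketch also leaves out the extra rule that promotes the highest-IoU anchor of each GT box to positive.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return one label per anchor: 1 = foreground, 0 = background, -1 = ignored."""
    labels = []
    for a in anchors:
        # Best overlap of this anchor with any ground-truth box
        best = max((iou(a, g) for g in gt_boxes), default=0.0)
        if best > pos_thr:
            labels.append(1)
        elif best < neg_thr:
            labels.append(0)
        else:
            labels.append(-1)  # neither clearly object nor clearly background
    return labels

gt = [(0, 0, 10, 10)]
print(label_anchors([(0, 0, 10, 10), (0, 0, 10, 5), (20, 20, 30, 30)], gt))
```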
Box Parameterization
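$$t_x = (x - x_a) / w_a, \quad t_y = (y - y_a) / h_a$$
$$t_w = \log(w / w_a), \quad t_h = \log(h / h_a)$$
Here $(x, y, w, h)$ are the center, width, and height of the predicted box, and the subscript $a$ marks the anchor; the targets $t^*$ use the GT box in place of the prediction. Dividing by the anchor size makes the offsets scale-invariant, and regressing log-ratios guarantees a positive width and height once the prediction is decoded with $\exp$.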
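A short sketch of the parameterization and its inverse, assuming boxes are given in center format `(cx, cy, w, h)`. The function names `encode` and `decode` are illustrative only.

```python
import math

def encode(box, anchor):
    """Map a box to (tx, ty, tw, th) relative to an anchor; both are (cx, cy, w, h)."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(deltas, anchor):
    """Invert encode: recover (cx, cy, w, h) from deltas and the anchor."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    # exp() keeps width and height positive for any real-valued tw, th
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (50.0, 50.0, 100.0, 100.0)
box = (60.0, 40.0, 200.0, 50.0)
deltas = encode(box, anchor)
print(deltas)
print(decode(deltas, anchor))  # recovers box up to floating-point rounding
```

Even a strongly negative `tw` or `th` decodes to a small but positive size, which is the point of regressing in log space.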
Smooth L1 Loss
$$\text{SmoothL1}(x) = \begin{cases} 0.5 x^2 & |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$
Advantage over L2: linear for large errors (not dominated by outliers), quadratic for small errors (smooth gradient near 0).
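A direct transcription of the piecewise definition, together with its derivative, shows why large errors do not dominate. The function names are illustrative, and the transition point is fixed at 1 as in the formula above.

```python
def smooth_l1(x):
    """Smooth L1 loss for a scalar residual x."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def smooth_l1_grad(x):
    """Derivative of smooth_l1: equal to x near zero, clipped to +/-1 elsewhere."""
    if abs(x) < 1:
        return x
    return 1.0 if x > 0 else -1.0

for r in (0.1, 0.5, 2.0, 10.0):
    # Compare against the L2 loss 0.5 * r**2, whose gradient grows without bound
    print(r, smooth_l1(r), 0.5 * r * r, smooth_l1_grad(r))
```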
RoI Align vs RoI Pooling
RoI Pooling (Faster R-CNN original):
- Quantizes proposal coordinates to feature map grid
- Causes misalignment: rounding can shift the pooled region by a fraction of a feature-map cell, which at stride 16 corresponds to several image pixels
- Hurts small-object detection and instance segmentation
RoI Align (Mask R-CNN):
- No quantization — uses bilinear interpolation
- Divides RoI into fixed-size grid (e.g., 7×7)
- For each cell, samples 4 points with bilinear interpolation and aggregates them (average or max)
- Eliminates misalignment → crucial for segmentation
$$\text{RoIAlign}(x, y) = \sum_{ij} w_{ij} \cdot \text{feature}(x_i, y_j)$$
where $w_{ij}$ are bilinear interpolation weights.
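The sampling procedure can be sketched for a single-channel feature map stored as a nested list. This is a simplified illustration under stated assumptions: feature values sit at integer coordinates (real implementations differ in their half-pixel conventions), sample points are averaged, and the names `bilinear` and `roi_align` are hypothetical.

```python
import math

def bilinear(feat, x, y):
    """Sample feat[row][col] at continuous (x, y) with bilinear weights."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    return (feat[y0][x0] * (1 - dx) * (1 - dy) + feat[y0][x1] * dx * (1 - dy)
            + feat[y1][x0] * (1 - dx) * dy + feat[y1][x1] * dx * dy)

def roi_align(feat, roi, stride, out_size=7, samples=2):
    """roi = (x1, y1, x2, y2) in image pixels; returns an out_size x out_size grid."""
    # Project to feature-map coordinates WITHOUT rounding
    x1, y1, x2, y2 = (c / stride for c in roi)
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            total = 0.0
            # samples x samples regularly spaced points inside this bin
            for sy in range(samples):
                for sx in range(samples):
                    px = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    py = y1 + (by + (sy + 0.5) / samples) * bin_h
                    total += bilinear(feat, px, py)
            row.append(total / (samples * samples))
        out.append(row)
    return out

# Feature map whose value equals its column index, so outputs are easy to read
feat = [[float(c) for c in range(8)] for _ in range(8)]
print(roi_align(feat, (1.0, 1.0, 5.0, 5.0), stride=1, out_size=2))
print(roi_align(feat, (1.3, 1.0, 5.3, 5.0), stride=1, out_size=2))
```

Shifting the RoI by 0.3 pixels shifts the pooled values by the same amount, whereas a quantizing RoI Pooling would snap both RoIs to the same cells.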
FPN (Feature Pyramid Network)
Solves scale variation: small objects need high-resolution features, large objects need semantic features.
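# Top-down pathway (pseudocode): 1x1 lateral conv on each Ci, 2x upsampling of the coarser level
P5 = conv1x1(C5)
P4 = conv1x1(C4) + upsample(P5)  # lateral connection + top-down signal
P3 = conv1x1(C3) + upsample(P4)
P2 = conv1x1(C2) + upsample(P3)
# A 3x3 conv then smooths each merged map; P6 is a stride-2 subsampling of P5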
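Assignment rule: a proposal of width $w$ and height $h$ (area $A = wh$) is pooled from level $P_k$, with $$k = k_0 + \lfloor \log_2(\sqrt{A} / 224) \rfloor$$ where $k_0 = 4$ in the FPN paper (a 224×224 proposal maps to $P_4$) and $k$ is clamped to the available levels.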
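A small sketch of the assignment rule. The function name `fpn_level` is illustrative; the defaults $k_0 = 4$ and the clamp to levels 2 through 5 follow the FPN paper's setup and are assumptions relative to the text above.

```python
import math

def fpn_level(w, h, k0=4, k_min=2, k_max=5, canonical=224):
    """Pyramid level for a proposal of size w x h (image pixels)."""
    k = k0 + math.floor(math.log2(math.sqrt(w * h) / canonical))
    # Clamp so that very small or very large proposals still map to a valid level
    return max(k_min, min(k_max, k))

for size in (32, 112, 224, 448, 1000):
    print(size, fpn_level(size, size))
```

Halving the side length of a proposal moves it one level down the pyramid, to a feature map with twice the resolution.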
Interview Questions
Q: What's the role of anchor boxes in Faster R-CNN? Are they still needed?
A: Anchors define a prior distribution over object shapes. The RPN predicts offsets from anchors rather than absolute coordinates, which makes training easier because the network only needs to learn small corrections. Modern detectors like FCOS and YOLOv8 are anchor-free: they predict box extents directly from each feature-map location. The trade-off: anchor-based detectors require careful anchor design (scales, ratios, IoU thresholds) but tend to train stably; anchor-free detectors are simpler and tend to cope better with unusual aspect ratios.
Q: Why does Faster R-CNN use separate losses for RPN and RoI head?
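A: The two stages solve different problems, so each gets its own classification and regression terms. The RPN does class-agnostic foreground/background scoring and coarse localization over all anchors, where high recall matters most. The RoI head classifies each proposal into one of C object classes (80 for COCO) plus background and refines its box. The original paper trained the stages with a 4-step alternating schedule; common implementations now sum the four losses (RPN objectness, RPN box, RoI classification, RoI box) and train end-to-end with a single optimizer. Keeping the terms separate lets each stage be sampled and weighted independently, so neither one dominates the gradient.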
Q: How does Non-Maximum Suppression reduce redundant proposals in the RPN?
A: The RPN scores roughly 100K anchors, and NMS prunes the decoded proposals to at most 2000 for training (300 at test time). Process: (1) decode the box deltas into proposals and keep only the top-scoring candidates (a pre-NMS top-k rather than a fixed score threshold), (2) clip to the image boundary, (3) remove very small boxes (< 16px), (4) sort the remainder by objectness score, (5) greedily keep a box only if its IoU with every previously kept box is below 0.7. This removes near-duplicate proposals around the same object while retaining proposals for distinct objects.
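The greedy step (5) can be sketched in plain Python. This is a minimal O(n²) illustration with hypothetical names (`iou`, `nms`). It suppresses a box when its IoU with an already kept box exceeds the threshold, and the optional `top_n` cap stands in for the 2000/300 proposal limits.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thr=0.7, top_n=None):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap any kept box too strongly
        if all(iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
            if top_n is not None and len(keep) == top_n:
                break
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 10, 9)]
scores = [0.9, 0.8, 0.7, 0.6]
print(nms(boxes, scores, iou_thr=0.7))
print(nms(boxes, scores, iou_thr=0.5))
```

Lowering the threshold from 0.7 to 0.5 also suppresses the second box, whose IoU with the top-scoring box is about 0.68.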