Lab 02: Faster R-CNN — Two-Stage Object Detection

Motivation: Why Two-Stage?

Single-stage detectors such as YOLO are fast but have historically lost accuracy on small or densely packed objects. Two-stage detectors split detection into two steps:

  1. Stage 1 — Region Proposal Network (RPN): "Where could objects be?"
  2. Stage 2 — RoI Head: "What exactly is this object and where precisely?"

This separation lets the first stage optimize for recall over candidate regions while the second stage optimizes for precise classification and box refinement.


Architecture Deep Dive

Input Image (H × W × 3)
        │
   Backbone (e.g., ResNet-50-FPN)
   ├─ C1: /2    (stride 2)
   ├─ C2: /4    (stride 4)
   ├─ C3: /8    (stride 8)
   ├─ C4: /16   (stride 16)
   └─ C5: /32   (stride 32)
        │
   FPN (Feature Pyramid Network)
   • Top-down pathway + lateral connections
   • Produces: P2, P3, P4, P5, P6
   • Each Pi resolves objects at a different scale
        │
   RPN (Region Proposal Network)
   • Slides 3×3 conv over each feature map level
   • At each location: 3 aspect ratios × 3 scales = 9 anchors on a single feature map; with FPN, each level uses one scale, giving 3 anchors per location
   • Outputs per anchor: objectness score (fg/bg) + bbox delta
        │
   RoI Align
   • Project each proposal back to feature map coordinates
   • Sample a 2×2 grid of bilinearly interpolated points in each RoI bin
   • Output: fixed 7×7 feature map per proposal
        │
   Box Head (FC layers)
   ├─ Classifier: Softmax over (C+1) classes (background = class 0)
   └─ Regressor: 4×C box deltas (class-specific regression)

Region Proposal Network (RPN)

Anchor Generation

For each spatial location $(i, j)$ on a feature map of stride $s$:

  • Center: $(i \cdot s + s/2,\; j \cdot s + s/2)$
  • Scales: $\{32^2, 64^2, 128^2, 256^2, 512^2\}$ pixels²
  • Aspect ratios: $\{1:2,\; 1:1,\; 2:1\}$

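Total anchors per feature map level: $(H/s) \times (W/s) \times k$, where $k$ is the number of anchors per location ($k = 9$ for 3 scales × 3 aspect ratios on a single feature map; $k = 3$ when FPN assigns one of the five scales to each level).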
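The anchor layout above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function name `anchors_at`, the default scales, and the convention that a ratio means height / width are assumptions made for the example.

```python
import math

def anchors_at(i, j, stride, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x1, y1, x2, y2) centered on feature-map cell (i, j).

    Assumptions for this sketch: each scale is the side length of a square
    with the anchor's area, and each ratio is height / width.
    """
    cx = i * stride + stride / 2
    cy = j * stride + stride / 2
    boxes = []
    for scale in scales:
        area = scale * scale
        for ratio in ratios:
            w = math.sqrt(area / ratio)   # keeps w * h == area
            h = w * ratio
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

anchors = anchors_at(i=0, j=0, stride=16)
print(len(anchors))   # 9 anchors: 3 scales x 3 ratios
print(anchors[1])     # (-56.0, -56.0, 72.0, 72.0): the 128x128 square anchor
```

With FPN, the same function would be called once per level with a single scale, for example `scales=(32,)` on the stride-4 level, which yields 3 anchors per location.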

RPN Loss

$$\mathcal{L}_{RPN} = \frac{1}{N_{cls}} \sum_i \mathcal{L}_{cls}(p_i, p_i^*) + \lambda \frac{1}{N_{reg}} \sum_i p_i^* \mathcal{L}_{reg}(t_i, t_i^*)$$

  • $p_i$: predicted objectness probability for anchor $i$
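  • $p_i^*$: 1 if the anchor's IoU with a GT box exceeds 0.7, 0 if its IoU with every GT box is below 0.3; anchors in between are ignored during training
  • $t_i$, $t_i^*$: predicted and ground-truth box parameterizations
  • $\mathcal{L}_{reg}$: Smooth L1 loss (robust to outliers); the factor $p_i^*$ applies it to positive anchors only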
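The labeling rule for $p_i^*$ can be sketched as follows. The helper names `iou` and `label_anchors` are hypothetical, and the sketch deliberately omits the additional rule from the original paper that marks the highest-IoU anchor for each GT box as positive.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_anchors(anchors, gt_boxes, pos_thresh=0.7, neg_thresh=0.3):
    """Return one label per anchor: 1 (foreground), 0 (background), -1 (ignored)."""
    labels = []
    for anchor in anchors:
        best = max((iou(anchor, gt) for gt in gt_boxes), default=0.0)
        if best > pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)   # contributes to neither loss term
    return labels

gt = [(0, 0, 100, 100)]
candidates = [(0, 0, 100, 100), (0, 0, 100, 50), (200, 200, 300, 300)]
print(label_anchors(candidates, gt))   # [1, -1, 0]
```

The middle anchor has IoU 0.5 with the GT box, so it falls in the ignored band between the two thresholds.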

Box Parameterization

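$$t_x = (x - x_a) / w_a, \quad t_y = (y - y_a) / h_a$$
$$t_w = \log(w / w_a), \quad t_h = \log(h / h_a)$$

Here $(x, y, w, h)$ are the box center, width, and height, and the subscript $a$ marks the corresponding anchor values.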

The log parameterization for width and height guarantees a positive decoded size ($w = w_a e^{t_w}$), and normalizing by the anchor dimensions makes the regression targets scale-invariant.
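A round trip through the parameterization makes the positivity argument concrete. This is a small sketch; `encode` and `decode` are illustrative names, and boxes are given as (center x, center y, width, height).

```python
import math

def encode(box, anchor):
    """Map a (cx, cy, w, h) box to deltas (tx, ty, tw, th) relative to an anchor."""
    x, y, w, h = box
    xa, ya, wa, ha = anchor
    return ((x - xa) / wa, (y - ya) / ha, math.log(w / wa), math.log(h / ha))

def decode(deltas, anchor):
    """Invert encode(): apply deltas to an anchor to recover a (cx, cy, w, h) box."""
    tx, ty, tw, th = deltas
    xa, ya, wa, ha = anchor
    return (xa + tx * wa, ya + ty * ha, wa * math.exp(tw), ha * math.exp(th))

anchor = (100.0, 100.0, 64.0, 128.0)
box = (110.0, 90.0, 128.0, 64.0)
deltas = encode(box, anchor)
recovered = decode(deltas, anchor)

# Any real-valued tw, even a very negative one, decodes to a positive width.
tiny = decode((0.0, 0.0, -5.0, -5.0), anchor)
```

Because the decoder exponentiates `tw` and `th`, the network can output any real number without producing a box of negative size.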

Smooth L1 Loss

$$\text{SmoothL1}(x) = \begin{cases} 0.5 x^2 & |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases}$$

Advantage over L2: linear for large errors (not dominated by outliers), quadratic for small errors (smooth gradient near 0).
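A direct transcription of the piecewise definition, together with its gradient, shows the difference from L2 numerically. The function names are chosen for this sketch only.

```python
def smooth_l1(x):
    """Smooth L1 loss: quadratic for |x| < 1, linear otherwise."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5

def smooth_l1_grad(x):
    """Derivative of smooth_l1: equals x near zero, clipped to +/-1 beyond."""
    if abs(x) < 1:
        return x
    return 1.0 if x > 0 else -1.0

def l2(x):
    """Plain squared-error loss with the same 0.5 scaling, for comparison."""
    return 0.5 * x * x

print(smooth_l1(0.5), l2(0.5))    # 0.125 0.125  (identical for small errors)
print(smooth_l1(10.0), l2(10.0))  # 9.5 50.0     (outlier costs far less)
print(smooth_l1_grad(10.0))       # 1.0          (gradient magnitude is capped)
```

The capped gradient is what keeps a single badly matched box from dominating a training step.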


RoI Align vs RoI Pooling

RoI Pooling (Faster R-CNN original):

  • Quantizes proposal coordinates to feature map grid
  • Causes misalignment: rounding can shift the pooled region by roughly one feature-map cell, which corresponds to a full stride (e.g., 16 px) in the image
  • Hurts small-object detection and instance segmentation

RoI Align (Mask R-CNN):

  • No quantization — uses bilinear interpolation
  • Divides RoI into fixed-size grid (e.g., 7×7)
  • For each cell, samples 4 points with bilinear interpolation
  • Eliminates misalignment → crucial for segmentation

$$\text{RoIAlign}(x, y) = \sum_{ij} w_{ij} \cdot \text{feature}(x_i, y_j)$$

where $w_{ij}$ are the bilinear interpolation weights of the four feature-map cells surrounding the sampling point $(x, y)$.
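The sampling procedure can be sketched on a single-channel feature map stored as a list of rows. This is an illustration under stated assumptions rather than any library's implementation: `bilinear` and `roi_align` are hypothetical names, the RoI is assumed to be already expressed in feature-map coordinates, cell values are assumed to sit at integer coordinates, and the samples in each bin are averaged.

```python
import math

def bilinear(feature, x, y):
    """Sample a 2-D feature map at a continuous point (x, y)."""
    h, w = len(feature), len(feature[0])
    x = min(max(x, 0.0), w - 1.0)          # clamp to the valid range
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    return (feature[y0][x0] * (1 - dx) * (1 - dy)
            + feature[y0][x1] * dx * (1 - dy)
            + feature[y1][x0] * (1 - dx) * dy
            + feature[y1][x1] * dx * dy)

def roi_align(feature, roi, out_size=2, samples=2):
    """Average samples x samples bilinear samples in each of out_size^2 bins."""
    x1, y1, x2, y2 = roi                   # no rounding of RoI coordinates
    bin_w = (x2 - x1) / out_size
    bin_h = (y2 - y1) / out_size
    out = []
    for by in range(out_size):
        row = []
        for bx in range(out_size):
            total = 0.0
            for sy in range(samples):
                for sx in range(samples):
                    px = x1 + (bx + (sx + 0.5) / samples) * bin_w
                    py = y1 + (by + (sy + 0.5) / samples) * bin_h
                    total += bilinear(feature, px, py)
            row.append(total / (samples * samples))
        out.append(row)
    return out

# A 4x4 map whose value encodes position: feature[y][x] = x + 10 * y.
feature = [[x + 10 * y for x in range(4)] for y in range(4)]
pooled = roi_align(feature, roi=(0.5, 0.5, 2.5, 2.5))
```

Because bilinear interpolation reproduces linear functions exactly, each pooled value here equals the map's value at the bin center (about 11, 12, 21, and 22), even though the RoI corners fall between cells. RoI Pooling would first round those corners to integers.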


FPN (Feature Pyramid Network)

Solves scale variation: small objects need high-resolution features, large objects need semantic features.

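# Top-down pathway with lateral connections (pseudocode)
P5 = conv1x1(C5)
P4 = conv1x1(C4) + upsample2x(P5)  # lateral 1×1 conv + upsampled coarser level
P3 = conv1x1(C3) + upsample2x(P4)
P2 = conv1x1(C2) + upsample2x(P3)
# Each merged map is then smoothed with a 3×3 conv; P6 is a stride-2 subsampling of P5.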

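Assignment rule: a proposal of area $A$ (width × height, in input-image pixels) is pooled from level $P_k$, with
$$k = k_0 + \lfloor \log_2(\sqrt{A} / 224) \rfloor$$
where $k_0 = 4$ is the level assigned to a 224×224 proposal (the canonical ImageNet pretraining size), and $k$ is clamped to the available levels (2 to 5).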
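The assignment rule is easy to check numerically. The sketch below assumes the FPN paper's convention of $k_0 = 4$ with levels clamped to P2 through P5; the function name `fpn_level` is illustrative.

```python
import math

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Pick the pyramid level for a proposal of size w x h (input-image pixels)."""
    k = k0 + math.floor(math.log2(math.sqrt(w * h) / canonical))
    return max(k_min, min(k_max, k))       # clamp to the levels that exist

print(fpn_level(224, 224))    # 4: the canonical size maps to k0
print(fpn_level(112, 112))    # 3: half the side length, one level finer
print(fpn_level(448, 448))    # 5: double the side length, one level coarser
print(fpn_level(32, 32))      # 2: very small boxes are clamped to P2
```

Each halving of the proposal's side length moves it one level toward the higher-resolution end of the pyramid, which is how small objects end up on P2.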


Interview Questions

Q: What's the role of anchor boxes in Faster R-CNN? Are they still needed?

A: Anchors define a prior over object shapes and scales. The RPN predicts offsets from anchors rather than absolute coordinates, which makes training easier because the network only has to learn small corrections. They are no longer strictly needed: anchor-free detectors such as FCOS and YOLOv8 predict box extents directly from each feature-map location. The trade-off: anchor-based designs require careful tuning of anchor scales and ratios for the dataset but tend to train stably; anchor-free designs remove those hyperparameters and tend to cope better with unusual aspect ratios.

Q: Why does Faster R-CNN use separate losses for RPN and RoI head?

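A: Each stage has a different optimization target. The RPN must separate foreground from background and roughly localize objects, so it is class-agnostic, sees many anchors, and is tuned for high recall. The RoI head must classify each proposal among all object classes (e.g., 80 on COCO) and refine its box precisely. The original paper trained the two stages with a 4-step alternating schedule; most current implementations instead train end-to-end by summing the RPN and RoI head losses. If the terms are combined without balancing, one stage's loss can dominate the gradients of the shared backbone.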

Q: How does Non-Maximum Suppression reduce redundant proposals in the RPN?

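A: The RPN scores every anchor (about 20K for a ~1000×600 image on a single feature map, and considerably more with FPN), and NMS reduces these to at most 2000 proposals for training (300 at test time in the original paper). Process: (1) decode each anchor's predicted deltas into a proposal box, (2) clip to the image boundary, (3) remove very small boxes (< 16 px), (4) sort by objectness score and keep the top-scoring candidates, (5) greedily keep a box only if its IoU with every previously kept box is below 0.7. Note that 0.7 is the NMS IoU threshold, not a score threshold. This shrinks the candidate set to about 2000 proposals while preserving spatial diversity.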
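The greedy step of NMS can be sketched in pure Python. This is a quadratic-time illustration for clarity, not an optimized implementation; `iou` and `nms` are names chosen for the example, and boxes are (x1, y1, x2, y2).

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.7, top_k=2000):
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
            if len(keep) == top_k:
                break
    return keep

boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # [0, 2]
```

The second box overlaps the first with IoU of about 0.82, so it is suppressed; the third box does not overlap either and survives despite its lower score.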