Lab 04: Mask R-CNN — Instance Segmentation

Overview

Mask R-CNN extends Faster R-CNN by adding a mask branch: a small FCN (Fully Convolutional Network) that predicts a binary segmentation mask for each detected object independently.

Faster R-CNN Head
├── Box classifier (C+1 classes)
├── Box regressor (C×4 deltas)
└── [NEW] Mask head: FCN → 28×28 binary mask per class

The key insight: decouple mask prediction from class prediction. The mask head predicts K masks for each proposal, one per foreground class (K is the same count written as C in the head diagram above), but only the mask corresponding to the predicted class is used at inference.

Architecture

Input Image
     │
  FPN Backbone (ResNet-50-FPN or ResNet-101-FPN)
     │
  RPN → Region Proposals
     │
  RoI Align (7×7 for box head, 14×14 for mask head)
     │
  ┌──────────────────────────────┐
  │ Box Head (FC layers)         │ → class scores + box deltas
  └──────────────────────────────┘
  ┌──────────────────────────────┐
  │ Mask Head (FCN)              │ → K × 28×28 masks
  │ 4× (256-ch conv3×3 → ReLU)  │
  │ Transposed conv 2× upsample  │
  │ 1×1 conv → K binary masks    │
  └──────────────────────────────┘
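To see how the mask head turns a 14×14 RoI feature into 28×28 masks, the spatial sizes can be traced with the standard convolution size formulas. This is a minimal sketch: the padding of 1 on the 3×3 convs, the 2×2 stride-2 transposed conv, and K = 80 (COCO) are assumptions chosen to match the diagram, not values stated in it.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output spatial size of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride=1, padding=0):
    """Output spatial size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def mask_head_shapes(roi_size=14, num_classes=80, channels=256):
    """Trace (channels, height, width) through the mask head in the diagram."""
    shapes = [("roi_align", (channels, roi_size, roi_size))]
    size = roi_size
    for i in range(4):
        # Assumed: 3x3 conv with padding 1, so the spatial size is preserved.
        size = conv_out(size, kernel=3, padding=1)
        shapes.append((f"conv{i + 1}", (channels, size, size)))
    # Assumed: 2x2 transposed conv with stride 2, which doubles the size.
    size = deconv_out(size, kernel=2, stride=2)
    shapes.append(("deconv", (channels, size, size)))
    # 1x1 conv maps the channels to one logit map per class.
    size = conv_out(size, kernel=1)
    shapes.append(("mask_logits", (num_classes, size, size)))
    return shapes


for name, shape in mask_head_shapes():
    print(f"{name:12s} {shape}")
```

The last entry is the K × 28 × 28 logit tensor the diagram ends with; only the transposed conv changes the spatial size.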

Why RoI Align (not RoI Pooling) is critical for masks

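For bounding boxes, a 1-2 pixel misalignment is tolerable. For segmentation masks at 28×28 resolution, even half-pixel misalignment causes boundary artifacts. RoI Pooling rounds RoI coordinates to the feature-map grid, which introduces exactly this kind of misalignment. RoI Align skips the rounding and samples the feature map at fractional coordinates with bilinear interpolation, so the extracted features stay aligned with the input pixels.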
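The difference between the two lookups can be illustrated on a tiny feature map. This is a simplified sketch, not a full RoI Align: it takes a single sample at one fractional coordinate, whereas real implementations average several samples per output bin. The function names are hypothetical, and coordinates are assumed to lie inside the feature map.

```python
import math


def bilinear_sample(feat, y, x):
    """Sample a 2D feature map (list of rows) at a fractional (y, x)."""
    h, w = len(feat), len(feat[0])
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)  # clamp at the border
    dy, dx = y - y0, x - x0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feat, y, x):
    """RoI-Pooling-style lookup: snap the coordinate down to the integer grid."""
    return feat[int(math.floor(y))][int(math.floor(x))]


feat = [[0.0, 10.0, 20.0, 30.0]]           # 1 x 4 ramp
aligned = bilinear_sample(feat, 0.0, 1.5)  # 15.0, halfway between 10 and 20
snapped = quantized_sample(feat, 0.0, 1.5) # 10.0, the half-pixel offset is lost
```

On the ramp, the quantized lookup returns the value half a pixel to the left of where the RoI actually points, which is the misalignment the paragraph above describes.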

Mask Prediction Details

The mask head predicts logits for K classes, size 28×28 per RoI.

Training: For each proposal:

Only train the mask branch on positive proposals (IoU ≥ 0.5 with a ground-truth box)
Use the GT class to select which mask channel to compute the loss on
Loss: per-pixel sigmoid BCE between that 28×28 channel and the GT mask cropped to the proposal and resized to 28×28
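The training steps above can be sketched for a single positive RoI as follows. This is an illustrative plain-Python version using nested lists; `mask_loss_for_roi` and its argument layout are hypothetical names, and a real implementation would use batched tensor operations instead.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def mask_loss_for_roi(mask_logits, gt_class, gt_mask, eps=1e-7):
    """Mean per-pixel BCE computed on the GT-class channel only.

    mask_logits: [K][H][W] raw logits for one RoI.
    gt_class:    index of the ground-truth class for this RoI.
    gt_mask:     [H][W] target of 0/1 values, already cropped and resized.
    """
    channel = mask_logits[gt_class]  # the other K-1 channels are ignored
    total, count = 0.0, 0
    for logit_row, target_row in zip(channel, gt_mask):
        for z, y in zip(logit_row, target_row):
            p = min(max(sigmoid(z), eps), 1.0 - eps)  # avoid log(0)
            total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
            count += 1
    return total / count
```

Because only `mask_logits[gt_class]` enters the sum, changing the logits of any other channel leaves the loss for this RoI unchanged.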

Inference:

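Run the mask head on the final detections (after box regression and NMS), not on every proposal
Select the mask channel corresponding to the predicted class
Apply sigmoid, then resize the 28×28 probabilities to the detected box size
Threshold at 0.5 and paste into the full image canvas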
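A minimal sketch of the inference path for one detection is shown below. It resizes the probabilities before thresholding, which is the order used in the original paper. The helper names and the integer `(x1, y1, x2, y2)` box convention (x2 and y2 exclusive) are assumptions for illustration; production code vectorizes these loops.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def resize_bilinear(src, out_h, out_w):
    """Resize a 2D grid with bilinear interpolation (pixel centers aligned)."""
    in_h, in_w = len(src), len(src[0])
    out = []
    for i in range(out_h):
        # Map the output pixel center back into source coordinates.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(y)
        y1, dy = min(y0 + 1, in_h - 1), y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(x)
            x1, dx = min(x0 + 1, in_w - 1), x - x0
            top = src[y0][x0] * (1 - dx) + src[y0][x1] * dx
            bottom = src[y1][x0] * (1 - dx) + src[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


def paste_mask(mask_logits, pred_class, box, image_h, image_w, threshold=0.5):
    """Turn [K][H][W] logits for one detection into a full-image binary mask."""
    x1, y1, x2, y2 = box
    # 1) keep only the predicted class's channel, 2) convert to probabilities
    probs = [[sigmoid(z) for z in row] for row in mask_logits[pred_class]]
    # 3) resize to the detected box size
    resized = resize_bilinear(probs, y2 - y1, x2 - x1)
    # 4) threshold and paste, clipping to the image bounds
    canvas = [[0] * image_w for _ in range(image_h)]
    for i, row in enumerate(resized):
        for j, p in enumerate(row):
            yy, xx = y1 + i, x1 + j
            if 0 <= yy < image_h and 0 <= xx < image_w and p > threshold:
                canvas[yy][xx] = 1
    return canvas
```

Thresholding after the resize keeps the interpolated boundary smooth; thresholding first would upscale a blocky 28×28 binary mask.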

Semantic vs Instance Segmentation

Aspect	Semantic	Instance
Distinguishes instances?	No	Yes
Same-class objects	Same label	Different IDs
Handles overlap?	No	Yes
Output	H×W label map	N masks per image
Typical architecture	FCN, U-Net, DeepLab	Mask R-CNN, SOLO

Panoptic segmentation = semantic + instance combined (every pixel labeled + every instance identified).
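The relationship between the two output formats can be shown by collapsing instance masks into a semantic label map. This is an illustrative sketch with a hypothetical function name; where instances overlap, it simply lets later instances overwrite earlier ones.

```python
def instances_to_semantic(instance_masks, class_ids, height, width, background=0):
    """Collapse N binary instance masks into one H x W class-label map."""
    label_map = [[background] * width for _ in range(height)]
    for mask, cls in zip(instance_masks, class_ids):
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    label_map[y][x] = cls  # instance identity is discarded here
    return label_map


# Two separate objects of the same class (class id 1) in a 1 x 4 image.
mask_a = [[1, 1, 0, 0]]
mask_b = [[0, 0, 1, 1]]
semantic = instances_to_semantic([mask_a, mask_b], [1, 1], height=1, width=4)
# semantic == [[1, 1, 1, 1]]: the two objects are no longer distinguishable
```

Going the other way is not possible in general: a semantic map does not record where one same-class object ends and the next begins.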

Loss Function

$$\mathcal{L}_{total} = \mathcal{L}_{rpn\_cls} + \mathcal{L}_{rpn\_reg} + \mathcal{L}_{cls} + \mathcal{L}_{reg} + \mathcal{L}_{mask}$$

The mask loss $\mathcal{L}_{mask}$ is sigmoid BCE (not softmax CE):

Each of the K masks is predicted independently
Classes do not compete for a pixel, so each channel learns a mask for its own class only
During training, only the GT-class channel contributes to the loss, so the other K−1 channels add no noise for that RoI
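Written out for a single positive RoI, the mask loss can be sketched as the average per-pixel binary cross-entropy on the GT-class channel. The symbols are notation introduced here for illustration: $m = 28$ is the mask side length, $k$ is the GT class, $z^{k}_{ij}$ is the logit at pixel $(i, j)$ of channel $k$, $y_{ij} \in \{0, 1\}$ is the target mask, and $\sigma$ is the sigmoid.

$$\mathcal{L}_{mask} = -\frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \left[ y_{ij} \log \sigma(z^{k}_{ij}) + (1 - y_{ij}) \log\left(1 - \sigma(z^{k}_{ij})\right) \right]$$

Channels other than $k$ do not appear in the sum, which is why they receive no gradient from this RoI.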

Modern Variants

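Model	Key idea	Speed / accuracy note
Mask R-CNN	Baseline	~5 FPS (ResNet-50)
SOLO	No RoIs, direct per-position masks	~10 FPS
SOLOv2	Dynamic convolutions	~15 FPS
PointRend	Render masks at uncertain boundary points	+1-2 mask AP over Mask R-CNN
Mask2Former	Transformer-based, universal segmentation	State of the art at publication

FPS figures are approximate and depend on GPU, input resolution, and implementation.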

Interview Questions

Q: Why does Mask R-CNN predict K masks (one per class) instead of 1 mask with K classes?

A: Using K independent binary masks decouples mask prediction from classification. Each sigmoid mask does not compete with other classes; it only asks "is this pixel part of class k?". A per-pixel softmax over K classes would force the mask head to classify every pixel as well as segment it, entangling the two tasks. Classification is left to the box head, and at inference only the predicted class's mask is used. In the original paper's ablation (ResNet-50-C4), per-class sigmoid masks beat a per-pixel softmax by about 5.5 mask AP (30.3 vs 24.8). A single class-agnostic mask was nearly as accurate (29.7 AP), so the gain comes from the decoupling rather than from having K channels.

Q: How does Mask R-CNN handle overlapping instances?

A: Each detection gets its own mask, predicted independently inside its box, so masks can overlap in image coordinates. Mask R-CNN itself does not resolve these overlaps. If a single label per pixel is needed, a post-processing step typically lets the highest-confidence instance "win" each contested pixel; otherwise both masks are kept and the caller resolves the overlap (e.g., rendering order by confidence). Heavily overlapping objects (like stacked items) remain a weakness, because box-level NMS can suppress one of two valid detections. SOLO-style methods, which separate instances by position rather than by box, are often reported to cope better in such cases.
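The "highest-confidence instance wins" rule can be sketched as a small post-processing step. This is one common convention, not part of Mask R-CNN itself; the function name is hypothetical and the list of masks is assumed to be non-empty.

```python
def resolve_overlaps(masks, scores):
    """Build an H x W map of instance indices (-1 = background).

    masks:  list of [H][W] binary masks in full-image coordinates.
    scores: one confidence score per mask.
    Where masks overlap, the instance with the highest score keeps the pixel.
    """
    height, width = len(masks[0]), len(masks[0][0])
    id_map = [[-1] * width for _ in range(height)]
    # Paint from lowest to highest score so higher scores overwrite lower ones.
    order = sorted(range(len(masks)), key=lambda i: scores[i])
    for idx in order:
        for y in range(height):
            for x in range(width):
                if masks[idx][y][x]:
                    id_map[y][x] = idx
    return id_map


masks = [[[1, 1, 0]], [[0, 1, 1]]]  # the two instances overlap at the middle pixel
scores = [0.9, 0.6]
id_map = resolve_overlaps(masks, scores)
# id_map == [[0, 0, 1]]: instance 0 (score 0.9) wins the contested pixel
```

Keeping the raw overlapping masks alongside the resolved map lets downstream code choose a different policy later.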

Q: What is the typical training procedure for Mask R-CNN on a custom dataset?

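A: (1) Start from COCO-pretrained weights (backbone, FPN, RPN, and heads all pretrained), replacing the final box and mask predictors to match the new number of classes; (2) Fine-tune all components with discriminative LRs (backbone 0.1× LR, heads 1× LR); (3) Use a 1× or 3× training schedule (roughly 12 or 36 epochs on COCO; scale the iteration count to your dataset size); (4) Data augmentation: horizontal flip, multi-scale training (480-800px shorter edge), optional mosaic. For small datasets (< 1000 images), freeze BatchNorm layers. The exact call depends on the framework: model.backbone.body.freeze_bn() is illustrative, and torchvision's pretrained detection backbones already use FrozenBatchNorm2d. If BN is left trainable, use a batch size of at least 2, since BN statistics are unreliable at batch size 1.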
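Step (2) can be sketched as a helper that splits parameters into two optimizer groups by name. This is a minimal, framework-agnostic sketch: `build_param_groups` is a hypothetical helper, and the `"backbone."` prefix is an assumption that matches torchvision's Mask R-CNN parameter naming (where the FPN also lives under `backbone.`, so it would get the reduced LR too).

```python
def build_param_groups(named_params, base_lr, backbone_scale=0.1,
                       backbone_prefix="backbone."):
    """Split (name, param) pairs into backbone and head groups with different LRs."""
    backbone, heads = [], []
    for name, param in named_params:
        if name.startswith(backbone_prefix):
            backbone.append(param)
        else:
            heads.append(param)  # RPN, box head, mask head, ...
    return [
        {"params": backbone, "lr": base_lr * backbone_scale},
        {"params": heads, "lr": base_lr},
    ]


# Stand-in (name, param) pairs; the names here are illustrative only.
demo = [
    ("backbone.body.conv1.weight", "w0"),
    ("rpn.head.conv.weight", "w1"),
    ("roi_heads.mask_head.0.weight", "w2"),
]
groups = build_param_groups(demo, base_lr=0.02)
```

With PyTorch, the same function can be fed `model.named_parameters()` (filtered to parameters with `requires_grad=True`), and the returned list passed as the first argument to an optimizer such as `torch.optim.SGD`.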
