Lab 04: Mask R-CNN — Instance Segmentation

Overview

Mask R-CNN extends Faster R-CNN by adding a mask branch: a small FCN (Fully Convolutional Network) that predicts a binary segmentation mask for each detected object independently.

Faster R-CNN Head
├── Box classifier (C+1 classes)
├── Box regressor (C×4 deltas)
└── [NEW] Mask head: FCN → 28×28 binary mask per class

The key insight: decouple mask prediction from class prediction. The mask head predicts K masks for each proposal, one per class (K is the number of object classes, written C in the diagram above), but only the mask corresponding to the predicted class is used at inference.


Architecture

Input Image
     │
  FPN Backbone (ResNet-50-FPN or ResNet-101-FPN)
     │
  RPN → Region Proposals
     │
  RoI Align (7×7 for box head, 14×14 for mask head)
     │
  ┌──────────────────────────────┐
  │ Box Head (FC layers)         │ → class scores + box deltas
  └──────────────────────────────┘
  ┌──────────────────────────────┐
  │ Mask Head (FCN)              │ → K × 28×28 masks
  │ 4× (256-ch conv3×3 → ReLU)  │
  │ Transposed conv 2× upsample  │
  │ 1×1 conv → K binary masks    │
  └──────────────────────────────┘
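
The spatial sizes in the diagram can be checked with the standard convolution size formulas. The sketch below is plain Python rather than a framework model, and it assumes padding 1 on the 3×3 convs and a 2×2, stride-2 transposed conv; those settings are common choices, not stated in this lab, and the helper names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output spatial size of a standard convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride=1, padding=0):
    """Output spatial size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


def mask_head_shapes(num_classes, roi_size=14, channels=256):
    """Trace (channels, height, width) through the mask head."""
    shapes = [("roi_align", (channels, roi_size, roi_size))]
    size = roi_size
    for i in range(4):
        # Assumed: 3x3 conv with padding 1, which keeps the spatial size.
        size = conv_out(size, kernel=3, padding=1)
        shapes.append((f"conv{i + 1}", (channels, size, size)))
    # Assumed: 2x2 transposed conv with stride 2, which doubles the size.
    size = deconv_out(size, kernel=2, stride=2)
    shapes.append(("deconv", (channels, size, size)))
    # 1x1 conv maps the 256 channels to one logit map per class.
    size = conv_out(size, kernel=1)
    shapes.append(("mask_logits", (num_classes, size, size)))
    return shapes


for name, shape in mask_head_shapes(num_classes=80):
    print(f"{name:12s} {shape}")
```

With a 14×14 RoI Align output, the four convs stay at 14×14 and the upsampling step is what produces the 28×28 mask resolution.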

Why RoI Align (not RoI Pooling) is critical for masks

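For bounding boxes, a 1-2 pixel misalignment is tolerable. For segmentation masks at 28×28 resolution, even a half-pixel misalignment causes boundary artifacts. RoI Pooling rounds RoI coordinates and bin edges to the feature-map grid, and each rounding step shifts the extracted features relative to the image. RoI Align skips the rounding and samples the feature map at exact fractional locations with bilinear interpolation, which keeps the mask aligned with the pixels it describes.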
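
To make the difference concrete, here is a minimal sketch of sampling one fractional location with bilinear interpolation next to a snapped lookup. It uses plain Python lists instead of tensors; `quantized_sample` is a simplified stand-in for the coordinate rounding in RoI Pooling, and full RoI Align additionally averages several such samples per output bin. Both function names are hypothetical.

```python
import math


def bilinear_sample(feature, y, x):
    """Sample a 2D grid (list of rows) at a fractional (y, x) location."""
    h, w = len(feature), len(feature[0])
    # Clamp so the four neighbours stay inside the grid.
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Interpolate along x on the two rows, then along y between them.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feature, y, x):
    """Simplified quantized lookup: snap the coordinate to the grid first."""
    return feature[int(math.floor(y))][int(math.floor(x))]


# A horizontal ramp: each value equals its column index.
ramp = [[float(c) for c in range(4)] for _ in range(4)]
print(bilinear_sample(ramp, 1.5, 2.5))   # 2.5, halfway between columns 2 and 3
print(quantized_sample(ramp, 1.5, 2.5))  # 2.0, the half-pixel offset is lost
```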


Mask Prediction Details

For each RoI, the mask head outputs a K×28×28 tensor of logits: one 28×28 map per class.

Training: For each proposal:

  1. Only train the mask branch for GT-matched positive proposals (IoU ≥ 0.5 with a GT box)
  2. Use the GT class to select which mask channel to compute loss on
  3. Loss: per-pixel sigmoid BCE against the GT mask, cropped to the proposal and resized to 28×28
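
The training steps for one positive proposal can be sketched as follows. This is a dependency-free illustration, not a framework implementation: `mask_bce_loss` is a hypothetical helper, the inputs are nested lists rather than tensors, and the target is assumed to be already cropped and resized to the mask resolution.

```python
import math


def mask_bce_loss(mask_logits, gt_class, gt_mask):
    """Mean sigmoid BCE over the GT-class channel of one positive RoI.

    mask_logits: list of K channels, each an m x m list of logits
    gt_class:    index of the ground-truth class (selects the channel)
    gt_mask:     m x m list of 0/1 targets, already cropped to the RoI
    """
    logits = mask_logits[gt_class]  # the other K-1 channels get no loss
    total, count = 0.0, 0
    for logit_row, target_row in zip(logits, gt_mask):
        for z, t in zip(logit_row, target_row):
            # Numerically stable form of -[t*log(s(z)) + (1-t)*log(1-s(z))].
            total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
            count += 1
    return total / count


# Tiny 2-class, 2x2 example (real masks are 28x28).
zeros = [[0.0, 0.0], [0.0, 0.0]]
other = [[9.0, -9.0], [3.0, 1.0]]
target = [[1, 0], [0, 1]]
print(mask_bce_loss([zeros, other], gt_class=0, gt_mask=target))
```

Changing the values in `other` leaves the result unchanged, which is the point of step 2: only the GT-class channel receives gradient.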

Inference:

  1. Select the mask channel corresponding to the predicted class
  2. Apply sigmoid to get per-pixel probabilities
  3. Resize from 28×28 to the size of the predicted bounding box
  4. Threshold at 0.5 and paste into the full image canvas
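
A minimal sketch of the inference path for a single detection follows. It resizes the sigmoid probabilities first and thresholds afterwards, the order described in the original Mask R-CNN paper, which avoids the blocky edges that come from enlarging an already binarized mask. The helper names, the list-based data layout, and the integer box with exclusive `x2`/`y2` are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def resize_bilinear(grid, out_h, out_w):
    """Bilinear resize of a 2D list using pixel-centre alignment."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        # Map the output pixel centre back into input coordinates.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            top = grid[y0][x0] * (1 - dx) + grid[y0][x1] * dx
            bottom = grid[y1][x0] * (1 - dx) + grid[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


def paste_mask(mask_logits, pred_class, box, image_h, image_w, threshold=0.5):
    """Turn one RoI's K x m x m logits into a full-image binary mask.

    box is (x1, y1, x2, y2) in integer pixels, with x2 and y2 exclusive.
    """
    # Steps 1-2: pick the predicted class's channel and apply sigmoid.
    probs = [[sigmoid(z) for z in row] for row in mask_logits[pred_class]]
    # Step 3: resize the soft mask to the box size.
    x1, y1, x2, y2 = box
    resized = resize_bilinear(probs, y2 - y1, x2 - x1)
    # Step 4: threshold and paste, clipping to the image bounds.
    canvas = [[0] * image_w for _ in range(image_h)]
    for i, row in enumerate(resized):
        for j, p in enumerate(row):
            yy, xx = y1 + i, x1 + j
            if 0 <= yy < image_h and 0 <= xx < image_w:
                canvas[yy][xx] = 1 if p >= threshold else 0
    return canvas
```

Production implementations do the same thing on batched tensors and handle sub-pixel box coordinates, which this sketch leaves out.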

Semantic vs Instance Segmentation

| | Semantic | Instance |
|---|---|---|
| Distinguishes instances? | No | Yes |
| Same-class objects | Same label | Different IDs |
| Handles overlap? | No | Yes |
| Output | H×W label map | N masks per image |
| Typical architecture | FCN, U-Net, DeepLab | Mask R-CNN, SOLO |

Panoptic segmentation = semantic + instance combined (every pixel labeled + every instance identified).
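
The difference in output format can be illustrated by collapsing instance masks into a semantic label map. In this small sketch, `instances_to_semantic` is a hypothetical helper, and later instances simply overwrite earlier ones where they overlap.

```python
def instances_to_semantic(instances, height, width, background=0):
    """Collapse (class_id, mask) pairs into one H x W class-label map."""
    label_map = [[background] * width for _ in range(height)]
    for class_id, mask in instances:
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    label_map[y][x] = class_id
    return label_map


# Two separate objects of the same class (class 1) on a 1 x 5 strip.
obj_a = (1, [[1, 1, 0, 0, 0]])
obj_b = (1, [[0, 0, 0, 1, 1]])
semantic = instances_to_semantic([obj_a, obj_b], height=1, width=5)
print(semantic)  # [[1, 1, 0, 1, 1]]
```

The instance representation keeps two masks, while the label map only records that both regions belong to class 1; the instance identities cannot be recovered from it.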


Loss Function

$$\mathcal{L}_{total} = \mathcal{L}_{rpn\_cls} + \mathcal{L}_{rpn\_reg} + \mathcal{L}_{cls} + \mathcal{L}_{reg} + \mathcal{L}_{mask}$$

The mask loss $\mathcal{L}_{mask}$ is sigmoid BCE (not softmax CE):

  • Each of the K masks is predicted independently
  • No competition between classes: each channel learns a class-specific mask without having to suppress the others
  • During training: only use mask for GT class → no noise from other classes
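
Written out for a single positive RoI with ground-truth class $k$ and an $m \times m$ target ($m = 28$ here), the mask loss is the mean per-pixel binary cross-entropy on channel $k$ alone. The symbols $z^{k}_{ij}$ (logit of channel $k$ at pixel $(i, j)$), $y_{ij} \in \{0, 1\}$ (target mask) and $\sigma$ (sigmoid) are notation introduced here for illustration:

$$\mathcal{L}_{mask} = -\frac{1}{m^2} \sum_{i=1}^{m} \sum_{j=1}^{m} \left[ y_{ij} \log \sigma\left(z^{k}_{ij}\right) + \left(1 - y_{ij}\right) \log\left(1 - \sigma\left(z^{k}_{ij}\right)\right) \right]$$

The logits of the other $K - 1$ channels do not appear in the expression, so they receive no gradient from this RoI.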

Modern Variants

| Model | Improvement | Speed / accuracy note |
|---|---|---|
| Mask R-CNN | Baseline | ~5 FPS (ResNet-50) |
| SOLO | No RoIs, direct per-position masks | ~10 FPS |
| SOLOv2 | Dynamic convolutions | ~15 FPS |
| PointRend | Render masks at uncertain boundary points | +1-2 mAP |
| Mask2Former | Transformer-based, universal segmentation | SOTA |

Interview Questions

Q: Why does Mask R-CNN predict K masks (one per class) instead of 1 mask with K classes?

A: Using K independent binary masks decouples mask prediction from classification. Each sigmoid channel only asks "is this pixel part of an object of class k?" and does not compete with the other classes. A per-pixel softmax would force the mask head to classify every pixel as well, entangling segmentation with classification. At inference, only the channel of the predicted class is used. In the original paper's ablation, per-class sigmoid masks beat a per-pixel softmax by roughly 5.5 mask AP, while a single class-agnostic mask was nearly as accurate as K class-specific masks.

Q: How does Mask R-CNN handle overlapping instances?

A: Each detection gets its own mask, predicted independently inside its box, so masks from different instances can overlap in image coordinates. Mask R-CNN itself does not resolve these overlaps: box-level NMS removes duplicate detections, and the remaining masks are returned as they are. If a single label per pixel is needed, the caller resolves the overlap, typically by letting the highest-confidence instance "win" each pixel or by rendering in confidence order. Heavily overlapping objects (such as stacked items) remain a weakness, since strongly overlapping boxes can be suppressed by NMS; box-free methods such as SOLO separate instances by position instead and avoid that particular failure mode.
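
One caller-side way to resolve overlaps, the "highest confidence wins" rule, can be sketched like this. `resolve_overlaps` is a hypothetical helper operating on nested lists; it is a post-processing step and not part of Mask R-CNN itself.

```python
def resolve_overlaps(instances, height, width):
    """Assign each pixel to the highest-scoring instance that covers it.

    instances: list of (score, mask), with mask an H x W list of 0/1.
    Returns an H x W map of indices into `instances`; -1 means background.
    """
    owner = [[-1] * width for _ in range(height)]
    # Visit instances from lowest to highest score so higher scores overwrite.
    order = sorted(range(len(instances)), key=lambda i: instances[i][0])
    for idx in order:
        _, mask = instances[idx]
        for y in range(height):
            for x in range(width):
                if mask[y][x]:
                    owner[y][x] = idx
    return owner


# Two instances sharing the middle pixel of a 1 x 3 strip.
front = (0.9, [[1, 1, 0]])
back = (0.6, [[0, 1, 1]])
print(resolve_overlaps([front, back], height=1, width=3))  # [[0, 0, 1]]
```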

Q: What is the typical training procedure for Mask R-CNN on a custom dataset?

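A: (1) Start from COCO-pretrained weights (backbone, FPN, RPN and heads all pretrained); (2) Fine-tune all components with discriminative learning rates (backbone at 0.1× the base LR, heads at 1×); (3) Use a 1× or 3× training schedule (roughly 12 or 36 epochs on COCO); (4) Data augmentation: horizontal flip, multi-scale training (shorter edge sampled from 480-800 px), optionally mosaic. For small datasets (< 1000 images), keep BatchNorm layers frozen, because batch statistics estimated from one or two images per step are too noisy to update reliably; torchvision's detection models, for example, already use FrozenBatchNorm2d in the backbone when pretrained weights are loaded. If any normalization layers are left trainable, use a batch size of at least 2.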
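
Step (2) can be sketched as a small helper that builds optimizer parameter groups. This is an illustration under assumptions: `build_param_groups` is a hypothetical name, and it assumes backbone parameters are named with a `backbone.` prefix, as in torchvision's detection models (where the FPN also lives under that prefix). The returned list of dicts follows the per-group format that PyTorch optimizers accept.

```python
def build_param_groups(named_params, base_lr, backbone_mult=0.1):
    """Split parameters into backbone and non-backbone groups with separate LRs.

    named_params: iterable of (name, param) pairs,
                  e.g. model.named_parameters() in PyTorch.
    """
    backbone, heads = [], []
    for name, param in named_params:
        # Assumed naming convention: backbone parameters start with "backbone."
        if name.startswith("backbone."):
            backbone.append(param)
        else:
            heads.append(param)
    return [
        {"params": backbone, "lr": base_lr * backbone_mult},
        {"params": heads, "lr": base_lr},
    ]
```

In PyTorch the result would typically be passed as the first argument of an optimizer, for example `torch.optim.SGD(groups, lr=base_lr, momentum=0.9)`, where the per-group `lr` entries override the default.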