Lab 04: Mask R-CNN — Instance Segmentation
Overview
Mask R-CNN extends Faster R-CNN by adding a mask branch: a small FCN (Fully Convolutional Network) that predicts a binary segmentation mask for each detected object independently.
Faster R-CNN Head
├── Box classifier (K+1 classes: K object classes + background)
├── Box regressor (K×4 deltas)
└── [NEW] Mask head: FCN → 28×28 binary mask per class
The key insight: decouple mask prediction from class prediction. The mask head predicts K masks (one per class) for each proposal, but only the mask corresponding to the predicted class is used at inference.
Architecture
Input Image
│
FPN Backbone (ResNet-50-FPN or ResNet-101-FPN)
│
RPN → Region Proposals
│
RoI Align (7×7 for box head, 14×14 for mask head)
│
┌──────────────────────────────┐
│ Box Head (FC layers) │ → class scores + box deltas
└──────────────────────────────┘
┌──────────────────────────────┐
│ Mask Head (FCN) │ → K × 28×28 masks
│ 4× (256-ch conv3×3 → ReLU) │
│ Transposed conv 2× upsample │
│ 1×1 conv → K binary masks │
└──────────────────────────────┘
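The spatial sizes in the mask head can be checked with the standard convolution output-size formulas. The sketch below is only shape arithmetic, not a trained model; it assumes padding 1 on the 3×3 convs and a 2×2, stride-2 transposed conv (neither is stated in the diagram), and the function names are made up for this illustration.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a standard convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride=1, padding=0):
    """Output size of a transposed convolution along one spatial axis."""
    return (size - 1) * stride - 2 * padding + kernel


def mask_head_shapes(num_classes, roi_size=14, channels=256):
    """Trace (layer name, channels, spatial size) through the mask head."""
    shapes = [("roi_align", channels, roi_size)]
    size = roi_size
    for i in range(4):
        # Assumed: 3x3 conv with padding 1, so the spatial size is preserved.
        size = conv_out(size, kernel=3, padding=1)
        shapes.append((f"conv3x3_{i + 1}", channels, size))
    # Assumed: 2x2 transposed conv with stride 2, which doubles the size.
    size = deconv_out(size, kernel=2, stride=2)
    shapes.append(("deconv2x2", channels, size))
    # 1x1 conv maps the 256 channels to one mask channel per class.
    size = conv_out(size, kernel=1)
    shapes.append(("conv1x1", num_classes, size))
    return shapes


for name, ch, size in mask_head_shapes(num_classes=80):
    print(f"{name:12s} {ch:4d} x {size} x {size}")
```

With a 14×14 RoI Align output, the trace ends at K channels of 28×28, matching the diagram.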
Why RoI Align (not RoI Pooling) is critical for masks
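For bounding boxes, a 1-2 pixel misalignment is tolerable. For segmentation masks at 28×28 resolution, even half-pixel misalignment causes boundary artifacts. RoI Pooling rounds RoI coordinates and bin edges to the feature-map grid, which shifts the sampled features relative to the input. RoI Align skips the rounding and samples each bin at exact fractional locations with bilinear interpolation, so the features stay aligned with the pixels the mask is supposed to cover.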
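To make the difference concrete, here is a minimal pure-Python sketch that reads a feature map at a fractional location in two ways: bilinear interpolation (the sampling RoI Align relies on) and a nearest-cell lookup standing in for coordinate quantization. The helper names and the toy ramp feature map are assumptions for illustration; a real RoI Align also averages several sample points per bin.

```python
import math


def bilinear_sample(feature, y, x):
    """Read a 2-D feature map (list of rows) at a fractional (y, x) location."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def quantized_sample(feature, y, x):
    """Snap the coordinate to the nearest cell, as a stand-in for quantization."""
    return feature[int(round(y))][int(round(x))]


# Toy feature map whose value equals the x coordinate (a horizontal ramp).
ramp = [[float(x) for x in range(4)] for _ in range(4)]

exact = bilinear_sample(ramp, 1.0, 1.25)    # follows the ramp at x = 1.25
snapped = quantized_sample(ramp, 1.0, 1.25)  # reads the cell at x = 1
print(exact, snapped)
```

On the ramp, the interpolated read tracks the fractional coordinate while the snapped read loses a quarter of a cell, which is the kind of offset that shows up as boundary error in a 28×28 mask.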
Mask Prediction Details
The mask head predicts logits for K classes, size 28×28 per RoI.
Training: For each proposal:
- Only train the mask branch for GT-matched positive proposals (IoU > 0.5)
- Use the GT class to select which mask channel to compute loss on
- Loss: sigmoid BCE on 28×28 binary mask
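The training rules above can be sketched in NumPy as follows. This is an illustrative sketch, not the lab's reference code: it assumes the inputs are already restricted to positive proposals, that GT masks have been cropped and resized to 28×28, and that channel k corresponds to foreground class k (0-based). The name `mask_loss` is hypothetical.

```python
import numpy as np


def mask_loss(mask_logits, gt_classes, gt_masks):
    """Mean sigmoid BCE on the GT-class channel only.

    mask_logits: (N, K, 28, 28) raw outputs for N positive proposals
    gt_classes:  (N,) integer class index per proposal (assumed 0-based)
    gt_masks:    (N, 28, 28) binary targets in {0, 1}
    """
    n = mask_logits.shape[0]
    if n == 0:
        return 0.0  # no positive proposals, so no mask loss
    # Pick one channel per proposal: the channel of its ground-truth class.
    selected = mask_logits[np.arange(n), gt_classes]  # (N, 28, 28)
    # Numerically stable BCE-with-logits: max(x, 0) - x*z + log(1 + exp(-|x|))
    per_pixel = (
        np.maximum(selected, 0.0)
        - selected * gt_masks
        + np.log1p(np.exp(-np.abs(selected)))
    )
    return float(per_pixel.mean())


logits = np.zeros((2, 3, 28, 28))
classes = np.array([0, 2])
targets = np.ones((2, 28, 28))
print(mask_loss(logits, classes, targets))
```

Because only the selected channel enters the loss, logits in the other K-1 channels receive no gradient from that proposal.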
Inference:
- Select mask channel corresponding to predicted class
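- Apply sigmoid to get per-pixel probabilities
- Resize the 28×28 probability map to the size of the predicted box, then binarize at 0.5 (thresholding after the resize keeps boundaries smoother than resizing an already-binary mask)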
- Paste into full image canvas
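A minimal NumPy sketch of these inference steps for a single detection is shown below. It resizes the soft probabilities with bilinear interpolation and thresholds afterwards, which is the order described in the Mask R-CNN paper. The function name `paste_mask`, the integer `(x0, y0, x1, y1)` box format, and the pixel-center sampling convention are assumptions made for this example.

```python
import numpy as np


def paste_mask(mask_logits, label, box, image_hw, threshold=0.5):
    """Turn one RoI's (K, M, M) mask logits into a full-image boolean mask."""
    probs = 1.0 / (1.0 + np.exp(-mask_logits[label]))  # predicted-class channel
    m = probs.shape[0]
    x0, y0, x1, y1 = box
    bw, bh = max(x1 - x0, 1), max(y1 - y0, 1)

    # Map each output pixel center in the box to a location on the M x M grid.
    ys = np.clip((np.arange(bh) + 0.5) * m / bh - 0.5, 0, m - 1)
    xs = np.clip((np.arange(bw) + 0.5) * m / bw - 0.5, 0, m - 1)
    y_lo, x_lo = np.floor(ys).astype(int), np.floor(xs).astype(int)
    y_hi, x_hi = np.minimum(y_lo + 1, m - 1), np.minimum(x_lo + 1, m - 1)
    wy, wx = (ys - y_lo)[:, None], (xs - x_lo)[None, :]

    # Bilinear resize of the probability map to (bh, bw).
    top = probs[y_lo][:, x_lo] * (1 - wx) + probs[y_lo][:, x_hi] * wx
    bottom = probs[y_hi][:, x_lo] * (1 - wx) + probs[y_hi][:, x_hi] * wx
    resized = top * (1 - wy) + bottom * wy

    # Paste into the canvas, clipping the box to the image bounds.
    h, w = image_hw
    canvas = np.zeros((h, w), dtype=bool)
    cx0, cy0 = max(x0, 0), max(y0, 0)
    cx1, cy1 = min(x0 + bw, w), min(y0 + bh, h)
    if cx1 > cx0 and cy1 > cy0:
        crop = resized[cy0 - y0:cy1 - y0, cx0 - x0:cx1 - x0]
        canvas[cy0:cy1, cx0:cx1] = crop >= threshold
    return canvas


logits = np.full((3, 28, 28), -10.0)
logits[1] = 10.0  # class 1 says "foreground everywhere"
full = paste_mask(logits, label=1, box=(2, 3, 12, 9), image_hw=(20, 20))
print(full.sum())
```

Production implementations batch this over all detections and typically use a library resize, but the per-detection logic is the same.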
Semantic vs Instance Segmentation
| | Semantic | Instance |
|---|---|---|
| Distinguishes instances? | No | Yes |
| Same-class objects | Same label | Different IDs |
| Handles overlap? | No | Yes |
| Output | H×W label map | N masks per image |
| Typical architecture | FCN, U-Net, DeepLab | Mask R-CNN, SOLO |
Panoptic segmentation = semantic + instance combined (every pixel labeled + every instance identified).
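The difference in output format can be seen by collapsing instance masks into a semantic label map. The short sketch below assumes N boolean masks of shape (H, W) with one class label each and uses 0 for background; `instances_to_semantic` is a name invented for this example.

```python
import numpy as np


def instances_to_semantic(masks, labels, background=0):
    """Collapse (N, H, W) boolean instance masks into one (H, W) label map."""
    semantic = np.full(masks.shape[1:], background, dtype=np.int64)
    for mask, label in zip(masks, labels):
        semantic[mask] = label  # instance identity is discarded here
    return semantic


masks = np.zeros((2, 4, 6), dtype=bool)
masks[0, 1:3, 0:2] = True  # first object
masks[1, 1:3, 4:6] = True  # second object, same class
labels = [3, 3]

semantic = instances_to_semantic(masks, labels)
print(semantic)
```

The instance output holds two separate masks, while the label map only records that class 3 is present at those pixels; going back from the label map to the two objects is not possible in general, for instance when the objects touch.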
Loss Function
$$\mathcal{L}_{total} = \mathcal{L}_{rpn\_cls} + \mathcal{L}_{rpn\_reg} + \mathcal{L}_{cls} + \mathcal{L}_{reg} + \mathcal{L}_{mask}$$
The mask loss $\mathcal{L}_{mask}$ is sigmoid BCE (not softmax CE):
- Each of the K masks is predicted independently
- Classes do not compete for a pixel, so each channel is free to learn its own class-specific mask
- During training: only use mask for GT class → no noise from other classes
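The independence claim can be checked numerically for a single pixel. In this small sketch (plain Python, made-up logit values), raising the logit of another class leaves the sigmoid output for class 0 untouched but lowers its softmax probability.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Per-class logits at one pixel; only the class-1 logit differs between the two.
before = [2.0, 0.0, -1.0]
after = [2.0, 4.0, -1.0]

# Sigmoid: class 0 is scored on its own logit only.
print(sigmoid(before[0]), sigmoid(after[0]))

# Softmax: class 0's probability drops because class 1 became more confident.
print(softmax(before)[0], softmax(after)[0])
```

This coupling is what the per-class sigmoid avoids: the mask channel only has to answer whether the pixel belongs to the object, and the box head decides the class.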
Modern Variants
| Model | Improvement | Speed / accuracy |
|---|---|---|
| Mask R-CNN | Baseline | ~5 FPS (ResNet-50) |
| SOLO | No RoIs, direct per-position masks | ~10 FPS |
| SOLOv2 | Dynamic convolutions | ~15 FPS |
| PointRend | Render masks at uncertain boundary points | +1-2 mAP |
| Mask2Former | Transformer-based, universal segmentation | SOTA |
Interview Questions
Q: Why does Mask R-CNN predict K masks (one per class) instead of 1 mask with K classes?
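A: Using K independent binary masks decouples mask prediction from classification. Each binary sigmoid mask doesn't need to "compete" with other classes; it only asks "is this pixel part of class k?". A per-pixel softmax over K classes would force the mask head to classify every pixel as well, entangling segmentation with classification. The class decision is left to the box head, and at inference only the predicted class's mask is used. In the original paper's ablation (ResNet-50-C4), independent sigmoid masks outperformed the per-pixel softmax variant by about 5.5 mask AP. A single class-agnostic mask was nearly as accurate as K class-specific masks (29.7 vs 30.3 mask AP), which suggests the gain comes from the decoupling rather than from having K channels.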
Q: How does Mask R-CNN handle overlapping instances?
A: Each proposal generates an independent mask crop. The model processes them separately, so masks can overlap in image coordinates. At output, overlapping masks are handled by confidence — typically the highest-confidence instance "wins" each pixel, or both masks are kept and the caller resolves the overlap (e.g., rendering order by confidence). Occlusion handling for heavily overlapping objects (like stacked items) remains a weakness; SOLO-based methods handle it better via position-based instance separation.
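One way a caller might implement the "highest-confidence instance wins" rule is sketched below in NumPy. This is post-processing outside the model, and the function name, the -1 background value, and the input layout are assumptions for illustration.

```python
import numpy as np


def resolve_overlaps(masks, scores):
    """Assign each pixel to the highest-scoring instance that covers it.

    masks:  (N, H, W) boolean full-image instance masks
    scores: (N,) detection confidences
    Returns an (H, W) int map holding the winning instance index, or -1.
    """
    id_map = np.full(masks.shape[1:], -1, dtype=np.int64)
    # Paint from lowest to highest score so stronger detections overwrite weaker ones.
    for idx in np.argsort(scores):
        id_map[masks[idx]] = idx
    return id_map


masks = np.zeros((2, 4, 4), dtype=bool)
masks[0, 0:3, 0:3] = True  # instance 0
masks[1, 1:4, 1:4] = True  # instance 1, overlapping instance 0
scores = np.array([0.9, 0.6])

print(resolve_overlaps(masks, scores))
```

Keeping the raw overlapping masks and resolving them only for visualization or panoptic-style output preserves more information than committing to one owner per pixel early.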
Q: What is the typical training procedure for Mask R-CNN on a custom dataset?
A: (1) Start from COCO-pretrained weights (all backbone + FPN + RPN + heads pretrained); (2) Fine-tune all components with discriminative LRs (backbone 0.1× LR, heads 1× LR); (3) Use 1× or 3× training schedule (12 or 36 epochs on COCO); (4) Data augmentation: horizontal flip, multi-scale train (480-800px shorter edge), optional mosaic. For small datasets (< 1000 images), freeze BatchNorm layers (e.g., model.backbone.body.freeze_bn(), if your model class provides such a helper; torchvision's detection backbones already use FrozenBatchNorm2d by default). If BatchNorm is left trainable instead, use batch size ≥ 2, since batch statistics are unreliable with batch=1.
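The multi-scale training step (480-800px shorter edge) can be sketched as a size sampler. The 32-pixel step between candidate sizes and the 1333-pixel cap on the longer edge are assumptions borrowed from common detection configs, not values given above, and `sample_train_size` is a hypothetical helper.

```python
import random


def sample_train_size(height, width, min_sizes=range(480, 801, 32),
                      max_size=1333, rng=random):
    """Pick a random shorter-edge target and return the resized (height, width)."""
    target = rng.choice(list(min_sizes))
    scale = target / min(height, width)
    # Assumed cap: keep the longer edge within max_size for very elongated images.
    if max(height, width) * scale > max_size:
        scale = max_size / max(height, width)
    return round(height * scale), round(width * scale)


rng = random.Random(0)
for _ in range(3):
    print(sample_train_size(600, 800, rng=rng))
```

Boxes and GT masks must be rescaled by the same factor as the image, and a horizontal flip must mirror them too.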