Lab 03: U-Net — Semantic Segmentation
What is Semantic Segmentation?
Assign a class label to every pixel in an image (vs detection which predicts bounding boxes).
| Task | Output | Example |
|---|---|---|
| Classification | 1 label per image | "This is a cat" |
| Detection | Bounding boxes | "Cat at [x1,y1,x2,y2]" |
| Semantic segmentation | Label per pixel | Each pixel = car/road/sky |
| Instance segmentation | Label+ID per pixel | Car #1, Car #2, background |
U-Net Architecture
Originally designed for biomedical image segmentation (Ronneberger et al., 2015). It is now a common baseline for segmentation well beyond medical imaging.
Input (572×572×1) — or any (H×W×C)
│
Encoder (Contracting Path)
┌─────────────────────────────────────────┐
│ Block 1: 3×3 conv → 3×3 conv → MaxPool │ 64 channels → skip₁
│ Block 2: 3×3 conv → 3×3 conv → MaxPool │ 128 channels → skip₂
│ Block 3: 3×3 conv → 3×3 conv → MaxPool │ 256 channels → skip₃
│ Block 4: 3×3 conv → 3×3 conv → MaxPool │ 512 channels → skip₄
└─────────────────────────────────────────┘
│
Bottleneck: 3×3 conv → 3×3 conv │ 1024 channels
│
Decoder (Expanding Path)
┌───────────────────────────────────────────────────────────┐
│ Upsample 2× → concat(skip₄) → 3×3 conv → 3×3 conv │ 512 ch
│ Upsample 2× → concat(skip₃) → 3×3 conv → 3×3 conv │ 256 ch
│ Upsample 2× → concat(skip₂) → 3×3 conv → 3×3 conv │ 128 ch
│ Upsample 2× → concat(skip₁) → 3×3 conv → 3×3 conv │ 64 ch
└───────────────────────────────────────────────────────────┘
│
1×1 conv → N_classes channels → Softmax per pixel
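The diagram above can be sketched in PyTorch roughly as follows. This is a minimal sketch under stated assumptions, not the original implementation: it uses padded 3×3 convolutions so that skip tensors line up without cropping, it uses a 2×2 transposed convolution for "Upsample 2×" (the diagram does not say which upsampling is meant), and it omits batch normalization. The `base` parameter and the helper name `double_conv` are hypothetical conveniences.

```python
import torch
import torch.nn as nn


def double_conv(in_ch: int, out_ch: int) -> nn.Sequential:
    """Two 3x3 convs with ReLU. padding=1 keeps H and W unchanged (assumption)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )


class UNet(nn.Module):
    def __init__(self, in_channels: int = 1, n_classes: int = 2, base: int = 64):
        super().__init__()
        widths = [base, base * 2, base * 4, base * 8]  # 64, 128, 256, 512 by default
        self.pool = nn.MaxPool2d(2)

        # Encoder: one double conv per level; pooling is applied in forward().
        self.down = nn.ModuleList()
        ch = in_channels
        for w in widths:
            self.down.append(double_conv(ch, w))
            ch = w

        self.bottleneck = double_conv(ch, ch * 2)  # 1024 channels by default
        ch = ch * 2

        # Decoder: upsample, concatenate the matching skip, then double conv.
        self.up = nn.ModuleList()
        self.dec = nn.ModuleList()
        for w in reversed(widths):
            self.up.append(nn.ConvTranspose2d(ch, w, kernel_size=2, stride=2))
            self.dec.append(double_conv(2 * w, w))  # 2*w: skip channels + upsampled channels
            ch = w

        self.head = nn.Conv2d(ch, n_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skips = []
        for block in self.down:
            x = block(x)
            skips.append(x)  # saved before pooling, at full level resolution
            x = self.pool(x)
        x = self.bottleneck(x)
        for up, dec, skip in zip(self.up, self.dec, reversed(skips)):
            x = up(x)
            x = dec(torch.cat([skip, x], dim=1))  # concat along the channel axis
        return self.head(x)  # raw logits, shape (N, n_classes, H, W)


model = UNet(in_channels=1, n_classes=3, base=8)  # small base just for a quick check
logits = model(torch.randn(2, 1, 64, 64))
probs = logits.softmax(dim=1)  # per-pixel class probabilities
```

With four pooling steps, this sketch needs H and W divisible by 16. The model returns logits rather than probabilities because `nn.CrossEntropyLoss` expects logits; apply softmax only for inference or for losses that need probabilities.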
Why Skip Connections?
Downsampling loses spatial information. Upsampling alone produces blurry boundaries. Skip connections bring back fine-grained details from the encoder.
- Encoder features: semantic information ("this region is a tumor")
- Skip connection: spatial details ("exact boundary of the tumor")
- Combined: precise, semantically-aware segmentation
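The loss of spatial detail can be seen on a toy mask. The sketch below (NumPy only; the helper names are hypothetical) max-pools a small binary mask by 2× and upsamples it back with nearest-neighbour repetition. The object survives, but its boundary no longer matches the original, which is the information a skip connection would restore.

```python
import numpy as np


def max_pool_2x(mask: np.ndarray) -> np.ndarray:
    """2x2 max pooling for a 2D array whose sides are even."""
    h, w = mask.shape
    return mask.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))


def upsample_nearest_2x(mask: np.ndarray) -> np.ndarray:
    """Nearest-neighbour 2x upsampling: repeat every row and column twice."""
    return mask.repeat(2, axis=0).repeat(2, axis=1)


mask = np.zeros((8, 8), dtype=np.uint8)
mask[3:6, 3:6] = 1  # 3x3 object whose edges do not align with the pooling grid

recovered = upsample_nearest_2x(max_pool_2x(mask))
changed = int((recovered != mask).sum())

print(mask.sum(), recovered.sum(), changed)  # 9 16 7
```

The 9-pixel object comes back as a 16-pixel square: 7 boundary pixels are wrong even though the "semantic" content (there is an object roughly here) is preserved.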
Loss Functions
Binary Cross-Entropy (BCE) for binary segmentation
$$\mathcal{L}_{BCE} = -\frac{1}{N} \sum_{i} [y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)]$$
Problem: Massive class imbalance. In medical imaging, foreground may be 5% of pixels. Because BCE averages over all pixels, the background term dominates the loss, so a model that predicts "all background" reaches 95% pixel accuracy at a low BCE while missing every foreground pixel.
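A small numeric check makes the imbalance problem concrete. This is a sketch with synthetic data (10,000 pixels, 5% foreground, chosen to match the figure above); the `eps` clipping is an added numerical safeguard, not part of the formula.

```python
import numpy as np


def bce(y_true: np.ndarray, y_prob: np.ndarray, eps: float = 1e-7) -> float:
    """Mean binary cross-entropy over all pixels."""
    p = np.clip(y_prob, eps, 1 - eps)  # avoid log(0)
    return float(-np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p)))


y_true = np.zeros(10_000)
y_true[:500] = 1.0  # 5% foreground

all_background = np.full_like(y_true, 0.01)  # confident "background" everywhere
uninformed = np.full_like(y_true, 0.5)       # maximally unsure everywhere

accuracy = float(np.mean((all_background > 0.5) == (y_true > 0.5)))
loss_bg = bce(y_true, all_background)
loss_uninformed = bce(y_true, uninformed)

print(f"accuracy={accuracy:.2f}  BCE(all background)={loss_bg:.3f}  "
      f"BCE(p=0.5)={loss_uninformed:.3f}")
# accuracy=0.95  BCE(all background)=0.240  BCE(p=0.5)=0.693
```

The all-background prediction finds no foreground at all, yet BCE scores it far better than an uninformed prediction, so early in training the easiest way to reduce the loss is to push everything toward background.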
Dice Loss
Based on the Dice coefficient / F1 score:
$$\text{Dice} = \frac{2 |A \cap B|}{|A| + |B|} = \frac{2 \sum_{i} p_i g_i}{\sum_i p_i + \sum_i g_i}$$
$$\mathcal{L}_{Dice} = 1 - \text{Dice}$$
Why it handles imbalance: Dice loss is normalized by both prediction size and GT size. Even if the foreground is 5% of pixels, a correct prediction is fully rewarded.
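The soft Dice loss from the formula above can be sketched in NumPy as follows. The `smooth` term is an assumption added here to avoid division by zero when both prediction and ground truth are empty; it is not part of the formula as written.

```python
import numpy as np


def dice_loss(y_true: np.ndarray, y_prob: np.ndarray, smooth: float = 1e-6) -> float:
    """1 - soft Dice, with p_i = predicted probability and g_i = ground truth."""
    intersection = np.sum(y_prob * y_true)
    denominator = np.sum(y_prob) + np.sum(y_true)
    return float(1.0 - (2.0 * intersection + smooth) / (denominator + smooth))


y_true = np.zeros(10_000)
y_true[:500] = 1.0  # 5% foreground

loss_all_background = dice_loss(y_true, np.full_like(y_true, 0.01))
loss_perfect = dice_loss(y_true, y_true.copy())

print(f"{loss_all_background:.4f} {loss_perfect:.4f}")  # 0.9833 0.0000
```

Unlike pixel accuracy, the all-background prediction is scored as nearly the worst possible outcome, because the 9,500 correctly labelled background pixels do not enter the Dice numerator at all.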
Combined Loss (standard practice)
$$\mathcal{L} = \mathcal{L}_{Dice} + \mathcal{L}_{BCE}$$
This combines Dice (handles imbalance) with BCE (provides pointwise gradients).
Focal Loss variant for segmentation
Focal Dice: downweight easy pixels (confident background) to focus on hard positives.
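The notes do not give a formula for this variant, so the sketch below shows only the focal idea itself, applied to per-pixel cross-entropy: each pixel's loss is scaled by (1 - p_t)^γ, where p_t is the probability assigned to the true class. The choice γ = 2 and the omission of class weighting are assumptions made for illustration.

```python
import numpy as np


def focal_loss_per_pixel(y_true: np.ndarray, y_prob: np.ndarray,
                         gamma: float = 2.0, eps: float = 1e-7) -> np.ndarray:
    """Per-pixel binary focal loss: -(1 - p_t)^gamma * log(p_t)."""
    p = np.clip(y_prob, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)  # probability given to the true class
    return -((1 - p_t) ** gamma) * np.log(p_t)


y_true = np.array([0.0, 1.0])
y_prob = np.array([0.01, 0.10])  # easy background pixel, hard foreground pixel

plain_ce = focal_loss_per_pixel(y_true, y_prob, gamma=0.0)  # gamma=0 is plain CE
focal = focal_loss_per_pixel(y_true, y_prob, gamma=2.0)

print(plain_ce)  # roughly [0.01005, 2.30259]
print(focal)     # roughly [1.0e-06, 1.86510]
```

With γ = 2 the confident background pixel's loss shrinks by a factor of 10,000, while the hard positive keeps about 81% of its loss, so the gradient is dominated by the pixels the model currently gets wrong.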
Evaluation Metrics
Pixel Accuracy
$$\text{Acc} = \frac{\text{Correct pixels}}{\text{Total pixels}}$$
Misleading for imbalanced classes (95% background → 95% acc trivially).
Mean IoU (mIoU)
$$\text{mIoU} = \frac{1}{C} \sum_{c=0}^{C-1} \frac{TP_c}{TP_c + FP_c + FN_c}$$
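The standard benchmark metric for semantic segmentation. It computes IoU per class and then averages, so rare classes count as much as frequent ones. False positives (over-segmentation) and false negatives (under-segmentation) enter the denominator with equal weight.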
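One way to compute mIoU is from a confusion matrix, as in the NumPy sketch below. The function name and the handling of classes absent from both prediction and ground truth (skipped via NaN rather than counted as 0 or 1) are assumptions; benchmarks differ on that convention.

```python
import numpy as np


def mean_iou(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int):
    """Return (mIoU, per-class IoU) for integer label arrays of equal shape."""
    # Confusion matrix: rows are true classes, columns are predicted classes.
    idx = y_true.ravel() * num_classes + y_pred.ravel()
    cm = np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as c, but truly another class
    fn = cm.sum(axis=1) - tp  # truly c, but predicted as another class
    denom = tp + fp + fn

    # Classes that never appear in either array get NaN and are skipped.
    iou = np.where(denom > 0, tp / np.maximum(denom, 1), np.nan)
    return float(np.nanmean(iou)), iou


y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1])

miou, per_class = mean_iou(y_true, y_pred, num_classes=2)
print(per_class, round(miou, 4))  # IoU 0.75 and 0.6667, mIoU 0.7083
```

Here pixel accuracy is 5/6 ≈ 0.83, but mIoU is lower because the single error costs class 0 a false negative and class 1 a false positive.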
Dice Score
$$\text{Dice} = \frac{2 TP}{2 TP + FP + FN}$$
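Identical to the F1 score. Popular in medical imaging. For a single class, Dice and IoU are not equal but are monotonically related by Dice = 2·IoU / (1 + IoU), so they rank individual predictions the same way, with Dice ≥ IoU.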
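The link between the Dice score and the per-class IoU follows directly from the two definitions above. Writing $\text{IoU} = \frac{TP}{TP + FP + FN}$ and dividing the numerator and denominator of Dice by $TP + FP + FN$:

$$\text{Dice} = \frac{2\,TP}{TP + (TP + FP + FN)} = \frac{2\,\text{IoU}}{1 + \text{IoU}}, \qquad \text{IoU} = \frac{\text{Dice}}{2 - \text{Dice}}$$

For example, an IoU of 0.5 corresponds to a Dice score of about 0.667. The mapping is monotonic, so the two metrics agree on which of two predictions is better for a given image, although averages over many images or classes need not preserve that ordering.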
Interview Questions
Q: When would you use Dice loss vs BCE for segmentation?
A: For heavily imbalanced datasets (medical imaging, defect detection where the lesion is < 5% of pixels), prefer Dice or Dice+BCE. Dice is normalized by prediction and ground-truth size, so a rare foreground class still produces a meaningful gradient. When classes are roughly balanced, BCE (or multi-class cross-entropy) alone usually works fine. Note that outdoor-scene datasets such as Cityscapes are themselves imbalanced (road and building pixels far outnumber poles or riders), so class weighting or a Dice term can help there too. In practice, Dice+BCE is a common default: BCE provides dense per-pixel gradients and Dice corrects for imbalance.
Q: What's the difference between transposed convolution and bilinear upsampling + conv?
A: Transposed convolution learns its upsampling weights (a k×k kernel per input/output channel pair) and can produce "checkerboard artifacts" when the kernel size is not divisible by the stride, because output pixels then receive uneven kernel overlap. Bilinear upsampling is parameter-free and smooth; it is followed by a regular conv for learned feature processing. Many modern U-Net variants prefer the bilinear+conv approach because it avoids these artifacts and tends to be more stable to train. Whether it also saves parameters or memory depends on the kernel sizes and channel counts used.
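Both points in this answer can be checked with plain arithmetic. The sketch below (pure Python, hypothetical helper names) counts how many kernel taps land on each output position of a 1D transposed convolution without padding, then compares parameter counts for one decoder step at 1024 → 512 channels, a size taken from the diagram's bottleneck and used here only as an example.

```python
def overlap_counts(n_in: int, kernel: int, stride: int) -> list[int]:
    """How many input positions contribute to each output position (1D, no padding)."""
    out = [0] * ((n_in - 1) * stride + kernel)
    for i in range(n_in):
        for k in range(kernel):
            out[i * stride + k] += 1
    return out


def conv_params(c_in: int, c_out: int, k: int, bias: bool = True) -> int:
    """Weights (plus optional bias) of a k x k 2D conv or transposed conv."""
    return c_in * c_out * k * k + (c_out if bias else 0)


print(overlap_counts(4, kernel=3, stride=2))  # [1, 1, 2, 1, 2, 1, 2, 1, 1]  uneven
print(overlap_counts(4, kernel=2, stride=2))  # [1, 1, 1, 1, 1, 1, 1, 1]     even

transposed_2x2 = conv_params(1024, 512, k=2)  # 2x2 transposed conv, stride 2
bilinear_3x3 = conv_params(1024, 512, k=3)    # bilinear (0 params) + 3x3 conv
bilinear_1x1 = conv_params(1024, 512, k=1)    # bilinear (0 params) + 1x1 conv
print(transposed_2x2, bilinear_3x3, bilinear_1x1)  # 2097664 4719104 524800
```

A kernel of 3 with stride 2 gives the alternating 1, 2, 1, 2 pattern behind checkerboard artifacts, while a kernel divisible by the stride overlaps evenly. The parameter comparison goes either way depending on the conv that follows the bilinear step.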
Q: How would you adapt U-Net for 3D medical images (CT/MRI volumes)?
A: Replace all 2D operations with 3D equivalents: nn.Conv2d→nn.Conv3d, nn.MaxPool2d→nn.MaxPool3d, nn.BatchNorm2d→nn.BatchNorm3d. The challenge is memory: a single 64-channel float32 feature map over a 512³ volume takes 512³ × 64 × 4 bytes = 32 GiB, before counting gradients or any other layer. Solutions: (1) patch-based training (crop overlapping 128³ patches, stitch predictions at test time); (2) mixed 2D+3D designs (for example a 2D encoder with a 3D decoder) to save memory; (3) anisotropic convolutions for data with non-cubic voxels (CT often has 0.5mm in-plane resolution but 2mm slice thickness).
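The memory arithmetic and the patch layout for option (1) can be sketched in pure Python. The stride of 96 voxels (a 32-voxel overlap between neighbouring 128³ patches) is an assumption chosen for illustration, as are the helper names.

```python
def feature_map_bytes(depth: int, height: int, width: int,
                      channels: int, bytes_per_value: int = 4) -> int:
    """Size of one dense activation tensor (float32 by default)."""
    return depth * height * width * channels * bytes_per_value


def patch_starts(size: int, patch: int, stride: int) -> list[int]:
    """Start offsets along one axis covering [0, size); assumes size >= patch.

    The last patch is shifted back so that it ends exactly at the volume border.
    """
    starts = list(range(0, size - patch + 1, stride))
    if starts[-1] != size - patch:
        starts.append(size - patch)
    return starts


GIB = 1024 ** 3
full_volume = feature_map_bytes(512, 512, 512, channels=64)
one_patch = feature_map_bytes(128, 128, 128, channels=64)
print(full_volume / GIB, one_patch / GIB)  # 32.0 0.5

starts = patch_starts(size=512, patch=128, stride=96)
print(starts, len(starts) ** 3)  # [0, 96, 192, 288, 384] 125
```

Each 128³ patch needs 1/64 of the activation memory of the full volume, at the cost of running 125 forward passes per volume at test time and blending predictions in the overlapping regions.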