Lab 03: U-Net — Semantic Segmentation

What is Semantic Segmentation?

Semantic segmentation assigns a class label to every pixel in an image, whereas detection predicts bounding boxes.

| Task | Output | Example |
|---|---|---|
| Classification | 1 label per image | "This is a cat" |
| Detection | Bounding boxes | "Cat at [x1,y1,x2,y2]" |
| Semantic segmentation | Label per pixel | Each pixel = car/road/sky |
| Instance segmentation | Label + ID per pixel | Car #1, Car #2, background |

U-Net Architecture

Originally designed for biomedical image segmentation (2015). Now widely used for segmentation well beyond that domain.

Input (572×572×1) — or any (H×W×C)
        │
   Encoder (Contracting Path)
   ┌─────────────────────────────────────────┐
   │ Block 1: 3×3 conv → 3×3 conv → MaxPool  │  64 channels  → skip₁
   │ Block 2: 3×3 conv → 3×3 conv → MaxPool  │ 128 channels  → skip₂
   │ Block 3: 3×3 conv → 3×3 conv → MaxPool  │ 256 channels  → skip₃
   │ Block 4: 3×3 conv → 3×3 conv → MaxPool  │ 512 channels  → skip₄
   └─────────────────────────────────────────┘
        │
   Bottleneck: 3×3 conv → 3×3 conv          │ 1024 channels
        │
   Decoder (Expanding Path)
   ┌───────────────────────────────────────────────────────────┐
   │ Upsample 2× → concat(skip₄) → 3×3 conv → 3×3 conv        │ 512 ch
   │ Upsample 2× → concat(skip₃) → 3×3 conv → 3×3 conv        │ 256 ch
   │ Upsample 2× → concat(skip₂) → 3×3 conv → 3×3 conv        │ 128 ch
   │ Upsample 2× → concat(skip₁) → 3×3 conv → 3×3 conv        │  64 ch
   └───────────────────────────────────────────────────────────┘
        │
   1×1 conv → N_classes channels → Softmax per pixel
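
To make the channel and resolution bookkeeping in the diagram concrete, the sketch below traces feature-map sizes through a four-level U-Net in plain Python. The helper name `unet_trace` and its `padded` flag are hypothetical, introduced only for illustration. It assumes square inputs, two 3×3 convs per stage, 2×2 max pooling, and 2× upsampling. With `padded=False` each 3×3 conv trims 2 pixels from height and width, which is the setting behind the 572×572 input size.

```python
def unet_trace(size, base_channels=64, depth=4, padded=True):
    """Return (stage, channels, spatial_size) for each U-Net stage.

    Hypothetical helper for illustration only. Assumes square inputs,
    two 3x3 convs per stage, 2x2 max pooling and 2x upsampling.
    """
    # Two unpadded 3x3 convs shrink height and width by 2 + 2 = 4 pixels.
    shrink = 0 if padded else 4
    trace, channels = [], base_channels

    for level in range(1, depth + 1):            # encoder
        size -= shrink
        trace.append((f"enc{level}", channels, size))
        if size % 2:
            raise ValueError(f"size {size} at enc{level} is not divisible by 2")
        size //= 2                                # 2x2 max pool
        channels *= 2

    size -= shrink                                # bottleneck convs
    trace.append(("bottleneck", channels, size))

    for level in range(depth, 0, -1):             # decoder
        size *= 2                                 # 2x upsample
        channels //= 2
        size -= shrink                            # two convs after the concat
        trace.append((f"dec{level}", channels, size))
    return trace


for stage, channels, size in unet_trace(572, padded=False):
    print(f"{stage:<10} {channels:>4} ch  {size}x{size}")
```

With unpadded convs the 572×572 input ends at 388×388, and each skip tensor (for example 568×568 at `enc1`) is larger than the upsampled decoder map it joins (392×392), so it has to be cropped before concatenation. With `padded=True`, sizes match on both sides as long as the input side is divisible by 2^depth = 16; for example `unet_trace(256)` reaches a 16×16 bottleneck and returns to 256×256.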

Why Skip Connections?

Downsampling loses spatial information. Upsampling alone produces blurry boundaries. Skip connections bring back fine-grained details from the encoder.

  • Deep features (bottleneck and decoder input): semantic information ("this region is a tumor")
  • Skip connections (high-resolution encoder features): spatial details ("exact boundary of the tumor")
  • Combined: precise, semantically-aware segmentation
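
The loss of spatial detail can be seen in a toy example, sketched here in plain Python. It pools a 4×4 binary mask and upsamples it back; the helper names are made up for this illustration, and a real network pools learned features rather than a mask, but the resolution loss is of the same kind.

```python
def max_pool_2x2(grid):
    """2x2 max pooling with stride 2 on a list-of-lists grid."""
    return [
        [max(grid[r][c], grid[r][c + 1], grid[r + 1][c], grid[r + 1][c + 1])
         for c in range(0, len(grid[0]), 2)]
        for r in range(0, len(grid), 2)
    ]


def upsample_nearest_2x(grid):
    """Nearest-neighbour 2x upsampling: repeat every value in a 2x2 block."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


mask = [
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
restored = upsample_nearest_2x(max_pool_2x2(mask))
changed = sum(a != b for ra, rb in zip(mask, restored) for a, b in zip(ra, rb))
for row in restored:
    print(row)
print("pixels that differ from the original:", changed)
```

The single foreground pixel comes back as a 2×2 blob: the coarse map records that something is in the top-left quadrant but not exactly where. The skip connection supplies the full-resolution encoder features that let the decoder recover that position.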

Loss Functions

Binary Cross-Entropy (BCE) for binary segmentation

$$\mathcal{L}_{BCE} = -\frac{1}{N} \sum_{i} [y_i \log \hat{y}_i + (1-y_i) \log(1-\hat{y}_i)]$$

Problem: massive class imbalance. In medical imaging, foreground may be only 5% of pixels. BCE averages over all pixels, so the background terms dominate the loss and its gradient; a model that predicts "all background" already reaches 95% pixel accuracy with a low BCE.
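
The imbalance problem can be reproduced on made-up data. The sketch below (plain Python; the 100-pixel toy image and the function names are assumptions for illustration) scores a prediction that says "background" everywhere against a mask with 5% foreground.

```python
import math


def bce(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(probs)


def pixel_accuracy(probs, targets, threshold=0.5):
    """Fraction of pixels whose thresholded prediction matches the label."""
    hits = sum((p >= threshold) == bool(y) for p, y in zip(probs, targets))
    return hits / len(targets)


targets = [1] * 5 + [0] * 95          # 5% foreground (toy data)
all_background = [0.01] * 100         # confident "background" everywhere

print(f"accuracy = {pixel_accuracy(all_background, targets):.2f}")
print(f"BCE      = {bce(all_background, targets):.3f}")
```

The all-background prediction reaches 0.95 accuracy and a BCE of roughly 0.24, well below the ln 2 ≈ 0.69 of an uninformed 0.5 prediction, even though it finds no foreground at all.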

Dice Loss

Based on the Dice coefficient / F1 score:

$$\text{Dice} = \frac{2 |A \cap B|}{|A| + |B|} = \frac{2 \sum_{i} p_i g_i}{\sum_i p_i + \sum_i g_i}$$

$$\mathcal{L}_{Dice} = 1 - \text{Dice}$$

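Here p_i is the predicted foreground probability and g_i the ground-truth (GT) label (0 or 1) of pixel i.

Why it handles imbalance: Dice is normalized by the predicted and GT foreground sizes, and true-negative background pixels do not appear in the formula at all. Even if the foreground is only 5% of pixels, the score depends solely on how well that foreground is matched: a correct prediction is fully rewarded, and an all-background prediction scores near 0.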
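
A minimal sketch of the Dice loss in plain Python, evaluated on a toy mask with 5% foreground. The `smooth` term is an assumption that is not part of the formula above; it is a commonly used stabiliser that keeps the ratio defined when both prediction and ground truth are empty.

```python
def soft_dice_loss(probs, targets, smooth=1e-6):
    """1 - Dice, computed on soft probabilities p_i and 0/1 labels g_i.

    `smooth` is an assumed stabiliser (not in the formula above) that keeps
    the ratio defined when both prediction and ground truth are empty.
    """
    intersection = sum(p * g for p, g in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)


targets = [1] * 5 + [0] * 95                  # 5% foreground (toy data)
all_background = [0.01] * 100                 # finds no foreground
good = [0.99] * 5 + [0.01] * 95               # confident and correct

print(f"all background: {soft_dice_loss(all_background, targets):.3f}")
print(f"good mask:      {soft_dice_loss(good, targets):.3f}")
```

The all-background prediction gets a loss close to 1, while the correct mask gets a loss below 0.1, regardless of how many background pixels surround the foreground.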

Combined Loss (standard practice)

$$\mathcal{L} = \mathcal{L}_{Dice} + \mathcal{L}_{BCE}$$

This combines Dice (handles imbalance) with BCE (provides pointwise gradients).

Focal Loss variant for segmentation

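Focal loss down-weights easy pixels (confident background) to focus on hard positives by scaling each pixel's cross-entropy term:

$$\mathcal{L}_{Focal} = -\frac{1}{N} \sum_{i} (1 - p_{t,i})^{\gamma} \log p_{t,i}$$

where p_{t,i} is the probability the model assigns to the true class of pixel i and γ ≥ 0 (γ = 0 recovers BCE). The same modulation is sometimes combined with Dice-style losses.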
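
The down-weighting can be illustrated with the standard focal modulating factor (1 − p_t)^γ applied to per-pixel BCE, where p_t is the probability assigned to the true class. This is a sketch in plain Python; γ = 2 is an assumed, commonly used value, and the two example pixels are made up.

```python
import math


def focal_bce_pixel(p, y, gamma=2.0, eps=1e-7):
    """Focal-weighted BCE for one pixel: (1 - p_t)^gamma * -log(p_t)."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p      # probability assigned to the true class
    return (1.0 - p_t) ** gamma * -math.log(p_t)


easy_background = (0.01, 0)   # confident and correct
hard_positive = (0.10, 1)     # foreground pixel the model mostly misses

for name, (p, y) in [("easy background", easy_background),
                     ("hard positive", hard_positive)]:
    plain = focal_bce_pixel(p, y, gamma=0.0)   # gamma = 0 is plain BCE
    focal = focal_bce_pixel(p, y, gamma=2.0)
    print(f"{name:<16} BCE={plain:.4f}  focal={focal:.6f}")
```

The easy background pixel has its loss scaled by (0.01)² = 0.0001, while the hard positive keeps (0.9)² = 0.81 of its loss, so the rare, poorly predicted pixels dominate the total.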


Evaluation Metrics

Pixel Accuracy

$$\text{Acc} = \frac{\text{Correct pixels}}{\text{Total pixels}}$$

Misleading for imbalanced classes (95% background → 95% acc trivially).

Mean IoU (mIoU)

$$\text{mIoU} = \frac{1}{C} \sum_{c=0}^{C-1} \frac{TP_c}{TP_c + FP_c + FN_c}$$

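The standard benchmark metric for segmentation. IoU is computed per class and then averaged, so every class counts equally regardless of how many pixels it covers. False positives and false negatives both enter the denominator, so over- and under-segmentation are penalized alike.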
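
A minimal sketch of the computation in plain Python, on flat lists of integer labels. The label lists are made up, and skipping classes that appear in neither prediction nor ground truth is an assumed convention (the formula above divides by C; implementations differ on how they treat absent classes).

```python
def mean_iou(pred, target, num_classes):
    """Mean IoU over classes from flat lists of integer labels.

    Classes absent from both prediction and ground truth are skipped,
    which is one common convention (an assumption, not the only choice).
    """
    ious = []
    for c in range(num_classes):
        tp = sum(p == c and t == c for p, t in zip(pred, target))
        fp = sum(p == c and t != c for p, t in zip(pred, target))
        fn = sum(p != c and t == c for p, t in zip(pred, target))
        if tp + fp + fn == 0:
            continue                      # class c never occurs
        ious.append(tp / (tp + fp + fn))
    return sum(ious) / len(ious) if ious else float("nan")


target = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]   # toy ground truth
pred   = [0, 0, 0, 0, 0, 1, 1, 1, 2, 0]   # toy prediction
print(f"mIoU = {mean_iou(pred, target, num_classes=3):.3f}")
```

Here 8 of 10 pixels are correct (pixel accuracy 0.8), but the per-class IoUs are 5/7, 2/3 and 1/2, giving an mIoU of about 0.627: errors on the small classes weigh as much as errors on the large one.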

Dice Score

$$\text{Dice} = \frac{2 TP}{2 TP + FP + FN}$$

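Identical to the F1 score. Popular in medical imaging. For a single class (the binary case), Dice is a monotonic function of IoU, so the two metrics rank predictions identically, but their values differ (Dice ≥ IoU):

$$\text{Dice} = \frac{2\,\text{IoU}}{1 + \text{IoU}}$$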
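
The relationship between Dice and IoU for one class, Dice = 2·IoU / (1 + IoU), can be checked numerically. The pixel counts below are made up for illustration.

```python
def dice_and_iou(tp, fp, fn):
    """Return (Dice, IoU) for one class from pixel counts."""
    dice = 2 * tp / (2 * tp + fp + fn)
    iou = tp / (tp + fp + fn)
    return dice, iou


for tp, fp, fn in [(50, 10, 40), (80, 5, 15), (10, 30, 60)]:   # toy counts
    dice, iou = dice_and_iou(tp, fp, fn)
    from_iou = 2 * iou / (1 + iou)
    print(f"Dice={dice:.3f}  IoU={iou:.3f}  2*IoU/(1+IoU)={from_iou:.3f}")
```

In each row the first and third columns agree, and Dice is never smaller than IoU, so a Dice of 0.667 and an IoU of 0.5 describe the same prediction.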


Interview Questions

Q: When would you use Dice loss vs BCE for segmentation?

A: For imbalanced datasets (medical imaging, defect detection where the lesion covers < 5% of pixels), prefer Dice or Dice+BCE. Dice is normalized by the predicted and ground-truth foreground sizes, so a rare class still produces a strong training signal. When class frequencies are less extreme, BCE (binary) or categorical cross-entropy (multi-class, e.g. outdoor scenes such as Cityscapes), optionally with class weights, works fine. In practice, Dice+BCE often outperforms either alone: BCE provides dense per-pixel gradients, and Dice corrects for imbalance.

Q: What's the difference between transposed convolution and bilinear upsampling + conv?

A: Transposed convolution learns its upsampling weights, which adds parameters (C_in × C_out × k² for a k×k kernel) and can produce "checkerboard artifacts" when the kernel overlaps unevenly across output positions (kernel size not divisible by the stride). Bilinear upsampling is parameter-free and smooth, and is followed by a regular conv for learned feature processing. The bilinear+conv approach is common in modern U-Net variants because it avoids these artifacts and tends to be more stable to train.
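
The uneven overlap behind checkerboard artifacts can be counted directly. This sketch (plain Python, 1D, no padding; the helper name is hypothetical) counts how many kernel taps contribute to each output position of a transposed convolution.

```python
def transposed_conv_overlap(n_inputs, kernel, stride):
    """Count how many kernel taps land on each output position (1D).

    In a 1D transposed convolution without padding, input i writes to
    output positions i*stride ... i*stride + kernel - 1.
    """
    length = (n_inputs - 1) * stride + kernel
    counts = [0] * length
    for i in range(n_inputs):
        for k in range(kernel):
            counts[i * stride + k] += 1
    return counts


print("kernel=3, stride=2:", transposed_conv_overlap(4, kernel=3, stride=2))
print("kernel=4, stride=2:", transposed_conv_overlap(4, kernel=4, stride=2))
print("kernel=2, stride=2:", transposed_conv_overlap(4, kernel=2, stride=2))
```

With kernel 3 and stride 2 the counts alternate between 1 and 2 in the interior, which is the 1D analogue of the checkerboard pattern. With kernel sizes divisible by the stride (4 or 2 here) the interior overlap is uniform.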

Q: How would you adapt U-Net for 3D medical images (CT/MRI volumes)?

A: Replace all 2D operations with their 3D equivalents: nn.Conv2d→nn.Conv3d, nn.MaxPool2d→nn.MaxPool3d, nn.BatchNorm2d→nn.BatchNorm3d. The challenge is memory: a single 64-channel float32 feature map over a 512³ volume takes 512³ × 64 × 4 bytes = 32 GiB. Solutions: (1) patch-based training (crop 128³ patches for training, then predict on overlapping patches and stitch them at test time); (2) hybrid 2D+3D designs (e.g. a 2D encoder with a 3D decoder) to save memory; (3) anisotropic convolutions for data with non-cubic voxels (CT is often around 0.5 mm in-plane with 2 mm slice thickness).
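
The activation-memory estimate is simple arithmetic (bytes = voxels × channels × bytes per element) and can be reproduced as follows. The helper name is hypothetical, and the figure covers one feature map for one sample only; training also stores other layers' activations and gradients, so real usage is higher.

```python
def activation_bytes(side, channels, bytes_per_element=4):
    """Bytes for one feature map over a cubic volume (float32 by default)."""
    return side ** 3 * channels * bytes_per_element


GIB = 1024 ** 3
full = activation_bytes(512, 64)     # whole 512^3 volume
patch = activation_bytes(128, 64)    # one 128^3 training patch

print(f"512^3 x 64 ch: {full / GIB:.1f} GiB")
print(f"128^3 x 64 ch: {patch / GIB:.1f} GiB")
print(f"patch is {full // patch}x smaller")
```

A 64-channel float32 map over the full 512³ volume comes to 32 GiB, while the same map over a 128³ patch needs 0.5 GiB, a 64× reduction, which is why patch-based training is the usual first remedy.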