Lab 02: Data Preprocessing & Augmentation for CV
What You'll Learn
- Albumentations for fast, CPU-side image augmentation that outputs PyTorch-ready tensors
- Handling class imbalance: oversampling, undersampling, class weights
- Focal Loss — designed specifically for class imbalance in detection
- Data pipeline best practices (avoid augmentation leakage)
Why Augmentation Matters
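A model trained on 10,000 images usually generalizes better when random augmentation shows it a different variant of each image every epoch, so that over training it sees something closer to 500,000 distinct (though correlated) examples. Augmentation is often one of the highest-ROI techniques in CV, especially when labeled data is limited.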
Augmentation Categories
| Category | Operations | Preserves Label? |
|---|---|---|
| Geometric | flip, rotate, scale, crop, perspective | Usually yes (boxes and masks must be transformed too) |
| Photometric | brightness, contrast, hue, saturation | Usually (not when color defines the class) |
| Noise/Blur | Gaussian noise, motion blur, JPEG artifacts | Usually (at moderate strength) |
| Regularization | Cutout, CutMix, MixUp | Cutout: yes; CutMix/MixUp require label mixing |
| Domain-specific | Elastic deformation (medical), rain/fog (driving) | Usually yes |
Critical rule: Apply augmentation only to training data. Validation and test sets should use only normalization (and possibly center crop for classification).
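One way to follow this rule is to split the data first and only then attach the random transforms to the training path. The sketch below illustrates the idea with the standard library only; `split_indices`, `augment`, `normalize`, and `load` are hypothetical stand-ins for a real dataset split and real transforms, not part of any library.

```python
import random

def split_indices(n, val_fraction=0.2, seed=0):
    """Shuffle sample indices once and split them BEFORE any augmentation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_val = int(n * val_fraction)
    return idx[n_val:], idx[:n_val]  # (train, val)

def normalize(pixels):
    """Deterministic preprocessing shared by train and val (scale to [0, 1])."""
    return [p / 255.0 for p in pixels]

def augment(pixels, rng):
    """Toy random augmentation: flip a 1-D row of pixels half the time."""
    return pixels[::-1] if rng.random() < 0.5 else pixels

def load(sample, is_train, rng):
    """Apply augmentation only on the training path, then normalize."""
    if is_train:
        sample = augment(sample, rng)
    return normalize(sample)

train_idx, val_idx = split_indices(10)
val_sample = load([0, 255], is_train=False, rng=random.Random(0))
```

Because the split happens before any augmentation, no augmented copy of a training image can end up in the validation set, and validation samples always pass through the same deterministic preprocessing.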
Albumentations
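Albumentations is one of the most widely used libraries for CV augmentation. It operates on NumPy arrays and relies on OpenCV for most image operations, which is why it is often reported to be several times faster than torchvision's PIL-based transforms (figures of 3-10× are cited, but the speedup depends on the transform, image size, and library versions).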
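```python
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Training-only pipeline. Argument names follow the Albumentations 1.x API;
# some newer releases rename them (for example, RandomResizedCrop accepting
# size=(224, 224)), so check the installed version.
train_transform = A.Compose([
    A.RandomResizedCrop(height=224, width=224, scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1, p=0.8),
    A.GaussNoise(var_limit=(10, 50), p=0.3),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.5),
    # Normalize with ImageNet statistics, then convert HWC NumPy -> CHW tensor.
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ToTensorV2(),
])
```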
For object detection, pass bounding boxes through the transform:
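```python
# bbox_params declares the box format and which extra arguments hold
# per-box labels, so boxes and labels stay in sync with the image.
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_labels']))

# image: HxWxC NumPy array; bboxes: [x_min, y_min, x_max, y_max] in pixels.
transformed = transform(image=image, bboxes=bboxes, class_labels=labels)
```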
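To make clear why boxes have to go through the transform, here is a minimal sketch of what a horizontal flip does to `pascal_voc` boxes. `hflip_boxes` is a hypothetical helper written for illustration, not an Albumentations function, and the library's internal coordinate convention may differ slightly.

```python
def hflip_boxes(bboxes, image_width):
    """Mirror pascal_voc boxes [x_min, y_min, x_max, y_max] left to right."""
    flipped = []
    for x_min, y_min, x_max, y_max in bboxes:
        # Reflect both x coordinates and swap them so x_min <= x_max still holds.
        flipped.append([image_width - x_max, y_min, image_width - x_min, y_max])
    return flipped

boxes = [[10, 20, 50, 80]]
flipped_boxes = hflip_boxes(boxes, image_width=100)
print(flipped_boxes)  # [[50, 20, 90, 80]]
```

If only the image were flipped, the original box would now point at the wrong region, which silently corrupts the labels.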
Class Imbalance
One of the most common problems in real-world CV datasets is class imbalance: for example, 95% negative (background) and 5% positive (defect/person).
Strategy Comparison
| Method | How | Pro | Con |
|---|---|---|---|
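| Class weights | weight[c] = N / (K × N_c), with N samples, K classes, N_c samples in class c | No extra data needed | Doesn't help with hard negatives |
| Oversampling | Repeat minority examples (ideally with augmentation); SMOTE synthesizes new ones | Balances the class distribution seen in training | SMOTE interpolates feature vectors and works poorly on raw pixels; plain duplicates can overfit |
| Undersampling | Randomly drop majority examples | Fast | Loses information |
| Focal Loss | Down-weight easy negatives | Works well for dense detection | Needs tuning of γ and α |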
| Data collection | More minority examples | Correct fix | Expensive |
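The class-weight formula from the table can be sketched in a few lines. `class_weights` is a hypothetical helper, and the counts are made up to match the 95%/5% example above.

```python
def class_weights(counts):
    """weight[c] = N / (K * N_c) for per-class sample counts N_c."""
    n_total = sum(counts)   # N
    k = len(counts)         # K
    return [n_total / (k * n_c) for n_c in counts]

counts = [950, 50]  # assumed: 950 background images, 50 defect images
weights = class_weights(counts)
print(weights)  # roughly [0.526, 10.0]
```

With these weights each class contributes the same total weight (here 500 each), so the minority class is no longer outvoted. In PyTorch, such a list is typically passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.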
Focal Loss Derivation
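Standard cross-entropy loss: $CE(p_t) = -\log(p_t)$, where $p_t$ is the predicted probability of the true class and $\log$ is the natural logarithm.
For a well-classified example ($p_t = 0.9$): $CE = -\log(0.9) \approx 0.105$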
In a dataset with 99% negatives and batch_size=256:
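- ~253 easy negatives ($p_t = 0.9$) contribute loss ≈ 0.105 each
- ~3 hard positives ($p_t = 0.1$) contribute loss ≈ 2.3 each
- Summed, the easy negatives contribute roughly four times as much loss as the positives, so they dominate the gradient and the model learns little from the hard cases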
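The batch arithmetic above can be checked directly. This sketch assumes every negative is predicted with $p_t = 0.9$ and every positive with $p_t = 0.1$, as in the bullets.

```python
import math

batch_size = 256
n_pos = 3
n_neg = batch_size - n_pos        # 253

loss_easy_neg = -math.log(0.9)    # ~0.105 per easy negative
loss_hard_pos = -math.log(0.1)    # ~2.303 per hard positive

neg_total = n_neg * loss_easy_neg
pos_total = n_pos * loss_hard_pos
neg_share = neg_total / (neg_total + pos_total)

print(f"negatives: {neg_total:.1f}, positives: {pos_total:.1f}, share: {neg_share:.0%}")
# negatives: 26.7, positives: 6.9, share: 79%
```

Even though each positive is about 22 times more "wrong" than each negative, the negatives still account for most of the summed loss simply because there are so many of them.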
Focal Loss (Lin et al., 2017, RetinaNet paper):
$$FL(p_t) = -\alpha_t(1-p_t)^\gamma \log(p_t)$$
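- $(1-p_t)^\gamma$: modulating factor. With $\gamma=2$, an easy example with $p_t=0.9$ gets $(1-0.9)^2 = 0.01$, so its loss is reduced 100×
- A hard example with $p_t=0.1$ gets $(1-0.1)^2 = 0.81$, so its loss is barely reduced
- $\gamma=2$ worked best in the RetinaNet paper's experiments and is a common default; $\gamma=0$ recovers ($\alpha$-weighted) cross-entropy
- $\alpha_t$: class-balancing weight, equal to $\alpha$ for positives and $1-\alpha$ for negatives (the paper uses $\alpha=0.25$ together with $\gamma=2$)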
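The formula can be sketched for a single binary prediction as follows. This is a plain-Python illustration of the equation above, not the RetinaNet implementation; `focal_loss` and `cross_entropy` are hypothetical helpers, and the `eps` clamp is an added assumption to avoid `log(0)`.

```python
import math

def cross_entropy(p, y):
    """CE(p_t) = -log(p_t) for p = P(y=1) and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-12):
    """FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)."""
    p_t = p if y == 1 else 1.0 - p
    p_t = min(max(p_t, eps), 1.0)                 # clamp to avoid log(0)
    alpha_t = alpha if y == 1 else 1.0 - alpha    # alpha for positives, 1 - alpha otherwise
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy_neg = focal_loss(0.1, 0)   # negative predicted with p_t = 0.9
hard_pos = focal_loss(0.1, 1)   # positive predicted with p_t = 0.1
print(f"easy negative: {easy_neg:.5f}, hard positive: {hard_pos:.4f}")
```

The easy negative's loss shrinks by far more than the hard positive's, which is the rebalancing effect the modulating factor is meant to produce. For real training on logits, torchvision ships `torchvision.ops.sigmoid_focal_loss`, which is usually preferable to a hand-rolled version.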
Interview Questions
Q: What's the difference between augmentation during training vs test-time augmentation (TTA)?
A: During training, augmentation artificially increases dataset diversity to reduce overfitting. TTA applies augmentation at inference: make N augmented versions of the test image, run inference on all N, and average (or otherwise ensemble) the predictions. TTA often gives a modest accuracy gain (figures around 1-3% are commonly cited, but this depends on the model and dataset) with no additional training. Common TTA choices: horizontal flip, 5-crop (4 corners + center). The tradeoff is N× inference cost.
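A minimal sketch of horizontal-flip TTA, using nested lists as images so it runs without any dependencies. `toy_model` is a hypothetical stand-in for a trained classifier that returns class probabilities; a real model would take tensors.

```python
def hflip(image):
    """Flip a 2-D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def predict_tta(model, image):
    """Average class probabilities over the original and the flipped view."""
    views = [image, hflip(image)]
    preds = [model(view) for view in views]   # N = 2 forward passes
    n_classes = len(preds[0])
    return [sum(p[c] for p in preds) / len(preds) for c in range(n_classes)]

def toy_model(image):
    """Hypothetical classifier whose output depends on the top-left pixel."""
    p = image[0][0] / 255.0
    return [1.0 - p, p]

image = [[255, 0], [0, 0]]
probs = predict_tta(toy_model, image)
print(probs)  # [0.5, 0.5]
```

The toy model is deliberately sensitive to flips, so averaging the two views smooths out its position-dependent prediction; the cost is two forward passes instead of one.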
Q: Your detection model has 98% background, 2% objects. Training loss is 0.05 after 1 epoch but recall is 0. Why?
A: The model learned to predict everything as background, which achieves 98% accuracy but 0% recall. The average cross-entropy loss is low because it is dominated by the many easy negatives, so the few missed positives barely register. Solutions: (1) Focal loss (e.g. γ=2, α=0.25), (2) Hard negative mining (only backprop the top-k hardest negative examples), (3) Class-balanced sampling (e.g. ensure each batch has about 50% positive examples), (4) Class-weighted loss. In practice, RetinaNet uses focal loss, and other detectors such as the YOLO family apply their own re-weighting or sampling of positives and negatives for the same reason.
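Option (3), class-balanced sampling, can be sketched as follows. `balanced_batch` is a hypothetical helper for binary labels, and the 2%/98% label list is made up to match the question.

```python
import random

def balanced_batch(labels, batch_size, rng):
    """Sample indices so that half the batch is positive (label 1)."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n_pos = batch_size // 2
    # Positives are drawn with replacement because they are scarce;
    # negatives are drawn without replacement.
    batch = rng.choices(pos, k=n_pos) + rng.sample(neg, batch_size - n_pos)
    rng.shuffle(batch)
    return batch

labels = [1] * 2 + [0] * 98   # 2% positives, 98% background
batch = balanced_batch(labels, batch_size=8, rng=random.Random(0))
n_positive = sum(labels[i] for i in batch)
```

Because the same few positives are repeated often, this is best combined with augmentation so the model does not memorize them. In PyTorch, `torch.utils.data.WeightedRandomSampler` is a common way to get a similar effect.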