Lab 02: Data Preprocessing & Augmentation for CV

What You'll Learn

Albumentations for fast, CPU-side image augmentation that feeds GPU training
Handling class imbalance: oversampling, undersampling, class weights
Focal Loss — designed specifically for class imbalance in detection
Data pipeline best practices (avoid augmentation leakage)

Why Augmentation Matters

A model trained on 10,000 images usually generalizes better when augmentation exposes it to many varied versions of each image, as if it had seen something like 500,000 diverse examples. Augmentation is often one of the highest-ROI techniques in CV.

Augmentation Categories

Category	Operations	Preserves Label?
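Geometric	flip, rotate, scale, crop, perspective	Usually yes (exceptions include flips on text or digits; boxes and masks must be transformed with the image)
Photometric	brightness, contrast, hue, saturation	Usually (not when color itself defines the class)
Noise/Blur	Gaussian noise, motion blur, JPEG artifacts	Usually (at moderate strength)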
Regularization	Cutout, CutMix, MixUp	Requires label mixing
Domain-specific	Elastic deformation (medical), rain/fog (driving)	Yes

Critical rule: Apply augmentation only to training data. Validation and test sets should use only normalization (and possibly center crop for classification).
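One way to follow this rule is to split the data first and attach random transforms only to the training split. The sketch below uses plain Python; `split_indices`, `augment`, `normalize`, and `load_sample` are hypothetical stand-ins for a real pipeline, not Albumentations calls.

```python
import random

def split_indices(n_samples, val_fraction=0.2, seed=0):
    """Shuffle once, then split; augmentation happens only after this step."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_fraction)
    return indices[n_val:], indices[:n_val]  # train indices, val indices

def normalize(pixels):
    # Deterministic preprocessing shared by every split.
    return [p / 255.0 for p in pixels]

def augment(pixels, rng):
    # Toy random augmentation: horizontal flip with probability 0.5.
    return pixels[::-1] if rng.random() < 0.5 else pixels

def load_sample(pixels, is_train, rng):
    # Random augmentation is applied only when is_train is True.
    if is_train:
        pixels = augment(pixels, rng)
    return normalize(pixels)

train_idx, val_idx = split_indices(10)
rng = random.Random(42)
val_sample = load_sample([0, 128, 255], is_train=False, rng=rng)
```

Because the split happens before any augmentation, an augmented copy of a training image cannot end up in the validation set, and validation samples are processed the same way on every pass.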

Albumentations

Albumentations is a widely used library for CV augmentation. It is often cited as roughly 3-10× faster than torchvision transforms, depending on the operation, because it operates on NumPy arrays and delegates the heavy lifting to OpenCV.

import albumentations as A
from albumentations.pytorch import ToTensorV2

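# Argument names follow older Albumentations releases; newer releases reportedly
# expect size=(224, 224) in RandomResizedCrop and std_range in GaussNoise.
train_transform = A.Compose([
    A.RandomResizedCrop(height=224, width=224, scale=(0.8, 1.0)),  # random crop, resized to 224x224
    A.HorizontalFlip(p=0.5),
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1, p=0.8),
    A.GaussNoise(var_limit=(10, 50), p=0.3),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.1, rotate_limit=15, p=0.5),  # small affine jitter
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet statistics
    ToTensorV2(),  # HWC NumPy array -> CHW torch tensor
])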

For object detection, pass bounding boxes through the transform:

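# image: HxWx3 NumPy array
# bboxes: list of [x_min, y_min, x_max, y_max] in pixels (the 'pascal_voc' format)
# labels: one class label per box
transform = A.Compose([
    A.HorizontalFlip(p=0.5),
    A.RandomBrightnessContrast(p=0.3),
], bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_labels']))

transformed = transform(image=image, bboxes=bboxes, class_labels=labels)
# Boxes and labels come back transformed in step with the image.
aug_image, aug_bboxes, aug_labels = transformed['image'], transformed['bboxes'], transformed['class_labels']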
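To see why boxes have to go through the transform, it helps to write out what a horizontal flip does to a `pascal_voc` box. The function below is a hypothetical illustration of the geometry only, not the library's internal implementation.

```python
def hflip_bbox_pascal_voc(bbox, image_width):
    """Mirror one [x_min, y_min, x_max, y_max] pixel box across the vertical axis."""
    x_min, y_min, x_max, y_max = bbox
    # The old right edge becomes the new left edge, and vice versa.
    return [image_width - x_max, y_min, image_width - x_min, y_max]

boxes = [[10, 20, 50, 80], [100, 30, 180, 90]]
flipped = [hflip_bbox_pascal_voc(b, image_width=200) for b in boxes]
```

If only the image were flipped, the first box would still claim an object near the left edge while the object itself had moved to the right, and the model would be trained on a wrong label.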

Class Imbalance

Class imbalance is among the most common problems in real-world CV datasets, for example 95% negative (background) and 5% positive (defect/person).

Strategy Comparison

Method	How	Pro	Con
Class weights	`weight[c] = N / (K × N_c)`	No extra data needed	Doesn't help with hard negatives
Oversampling	Repeat minority images (with augmentation); SMOTE synthesizes points in feature space	Fixes marginal distribution	Can overfit to repeated examples; SMOTE is poorly suited to raw pixels
Undersampling	Randomly drop majority	Fast	Loses information
Focal Loss	Down-weight easy negatives	Well suited to dense detection	Needs tuning of γ and α
Data collection	More minority examples	Correct fix	Expensive
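The class-weight formula in the table can be checked on the 95/5 split mentioned above. This is a minimal sketch in plain Python; `class_weights` and `sampling_weights` are illustrative helper names, not library functions.

```python
from collections import Counter

def class_weights(labels):
    """weight[c] = N / (K * N_c), with N samples, K classes, N_c samples in class c."""
    counts = Counter(labels)
    n_total = len(labels)
    n_classes = len(counts)
    return {c: n_total / (n_classes * n_c) for c, n_c in counts.items()}

def sampling_weights(labels):
    """Per-sample weights that make every class equally likely to be drawn."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

labels = [0] * 95 + [1] * 5
weights = class_weights(labels)          # background gets about 0.53, positives get 10.0
per_sample = sampling_weights(labels)
```

The class weights would typically be passed to a weighted loss, while per-sample weights of this kind are the sort of input a weighted sampler (for example PyTorch's `WeightedRandomSampler`) expects for oversampling.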

Focal Loss Derivation

Standard cross-entropy loss: $CE(p_t) = -\log(p_t)$

For a well-classified example ($p_t = 0.9$): $CE = -\log(0.9) = 0.105$

In a dataset with 99% negatives and batch_size=256:

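~253 easy negatives ($p_t = 0.9$) contribute loss ≈ 0.105 each
~3 hard positives ($p_t = 0.1$) contribute loss ≈ 2.3 each
Total loss is dominated by easy negatives → the gradient is driven mostly by examples the model already gets right, so the hard cases contribute little to learning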
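
Working the numbers for this batch, using $-\log(0.1) \approx 2.30$ for the hard positives:

$$253 \times 0.105 \approx 26.6 \qquad \text{vs.} \qquad 3 \times 2.30 \approx 6.9$$

The easy negatives account for roughly 79% of the summed loss ($26.6 / 33.5$), even though each of them is already well classified.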

Focal Loss (Lin et al., 2017, RetinaNet paper):

$$FL(p_t) = -\alpha_t(1-p_t)^\gamma \log(p_t)$$

$(1-p_t)^\gamma$: modulating factor. With $\gamma=2$, an easy example ($p_t=0.9$) gets $(1-0.9)^2 = 0.01$ → loss reduced 100×
A hard example ($p_t=0.1$) gets $(1-0.1)^2 = 0.81$ → loss barely reduced
$\gamma=2$ worked best in the RetinaNet paper's experiments and is a common default
$\alpha_t$: class balancing weight (the paper pairs $\alpha=0.25$ for positives with $\gamma=2$)
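
The effect of the modulating factor on the 256-example batch above can be sketched in a few lines of plain Python. This is a per-example scalar version for illustration, not a training-ready loss, and `alpha_t` is set to 1 here (an assumption) so that only the $(1-p_t)^\gamma$ term is at work.

```python
import math

def cross_entropy(p_t):
    """CE(p_t) = -log(p_t) for one example."""
    return -math.log(p_t)

def focal_loss(p_t, gamma=2.0, alpha_t=1.0):
    """FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t) for one example."""
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# 253 easy negatives (p_t = 0.9) and 3 hard positives (p_t = 0.1).
ce_neg, ce_pos = 253 * cross_entropy(0.9), 3 * cross_entropy(0.1)
fl_neg, fl_pos = 253 * focal_loss(0.9), 3 * focal_loss(0.1)

# Fraction of the summed loss that comes from the easy negatives.
ce_neg_share = ce_neg / (ce_neg + ce_pos)
fl_neg_share = fl_neg / (fl_neg + fl_pos)
print(f"easy negatives' share of loss: CE {ce_neg_share:.2f}, focal {fl_neg_share:.2f}")
```

Under cross-entropy the easy negatives supply about 79% of the batch loss; with $\gamma=2$ their share drops to about 5%, so the three hard positives now drive most of the gradient.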

Interview Questions

Q: What's the difference between augmentation during training vs test-time augmentation (TTA)?

A: During training, augmentation artificially increases dataset diversity to reduce overfitting. TTA applies augmentation at inference: make N augmented versions of the test image, run inference on all N, and average or ensemble the predictions. Gains of around 1-3% accuracy are commonly reported, though this varies by task and model, and there is no extra training cost. Common TTA: horizontal flip, 5-crop (4 corners + center). The tradeoff: N× inference cost.
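
A minimal sketch of the TTA averaging step for classification, in plain Python. `toy_model` is a hypothetical stand-in classifier that is deliberately sensitive to left/right orientation; a real model would be a trained network.

```python
def identity(image):
    return image

def hflip(image):
    """Flip a 2D image (list of rows) left to right."""
    return [row[::-1] for row in image]

def predict_with_tta(model, image, augmentations):
    """Average class probabilities over augmented views of one image."""
    predictions = [model(aug(image)) for aug in augmentations]
    n_views = len(predictions)
    n_classes = len(predictions[0])
    return [sum(p[c] for p in predictions) / n_views for c in range(n_classes)]

def toy_model(image):
    # Hypothetical classifier: class-1 probability is the mean of the left column.
    p1 = sum(row[0] for row in image) / len(image)
    return [1.0 - p1, p1]

image = [[0.9, 0.1], [0.7, 0.3]]
tta_probs = predict_with_tta(toy_model, image, [identity, hflip])
```

For detection or segmentation the same idea applies, but predictions from a flipped view have to be mapped back to the original coordinates before they are merged.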

Q: Your detection model has 98% background, 2% objects. Training loss is 0.05 after 1 epoch but recall is 0. Why?

A: The model has learned to predict everything as background, which yields 98% accuracy but 0% recall. The cross-entropy loss is dominated by easy negatives, so it can be low while the positives are ignored. Solutions: (1) focal loss, e.g. γ=2, α=0.25; (2) hard negative mining (backpropagate only the top-k highest-loss negatives); (3) class-balanced sampling (e.g. ensure each batch is about 50% positive); (4) a class-weighted loss. In practice, RetinaNet uses focal loss, and many other detectors rely on some form of loss re-weighting or sample balancing for the same reason.
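
Option (2) can be sketched as a selection step over per-example losses. The function name and the 3:1 negative-to-positive ratio below are assumptions chosen for illustration.

```python
def hard_negative_mining(losses, labels, neg_pos_ratio=3):
    """Return indices of all positives plus the highest-loss negatives.

    neg_pos_ratio=3 is an assumed default for this sketch.
    """
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    n_keep = max(1, neg_pos_ratio * len(pos_idx))
    # Sort negatives from hardest (highest loss) to easiest and keep the top n_keep.
    hardest_neg = sorted(neg_idx, key=lambda i: losses[i], reverse=True)[:n_keep]
    return sorted(pos_idx + hardest_neg)

labels = [1, 0, 0, 0, 0, 0, 0, 0]
losses = [2.3, 0.05, 0.9, 0.02, 1.4, 0.01, 0.6, 0.03]
kept = hard_negative_mining(losses, labels)
mined_loss = sum(losses[i] for i in kept) / len(kept)
```

Only the examples in `kept` would contribute to the backward pass, so the four near-zero-loss background examples no longer dilute the signal from the single positive.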
