Lab 01: YOLOv8 — Real-Time Object Detection

Architecture Overview

YOLOv8 follows the single-stage detection paradigm: one forward pass produces all detections.

Input (640×640×3)
        │
   Backbone (CSPDarknet + C2f blocks)
   • Extracts multi-scale features: P3 (80×80), P4 (40×40), P5 (20×20)
        │
   Neck (PAN-FPN)
   • Path Aggregation Network: fuses features top-down and bottom-up
   • Enables detection at 3 scales simultaneously
        │
   Head (Decoupled head — separate branches for cls and reg)
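   • Anchor-free: one prediction per grid cell at each scale (no predefined anchor boxes)
   • Predicts per cell: 4 box-side distributions (for DFL) + 80 class scores, decoded to [x, y, w, h, cls_scores × 80]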
        │
   Post-processing
   • Sigmoid activation on class scores
   • Decode the DFL distributions into box coordinates (DFL itself is a training loss)
   • NMS per class
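
As a quick sanity check on the diagram, the sketch below counts the grid cells produced by the three scales for a 640×640 input and the raw head channels per cell before decoding. The value REG_MAX = 16 (bins per box side) is an assumption based on the common Ultralytics default, not something stated in this lab.

```python
# Strides of the three detection scales for a 640x640 input.
IMG_SIZE = 640
STRIDES = {"P3": 8, "P4": 16, "P5": 32}
NUM_CLASSES = 80  # COCO
REG_MAX = 16      # assumed number of DFL bins per box side


def grid_summary(img_size=IMG_SIZE, strides=STRIDES):
    """Return {level: (grid_side, number_of_cells)} for a square input."""
    return {
        name: (img_size // stride, (img_size // stride) ** 2)
        for name, stride in strides.items()
    }


summary = grid_summary()
total_cells = sum(cells for _, cells in summary.values())
# Raw head output per cell: 4 sides x REG_MAX bins, plus one score per class.
channels_per_cell = 4 * REG_MAX + NUM_CLASSES

print(summary)
print("cells:", total_cells, "channels per cell:", channels_per_cell)
```

Every cell yields one candidate box, so NMS has to prune several thousand candidates per image.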

Key YOLOv8 Improvements over YOLOv5

Feature	YOLOv5	YOLOv8
Detection paradigm	Anchor-based	Anchor-free
Head	Coupled	Decoupled
Box loss	CIoU	DFL + CIoU
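Backbone	CSPDarknet with C3 blocks	CSPDarknet with C2f blocks (CSP bottleneck with 2 convolutions)
Augmentation	Mosaic throughout training	Mosaic, disabled for the final 10 epochs (close_mosaic=10)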

Loss Functions

Box Regression: CIoU + DFL

IoU: $IoU = \frac{|B_1 \cap B_2|}{|B_1 \cup B_2|}$

CIoU (Complete IoU): adds aspect ratio and center distance terms:

$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2} + \alpha v$$

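where $\rho$ is the Euclidean distance between the centers of the predicted box $\mathbf{b}$ and the ground-truth box $\mathbf{b}^{gt}$, $c$ is the diagonal length of the smallest box enclosing both, $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ measures aspect-ratio consistency, and $\alpha = \frac{v}{(1 - IoU) + v}$ is its trade-off weight.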
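
The CIoU formula can be checked numerically with a small pure-Python sketch. It assumes boxes in (x1, y1, x2, y2) corner format and uses the standard CIoU definitions of $v$ and $\alpha$. The function name and the eps guard are illustrative choices, not the Ultralytics implementation.

```python
import math


def ciou_loss(box, gt, eps=1e-7):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    gx1, gy1, gx2, gy2 = gt
    w, h = x2 - x1, y2 - y1
    gw, gh = gx2 - gx1, gy2 - gy1

    # IoU term
    inter_w = max(0.0, min(x2, gx2) - max(x1, gx1))
    inter_h = max(0.0, min(y2, gy2) - max(y1, gy1))
    inter = inter_w * inter_h
    union = w * h + gw * gh - inter + eps
    iou = inter / union

    # rho^2: squared distance between the two box centers
    rho2 = ((x1 + x2) / 2 - (gx1 + gx2) / 2) ** 2 + ((y1 + y2) / 2 - (gy1 + gy2) / 2) ** 2
    # c^2: squared diagonal of the smallest box enclosing both
    enclose_w = max(x2, gx2) - min(x1, gx1)
    enclose_h = max(y2, gy2) - min(y1, gy1)
    c2 = enclose_w ** 2 + enclose_h ** 2 + eps

    # v: aspect-ratio consistency, alpha: its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(gw / (gh + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))      # identical boxes: close to 0
print(ciou_loss((0, 0, 10, 10), (20, 20, 30, 30)))    # disjoint boxes: above 1
```

Unlike plain IoU loss, the center-distance term still provides a gradient when the boxes do not overlap at all.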

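DFL (Distribution Focal Loss): instead of regressing a single value for each box side, the model predicts a discrete probability distribution over bins $y_0, \dots, y_n$ and takes the expectation as the coordinate. This lets the model express uncertainty about ambiguous boundaries. For a target $y$ that falls between adjacent bins $y_l \le y \le y_r$, with $S_l$ and $S_r$ the softmax probabilities of those two bins:

$$\mathcal{L}_{DFL} = -\big((y_r - y)\log S_l + (y - y_l)\log S_r\big)$$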
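
In the Generalized Focal Loss formulation that DFL comes from, a continuous target lying between two adjacent bins is supervised by cross-entropy on just those two bins, each weighted by how close the target is to it. The sketch below shows this for a single box side, assuming unit-spaced integer bins 0..15 (the bin count is an assumption) and hypothetical helper names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dfl_loss(logits, target):
    """DFL for one box side.

    logits: one raw score per integer bin 0..len(logits)-1
    target: continuous distance in bin units, within [0, len(logits)-1]
    """
    probs = softmax(logits)
    left = min(int(math.floor(target)), len(logits) - 2)
    right = left + 1
    weight_left = right - target   # closer to the left bin -> larger weight
    weight_right = target - left
    return -(weight_left * math.log(probs[left]) + weight_right * math.log(probs[right]))


def expected_distance(logits):
    """Decode a distribution into a single distance (its expectation)."""
    return sum(i * p for i, p in enumerate(softmax(logits)))


uniform = [0.0] * 16
peaked = [0.0] * 16
peaked[3] = 10.0  # model is confident the side lies at bin 3

print(dfl_loss(uniform, 3.3), dfl_loss(peaked, 3.0))
print(expected_distance(uniform), expected_distance(peaked))
```

A flat distribution decodes to the middle of the range and is penalized heavily, while a distribution concentrated near the target decodes close to it and incurs little loss.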

Classification: Binary Cross-Entropy (not softmax!)

YOLOv8 applies BCE to each class independently, which allows multi-label detection (one object can be "cat" and "animal" simultaneously). A softmax classifier, by contrast, forces the class probabilities to sum to 1, so the classes compete with each other.
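
The difference is easy to see with numbers. In this sketch the three logits stand for hypothetical classes (cat, animal, car): independent sigmoids can score two classes highly at once, whereas softmax makes them share a total probability of 1.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bce_per_class(logits, targets):
    """Mean binary cross-entropy, treating every class as its own yes/no problem."""
    losses = []
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        losses.append(-(t * math.log(p) + (1 - t) * math.log(1 - p)))
    return sum(losses) / len(losses)


logits = [4.0, 3.5, -5.0]  # hypothetical classes: cat, animal, car
sig = [sigmoid(z) for z in logits]
soft = softmax(logits)

print("sigmoid:", [round(p, 3) for p in sig])   # cat and animal both near 1
print("softmax:", [round(p, 3) for p in soft])  # cat and animal split the mass
print("BCE with multi-label target [1, 1, 0]:", bce_per_class(logits, [1, 1, 0]))
```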

Training Best Practices

Custom Dataset Preparation (YOLO format)

dataset/
├── images/
│   ├── train/ [*.jpg]
│   └── val/   [*.jpg]
└── labels/
    ├── train/ [*.txt]  ← one file per image
    └── val/   [*.txt]

Each .txt file: one line per object:

<class_id> <x_center> <y_center> <width> <height>

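class_id is a zero-based integer index. The four box values are normalized to [0, 1]: x_center and width relative to image width, y_center and height relative to image height.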
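
Annotation tools often export pixel-space corner boxes, so a conversion step is usually needed. A minimal sketch, assuming the input box is (x1, y1, x2, y2) in pixels; the function name is hypothetical.

```python
def to_yolo_line(class_id, box_xyxy, img_w, img_h):
    """Convert a pixel-space (x1, y1, x2, y2) box into one YOLO label line."""
    x1, y1, x2, y2 = box_xyxy
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    width = (x2 - x1) / img_w
    height = (y2 - y1) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


line = to_yolo_line(0, (100, 200, 300, 400), img_w=640, img_h=480)
print(line)  # 0 0.312500 0.625000 0.312500 0.416667
```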

Training Script

from ultralytics import YOLO

# Fine-tune YOLOv8m on custom data
model = YOLO('yolov8m.pt')  # pre-trained on COCO
results = model.train(
    data='dataset.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    device=0,
    optimizer='AdamW',
    lr0=1e-3,
    lrf=0.01,        # final LR = lr0 × lrf
    warmup_epochs=3,
    cos_lr=True,
    close_mosaic=10, # disable mosaic for the last 10 epochs (stabilizes training)
    patience=50,     # stop early after 50 epochs without validation improvement
    val=True,
    save=True,
)
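
The script refers to dataset.yaml, which this lab does not show. The sketch below generates one for the directory layout described earlier, using the path / train / val / names keys of the Ultralytics dataset format. The root path and the class names are placeholders to replace with your own.

```python
def build_dataset_yaml(root, class_names):
    """Return the text of a dataset YAML for the images/ and labels/ layout."""
    lines = [
        f"path: {root}",        # dataset root directory
        "train: images/train",  # relative to path
        "val: images/val",
        "names:",
    ]
    # Class ids must match the <class_id> values used in the label files.
    lines += [f"  {i}: {name}" for i, name in enumerate(class_names)]
    return "\n".join(lines) + "\n"


yaml_text = build_dataset_yaml("dataset", ["cat", "dog"])  # placeholder classes
print(yaml_text)
# To use it for training: pathlib.Path("dataset.yaml").write_text(yaml_text)
```

The labels/ directory is not listed because the loader is expected to find label files by swapping images/ for labels/ in each image path.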

Transfer Learning Tips

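Freezing is opt-in: all layers train by default. For small datasets (< 1000 images), compare full fine-tuning against freezing the backbone (e.g. freeze=10) and keep whichever validates better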
Close mosaic augmentation for the last 10 epochs: mosaic stitches four images together and cuts objects at the seams, which can hurt final mAP
Use rect=True for variable aspect ratio datasets — reduces padding waste
Multi-scale training is off by default: set multi_scale=True to train on image sizes within ±50% of imgsz

Evaluation Metrics

mAP@0.5 and mAP@0.5:0.95

mAP@0.5:     Mean Average Precision at IoU threshold 0.5
mAP@0.5:0.95: COCO metric — average of mAP at IoU 0.5, 0.55, 0.6, ..., 0.95

Interpretation (rough rules of thumb; achievable values depend heavily on the dataset and task):
  mAP@0.5:0.95 > 0.6  → Excellent (publishable)
  mAP@0.5:0.95 > 0.4  → Good (production-ready for many applications)
  mAP@0.5:0.95 < 0.2  → Needs more data or different architecture
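
To make the two metrics concrete, here is a simplified single-class sketch. It assumes each detection has already been matched to a distinct ground-truth box and is described only by its confidence and its IoU with that box; real evaluators redo the matching at every threshold. mAP is then the mean of this AP over all classes. The function names and toy data are illustrative.

```python
def average_precision(detections, num_gt, iou_thr):
    """Area under the interpolated precision-recall curve for one class.

    detections: list of (confidence, iou_with_matched_gt)
    num_gt: number of ground-truth boxes of this class
    """
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    true_pos = 0
    precisions, recalls = [], []
    for rank, (_, iou) in enumerate(ranked, start=1):
        if iou >= iou_thr:
            true_pos += 1
        precisions.append(true_pos / rank)
        recalls.append(true_pos / num_gt)

    # Interpolate: make precision non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Integrate precision over recall.
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap


def ap_50_95(detections, num_gt):
    """Average AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [0.5 + 0.05 * i for i in range(10)]
    return sum(average_precision(detections, num_gt, t) for t in thresholds) / len(thresholds)


# Toy data: 4 ground-truth boxes, 4 detections (one poorly localized, one GT missed).
dets = [(0.9, 0.92), (0.8, 0.78), (0.7, 0.30), (0.6, 0.62)]
print("AP@0.5      =", average_precision(dets, 4, 0.5))
print("AP@0.5:0.95 =", ap_50_95(dets, 4))
```

The stricter metric is lower because loosely localized boxes stop counting as true positives once the IoU threshold rises.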

TensorRT Export for Deployment

import time

import numpy as np
import torch
from ultralytics import YOLO

model = YOLO('runs/detect/train/weights/best.pt')

# Export to TensorRT with FP16 precision
engine_path = model.export(
    format='engine',   # TensorRT .engine file
    device=0,
    half=True,         # FP16: typically faster, usually with negligible accuracy loss
    dynamic=False,     # static batch for max performance
    imgsz=640,
    batch=1,           # optimize for real-time (batch=1)
)

# Benchmark
model_rt = YOLO(engine_path)  # export() returns the path of the .engine file
img = torch.rand(1, 3, 640, 640).cuda()  # BCHW float tensor with values in [0, 1]
# Warmup
for _ in range(10):
    model_rt(img, verbose=False)
# Timed runs: synchronize so pending GPU work is included in each measurement
times = []
for _ in range(100):
    torch.cuda.synchronize()
    t = time.perf_counter()
    model_rt(img, verbose=False)
    torch.cuda.synchronize()
    times.append(time.perf_counter() - t)
print(f"Latency: {np.mean(times)*1000:.1f}ms ± {np.std(times)*1000:.1f}ms")

Interview Questions

Q: How does YOLOv8 anchor-free detection work? What's the advantage?

A: Instead of predicting offsets relative to predefined anchor boxes, YOLOv8 predicts the distance from each grid cell center to the 4 sides of the bounding box (LTRB format). This eliminates the need to design anchor sizes by hand, which is fragile: anchors that do not match the dataset lead to poor detection of objects with unusual sizes or aspect ratios. Anchor-free detection is also simpler to implement and generalizes more easily to new datasets.
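
The decoding step described in this answer can be sketched in a few lines. It assumes the four distances are expressed in grid units and that the anchor point is the cell center, which is then scaled by the stride of that detection level. The function name is hypothetical.

```python
def decode_ltrb(cell_col, cell_row, ltrb, stride):
    """Turn (left, top, right, bottom) distances into a pixel-space (x1, y1, x2, y2) box."""
    anchor_x, anchor_y = cell_col + 0.5, cell_row + 0.5  # cell center in grid units
    left, top, right, bottom = ltrb
    return (
        (anchor_x - left) * stride,
        (anchor_y - top) * stride,
        (anchor_x + right) * stride,
        (anchor_y + bottom) * stride,
    )


# Cell (col=10, row=5) on the stride-8 level, i.e. the 80x80 grid.
box = decode_ltrb(10, 5, (2.0, 1.5, 3.0, 2.5), stride=8)
print(box)  # (68.0, 32.0, 108.0, 64.0)
```

Because the distances are non-negative, the anchor point always lies inside the decoded box, whatever its size or aspect ratio.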

Q: Why does YOLOv8 use a decoupled head?

A: YOLOv3-v5 used a coupled head: the same feature representation predicted both class scores and box coordinates. Classification requires high semantic information (what is it?) while box regression requires precise spatial information (where exactly?). Decoupling allows each branch to specialize, which improves both tasks. The trade-off is slightly higher parameter count and compute, but the accuracy improvement more than justifies it.

Q: How would you handle detection of very small objects (< 5% of image area)?

A: Several strategies: (1) Train at higher resolution (1280×1280 instead of 640×640) — small objects get more pixels, but compute quadruples; (2) Use SAHI (Slicing Aided Hyper Inference): slice the image into overlapping tiles, run detection on each tile, merge detections with NMS; (3) Use P2 feature map (160×160) in addition to P3/P4/P5 — adds a higher-resolution detection head; (4) Data augmentation: copy-paste small objects into training images, random zoom-in on small object regions.
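
The slicing idea in strategy (2) can be illustrated without any library. This is a hand-rolled sketch of tile generation and coordinate mapping, not the SAHI package API; the tile size and overlap ratio are example values. Detections from all tiles would still need to be merged with NMS afterwards.

```python
def tile_starts(length, tile, overlap):
    """Start offsets of tiles that cover [0, length) with a fractional overlap."""
    if length <= tile:
        return [0]
    step = max(1, round(tile * (1 - overlap)))
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)  # last tile sits flush with the image edge
    return starts


def make_tiles(img_w, img_h, tile=640, overlap=0.2):
    """Return tile rectangles as (x1, y1, x2, y2), clipped to the image."""
    return [
        (x, y, min(x + tile, img_w), min(y + tile, img_h))
        for y in tile_starts(img_h, tile, overlap)
        for x in tile_starts(img_w, tile, overlap)
    ]


def to_full_image(box, tile_rect):
    """Shift a box detected inside a tile back into full-image coordinates."""
    offset_x, offset_y = tile_rect[0], tile_rect[1]
    x1, y1, x2, y2 = box
    return (x1 + offset_x, y1 + offset_y, x2 + offset_x, y2 + offset_y)


tiles = make_tiles(1920, 1080)
print(len(tiles), tiles[0], tiles[-1])
print(to_full_image((10, 20, 50, 60), tiles[-1]))
```

The overlap ensures that an object cut by one tile border appears whole in a neighboring tile.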
