Lab 01: YOLOv8 — Real-Time Object Detection
Architecture Overview
YOLOv8 follows the single-stage detection paradigm: one forward pass produces all detections.
Input (640×640×3)
│
Backbone (CSPDarknet + C2f blocks)
• Extracts multi-scale features: P3 (80×80), P4 (40×40), P5 (20×20)
│
Neck (PAN-FPN)
• Path Aggregation Network: fuses features top-down and bottom-up
• Enables detection at 3 scales simultaneously
│
Head (Decoupled head — separate branches for cls and reg)
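• Anchor-free: one prediction per grid cell at each scale, with no predefined anchor boxes
• Predicts per cell: 4 side distances (left, top, right, bottom, each as a DFL distribution) + 80 class scores; the decoded output is [x, y, w, h, cls_scores × 80]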
│
Post-processing
• Sigmoid activation on class scores
• Box decoding: the expected value of each predicted distribution gives the side distances (DFL itself is a training loss, not an inference step)
• NMS per class
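To make the scale arithmetic above concrete, here is a small sketch that counts how many predictions one forward pass produces. It assumes strides of 8, 16, and 32 (inferred from the 80/40/20 grid sizes listed for a 640×640 input); the function name `prediction_grid` is only illustrative.

```python
def prediction_grid(imgsz=640, strides=(8, 16, 32), num_classes=80):
    """Return per-scale grid sizes, total prediction count, and channels per prediction."""
    # Assumed strides: P3 = 8, P4 = 16, P5 = 32
    grids = {f"P{i + 3}": imgsz // s for i, s in enumerate(strides)}
    total = sum(g * g for g in grids.values())
    # After decoding, each location carries 4 box values plus one score per class
    channels = 4 + num_classes
    return grids, total, channels


grids, total, channels = prediction_grid()
print(grids)            # {'P3': 80, 'P4': 40, 'P5': 20}
print(total, channels)  # 8400 84
```

With these assumptions, NMS has to filter 8400 candidate boxes per image, which is why a confidence threshold is normally applied first.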
Key YOLOv8 Improvements over YOLOv5
| Feature | YOLOv5 | YOLOv8 |
|---|---|---|
| Detection paradigm | Anchor-based | Anchor-free |
| Head | Coupled | Decoupled |
| Box loss | CIoU | DFL + CIoU |
| Backbone block | C3 | C2f (CSP bottleneck with 2 convolutions) |
| Augmentation | Mosaic | Mosaic, disabled for the last 10 epochs (close_mosaic) |
Loss Functions
Box Regression: CIoU + DFL
IoU: $IoU = \frac{|B_1 \cap B_2|}{|B_1 \cup B_2|}$
CIoU (Complete IoU): adds aspect ratio and center distance terms:
$$\mathcal{L}_{CIoU} = 1 - IoU + \frac{\rho^2(\mathbf{b}, \mathbf{b}^{gt})}{c^2} + \alpha v$$
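where $\rho^2$ is the squared distance between the box centers, $c$ is the diagonal length of the smallest box enclosing both boxes, $v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2$ measures aspect-ratio consistency, and $\alpha = \frac{v}{(1 - IoU) + v}$ is its trade-off weight.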
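The CIoU terms can be checked numerically with a minimal pure-Python sketch. It assumes boxes in (x1, y1, x2, y2) format, treats $c^2$ as the squared diagonal of the smallest enclosing box, and uses the standard definitions $v = \frac{4}{\pi^2}(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h})^2$ and $\alpha = \frac{v}{(1 - IoU) + v}$. The function name and the `eps` guard are illustrative, not taken from any library.

```python
import math


def ciou_loss(box, gt, eps=1e-9):
    """CIoU loss for two boxes given as (x1, y1, x2, y2)."""
    # Intersection and union areas
    iw = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    ih = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    inter = iw * ih
    w, h = box[2] - box[0], box[3] - box[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    iou = inter / (w * h + wg * hg - inter + eps)

    # rho^2: squared distance between the two box centers
    rho2 = ((box[0] + box[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((box[1] + box[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2

    # c^2: squared diagonal of the smallest box enclosing both boxes
    cw = max(box[2], gt[2]) - min(box[0], gt[0])
    ch = max(box[3], gt[3]) - min(box[1], gt[1])
    c2 = cw ** 2 + ch ** 2 + eps

    # v: aspect-ratio consistency; alpha: its trade-off weight
    v = (4 / math.pi ** 2) * (math.atan(wg / (hg + eps)) - math.atan(w / (h + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v


print(ciou_loss((0, 0, 10, 10), (0, 0, 10, 10)))    # close to 0 for identical boxes
print(ciou_loss((0, 0, 10, 10), (20, 20, 30, 30)))  # above 1 for disjoint boxes
```

Unlike plain IoU, the loss keeps growing as disjoint boxes move further apart, so it still provides a gradient when there is no overlap.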
DFL (Distribution Focal Loss): instead of predicting a single coordinate value, predict a distribution over discrete values. Allows the model to express uncertainty:
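$$\mathcal{L}_{DFL} = -\big((y_r - y)\log S_l + (y - y_l)\log S_r\big)$$
where $y$ is the target distance, $y_l$ and $y_r = y_l + 1$ are the two discrete bins on either side of it, and $S_l$, $S_r$ are the softmax probabilities the model assigns to those bins.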
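A minimal sketch of DFL for a single box side, in pure Python. It assumes 16 bins per side (which I believe matches the Ultralytics default, but treat it as an assumption) and a target already expressed in bin units. The loss is a cross-entropy against the two bins surrounding the target, weighted by how close the target is to each; decoding takes the expected bin index. The function names are illustrative.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dfl_loss(logits, target):
    """DFL for one side: weighted cross-entropy on the two bins around `target`."""
    probs = softmax(logits)
    left = min(int(math.floor(target)), len(logits) - 2)  # keep left + 1 in range
    right = left + 1
    w_left, w_right = right - target, target - left  # closer bin gets more weight
    return -(w_left * math.log(probs[left]) + w_right * math.log(probs[right]))


def dfl_decode(logits):
    """Predicted distance: expected bin index under the softmax distribution."""
    return sum(i * p for i, p in enumerate(softmax(logits)))


uniform = [0.0] * 16                 # maximally uncertain prediction
peaked = [0.0] * 16
peaked[3], peaked[4] = 8.0, 7.0      # mass concentrated around bins 3 and 4
print(dfl_decode(uniform))           # 7.5
print(dfl_loss(uniform, 3.3), dfl_loss(peaked, 3.3))
```

The peaked prediction gets a much lower loss for a target of 3.3 than the uniform one, which is the sense in which the distribution "expresses uncertainty".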
Classification: Binary Cross-Entropy (not softmax!)
YOLOv8 uses BCE on each class independently — allows multi-label detection (one object can be "cat" and "animal" simultaneously). This is different from a softmax classifier.
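The difference between per-class sigmoid and softmax can be seen on a toy example. The three logits and the class names (cat, animal, car) are hypothetical values chosen for illustration.

```python
import math


def sigmoid_scores(logits):
    """Independent per-class probabilities, as used with BCE."""
    return [1 / (1 + math.exp(-x)) for x in logits]


def softmax_scores(logits):
    """Mutually exclusive class probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def bce_loss(logits, targets):
    """Mean binary cross-entropy over classes; targets are 0 or 1 per class."""
    probs = sigmoid_scores(logits)
    terms = [t * math.log(p) + (1 - t) * math.log(1 - p) for p, t in zip(probs, targets)]
    return -sum(terms) / len(terms)


logits = [4.0, 3.5, -5.0]   # hypothetical classes: cat, animal, car
print(sigmoid_scores(logits))       # cat and animal are both confident
print(softmax_scores(logits))       # cat and animal compete for the same mass
print(bce_loss(logits, [1, 1, 0]))  # low loss for the multi-label target
```

With sigmoid, "cat" and "animal" can both score above 0.9; with softmax they must share a total of 1, so at most one of them can exceed 0.5.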
Training Best Practices
Custom Dataset Preparation (YOLO format)
dataset/
├── images/
│ ├── train/ [*.jpg]
│ └── val/ [*.jpg]
└── labels/
├── train/ [*.txt] ← one file per image
└── val/ [*.txt]
Each .txt file: one line per object:
<class_id> <x_center> <y_center> <width> <height>
All values normalized to [0, 1] relative to image size.
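As a sketch of how a label line is produced, the helper below converts a pixel-space corner box into the normalized format described above. The function name `to_yolo_line` and the six-decimal formatting are illustrative choices, not part of any tool.

```python
def to_yolo_line(class_id, x_min, y_min, x_max, y_max, img_w, img_h):
    """Convert a pixel (x_min, y_min, x_max, y_max) box into one YOLO label line."""
    x_center = (x_min + x_max) / 2 / img_w
    y_center = (y_min + y_max) / 2 / img_h
    width = (x_max - x_min) / img_w
    height = (y_max - y_min) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# A 200×200 px box in a 640×480 image
print(to_yolo_line(0, 100, 200, 300, 400, 640, 480))
# 0 0.312500 0.625000 0.312500 0.416667
```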
Training Script
from ultralytics import YOLO

# Fine-tune YOLOv8m on custom data
model = YOLO('yolov8m.pt')  # pre-trained on COCO
results = model.train(
    data='dataset.yaml',
    epochs=100,
    imgsz=640,
    batch=16,
    device=0,
    optimizer='AdamW',
    lr0=1e-3,
    lrf=0.01,          # final LR = lr0 × lrf
    warmup_epochs=3,
    cos_lr=True,
    # Training augmentations (mosaic, HSV, flips) are on by default;
    # `augment` is a predict-time (test-time augmentation) flag, so it is omitted here.
    close_mosaic=10,   # disable mosaic for the last 10 epochs (stabilizes training)
    patience=50,       # early stopping
    val=True,
    save=True,
)
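To see what `lr0`, `lrf`, and `cos_lr=True` imply together, here is a rough sketch of a half-cosine decay from `lr0` down to `lr0 × lrf`. It ignores the warmup epochs and is only an approximation of the shape of the schedule, not the library's exact implementation; `cosine_lr` is an illustrative name.

```python
import math


def cosine_lr(epoch, epochs=100, lr0=1e-3, lrf=0.01):
    """Learning rate at `epoch`, decaying from lr0 to lr0 * lrf along a half cosine."""
    progress = (1 - math.cos(math.pi * epoch / epochs)) / 2  # 0 at start, 1 at end
    return lr0 * (1 + progress * (lrf - 1))


for epoch in (0, 50, 100):
    print(epoch, f"{cosine_lr(epoch):.2e}")
# 0 1.00e-03
# 50 5.05e-04
# 100 1.00e-05
```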
Transfer Learning Tips
- Freezing is not automatic: by default YOLOv8 fine-tunes all layers from the pre-trained weights. For small datasets (< 1000 images), freezing the backbone (for example freeze=10) is worth trying if the model overfits
- Close mosaic for the last 10 epochs: mosaic produces unrealistic crops at tile boundaries, which can hurt final mAP
- Use rect=True for datasets with variable aspect ratios to reduce padding waste
- Multi-scale training is off by default; set multi_scale=True to train on sizes within roughly ±50% of imgsz
Evaluation Metrics
mAP@0.5 and mAP@0.5:0.95
mAP@0.5: Mean Average Precision at IoU threshold 0.5
mAP@0.5:0.95: COCO metric — average of mAP at IoU 0.5, 0.55, 0.6, ..., 0.95
Interpretation (rough rules of thumb; usable thresholds depend on the dataset and task):
mAP@0.5:0.95 > 0.6 → Excellent (publishable)
mAP@0.5:0.95 > 0.4 → Good (production-ready for many applications)
mAP@0.5:0.95 < 0.2 → Needs more data or different architecture
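The relationship between the two metrics can be sketched for a single class. This simplified version assumes each detection has already been matched to a distinct ground-truth box and is described by its confidence and its IoU with that box; it uses all-point interpolation, whereas the COCO evaluator uses a 101-point variant and handles duplicate matches. mAP is then the mean of this value over classes. All names are illustrative.

```python
def average_precision(dets, n_gt, iou_thr):
    """AP for one class. dets: list of (confidence, IoU with matched ground truth)."""
    dets = sorted(dets, key=lambda d: d[0], reverse=True)
    tp = 0
    precisions, recalls = [], []
    for rank, (_, iou) in enumerate(dets, start=1):
        tp += int(iou >= iou_thr)  # true positive if overlap is high enough
        precisions.append(tp / rank)
        recalls.append(tp / n_gt)
    # Make precision non-increasing from left to right (precision envelope)
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Area under the precision-recall curve as a sum of rectangles
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(precisions, recalls):
        ap += p * (r - prev_recall)
        prev_recall = r
    return ap


def ap_50_95(dets, n_gt):
    """Average AP over IoU thresholds 0.50, 0.55, ..., 0.95."""
    thresholds = [(50 + 5 * i) / 100 for i in range(10)]
    return sum(average_precision(dets, n_gt, t) for t in thresholds) / len(thresholds)


# Toy example: 3 ground-truth objects, 3 detections (confidence, IoU)
dets = [(0.9, 0.97), (0.8, 0.72), (0.7, 0.30)]
print(round(average_precision(dets, 3, 0.5), 3))  # 0.667
print(round(ap_50_95(dets, 3), 3))                # 0.5
```

The second detection counts as correct at IoU 0.5 but not at 0.75 and above, which is why the averaged metric is lower: it rewards tight localization, not just finding the object.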
TensorRT Export for Deployment
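import time

import numpy as np
import torch
from ultralytics import YOLO

model = YOLO('runs/detect/train/weights/best.pt')

# Export to TensorRT with FP16 precision; export() returns the path of the engine file
engine_path = model.export(
    format='engine',  # TensorRT .engine file
    device=0,
    half=True,        # FP16: usually faster, with a small or negligible accuracy change
    dynamic=False,    # static batch for max performance
    imgsz=640,
    batch=1,          # optimize for real-time (batch=1)
)

# Benchmark
model_rt = YOLO(engine_path, task='detect')
img = torch.rand(1, 3, 640, 640).cuda()  # BCHW float tensor in [0, 1]

# Warmup
for _ in range(10):
    model_rt(img, verbose=False)

# Timed runs (each call includes pre- and post-processing)
times = []
for _ in range(100):
    torch.cuda.synchronize()  # make sure pending GPU work does not leak into the timing
    t = time.perf_counter()
    model_rt(img, verbose=False)
    torch.cuda.synchronize()
    times.append(time.perf_counter() - t)

print(f"Latency: {np.mean(times)*1000:.1f}ms ± {np.std(times)*1000:.1f}ms")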
Interview Questions
Q: How does YOLOv8 anchor-free detection work? What's the advantage?
A: Instead of predicting offsets relative to predefined anchor boxes, YOLOv8 predicts the distance from each grid cell center to the 4 sides of the bounding box (LTRB format). This eliminates the need to manually design anchor sizes, which is fragile: wrong anchor scales lead to poor detection of unusual aspect ratios. Anchor-free detection is also simpler to implement and generalizes more easily to new datasets.
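A minimal sketch of the LTRB decoding step described in this answer. It assumes the distances are expressed in grid units and that the reference point is the cell center (cell index + 0.5), scaled to pixels by the stride; `decode_ltrb` is an illustrative name.

```python
def decode_ltrb(cell_x, cell_y, ltrb, stride):
    """Turn (left, top, right, bottom) distances in grid units into an xyxy pixel box."""
    cx, cy = cell_x + 0.5, cell_y + 0.5  # cell center in grid units
    left, top, right, bottom = ltrb
    return (
        (cx - left) * stride,
        (cy - top) * stride,
        (cx + right) * stride,
        (cy + bottom) * stride,
    )


# Cell (10, 5) on the stride-8 (P3) grid
print(decode_ltrb(10, 5, (2.0, 1.5, 3.0, 2.5), stride=8))
# (68.0, 32.0, 108.0, 64.0)
```

No anchor width or height appears anywhere in the computation, so nothing has to be tuned to the dataset's box statistics.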
Q: Why does YOLOv8 use a decoupled head?
A: YOLOv3-v5 used a coupled head: the same feature representation predicted both class scores and box coordinates. Classification requires high semantic information (what is it?) while box regression requires precise spatial information (where exactly?). Decoupling allows each branch to specialize, which improves both tasks. The trade-off is slightly higher parameter count and compute, but the accuracy improvement more than justifies it.
Q: How would you handle detection of very small objects (< 5% of image area)?
A: Several strategies: (1) Train at higher resolution (1280×1280 instead of 640×640) — small objects get more pixels, but compute quadruples; (2) Use SAHI (Slicing Aided Hyper Inference): slice the image into overlapping tiles, run detection on each tile, merge detections with NMS; (3) Use P2 feature map (160×160) in addition to P3/P4/P5 — adds a higher-resolution detection head; (4) Data augmentation: copy-paste small objects into training images, random zoom-in on small object regions.
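The slicing idea in strategy (2) can be sketched without any library. This is not the SAHI package's API; the function names, the 640 px tile size, and the 20% overlap are assumptions for illustration. It generates overlapping tile coordinates and shifts tile-local boxes back into full-image coordinates; the merged boxes would then go through NMS as described above.

```python
def tile_origins(length, tile, overlap):
    """Start offsets of overlapping tiles that cover [0, length)."""
    if length <= tile:
        return [0]
    step = max(1, round(tile * (1 - overlap)))
    starts = list(range(0, length - tile, step))
    starts.append(length - tile)  # last tile sits flush with the border
    return starts


def make_tiles(img_w, img_h, tile=640, overlap=0.2):
    """Return tiles as (x1, y1, x2, y2) boxes in full-image coordinates."""
    return [
        (x, y, min(x + tile, img_w), min(y + tile, img_h))
        for y in tile_origins(img_h, tile, overlap)
        for x in tile_origins(img_w, tile, overlap)
    ]


def to_full_image(box, tile_box):
    """Shift a tile-local xyxy box back into full-image coordinates."""
    ox, oy = tile_box[0], tile_box[1]
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)


tiles = make_tiles(1920, 1080)
print(len(tiles))   # 8
print(tiles[-1])    # (1280, 440, 1920, 1080)
print(to_full_image((10, 20, 30, 40), tiles[-1]))  # (1290, 460, 1310, 480)
```

Each tile is fed to the detector at full 640×640 resolution, so an object that covers a tiny fraction of a 1920×1080 frame occupies a much larger fraction of the network input.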