Lab 03 — Model Evaluation Metrics
Phase 2: ML Fundamentals | Week 5-6
Building a model is easy. Knowing if it actually works — and where it fails — is the job. Master these metrics and you'll catch problems that loss curves will never show you.
Learning Objectives
- Implement confusion matrix, Precision, Recall, F1 from scratch
- Build ROC curves and understand AUC interpretation
- Compute IoU and mAP (COCO 101-point interpolation) from scratch
- Know when to use each metric and how to defend your choices in interviews
Theory
Classification Metrics
Given a confusion matrix for class $c$:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | TP | FN |
| Actually Negative | FP | TN |
$$\text{Precision} = \frac{TP}{TP + FP} \quad \text{Recall} = \frac{TP}{TP + FN}$$
$$F_1 = \frac{2 \cdot P \cdot R}{P + R} \quad F_\beta = \frac{(1+\beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$
- $\beta > 1$: weight recall more (e.g., cancer detection — missing a case is costly)
- $\beta < 1$: weight precision more (e.g., spam filter — false positives destroy trust)
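The formulas above can be checked with a minimal pure-Python sketch. The helper names `binary_counts` and `f_beta` are illustrative rather than the lab's actual API, and returning 0.0 for an empty denominator is one common convention, assumed here.

```python
def binary_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels in {0, 1}."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 1:
            tp += 1
        elif p == 1 and t == 0:
            fp += 1
        elif p == 0 and t == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f_beta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta). Empty denominators yield 0.0 (assumed convention)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta**2 * precision + recall
    score = (1 + beta**2) * precision * recall / denom if denom else 0.0
    return precision, recall, score


# Toy data: 4 positives, 6 negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = binary_counts(y_true, y_pred)
precision, recall, f1 = f_beta(tp, fp, fn)
_, _, f2 = f_beta(tp, fp, fn, beta=2.0)
print(tp, fp, fn, tn)
print(precision, recall, f1, f2)
```

In this toy example recall (0.75) is higher than precision (0.6), so $F_2$ comes out above $F_1$: weighting recall more rewards the model for the thing it already does better.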
ROC Curve & AUC
Sweep the threshold $t$ from 1 down to 0 and compute TPR and FPR at each value:
$$TPR = \frac{TP}{TP + FN} \quad FPR = \frac{FP}{FP + TN}$$
- AUC = 1.0: perfect separation
- AUC = 0.5: random classifier
- AUC < 0.5: worse than random (check label encoding!)
When to use ROC vs PR curve: Use the PR curve when classes are heavily imbalanced. ROC can look optimistic on imbalanced data because the large TN count keeps FPR small even when false positives outnumber true positives.
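A threshold sweep can be sketched without any libraries by sorting samples by score and accumulating TP and FP counts. This is an illustrative version, not the lab's `roc_auc_from_scratch()`. It assumes binary labels in {0, 1} with at least one positive and one negative, and it integrates the curve with the trapezoidal rule.

```python
def roc_curve_points(y_true, scores):
    """Return (FPR, TPR) points, sweeping the threshold from high to low."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Visit samples from the highest score to the lowest
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for rank, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a tied-score group
        is_last = rank == len(order) - 1
        if is_last or scores[order[rank + 1]] != scores[i]:
            points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
points = roc_curve_points(y_true, scores)
auc = auc_trapezoid(points)
print(points)
print(auc)
```

Here one negative (0.4) outranks one positive (0.35), so 3 of the 4 positive-negative pairs are ordered correctly and the AUC is 0.75.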
IoU (Intersection over Union)
$$\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$
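For axis-aligned boxes in corner format, $A = [x_1^A, y_1^A, x_2^A, y_2^A]$ and $B = [x_1^B, y_1^B, x_2^B, y_2^B]$ with $x_1 < x_2$ and $y_1 < y_2$:
$$x_1^I = \max(x_1^A, x_1^B), \quad y_1^I = \max(y_1^A, y_1^B), \quad x_2^I = \min(x_2^A, x_2^B), \quad y_2^I = \min(y_2^A, y_2^B)$$
$$\text{inter} = \max(0, x_2^I - x_1^I) \cdot \max(0, y_2^I - y_1^I)$$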
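The box formulas translate almost line for line into code. The sketch below handles a single pair of boxes in `[x1, y1, x2, y2]` format. It is a scalar illustration rather than the vectorized version the lab asks for, and returning 0.0 for a zero-area union is an assumed convention.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Corners of the intersection rectangle
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)

    # Clamp at 0 so disjoint boxes give zero intersection, not a negative area
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou([0, 0, 2, 2], [1, 1, 3, 3]))  # overlap of 1 over union of 7
print(iou([0, 0, 1, 1], [2, 2, 3, 3]))  # disjoint boxes
print(iou([0, 0, 2, 2], [0, 0, 2, 2]))  # identical boxes
```

The `max(0, ...)` clamp matters: without it, two disjoint boxes would produce negative widths and heights whose product could come out as a spurious positive intersection.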
mAP — Mean Average Precision
For each class $c$:
- Sort all detections by confidence score (descending)
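- For each detection: TP if its IoU with a not-yet-matched GT box of the same class is ≥ threshold, else FP (a duplicate detection of an already matched GT box counts as FP)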
- Compute precision/recall curve
- Compute AP using 101-point COCO interpolation:
$$AP = \frac{1}{101} \sum_{r \in \{0, 0.01, \ldots, 1.0\}} \max_{\tilde{r} \geq r} P(\tilde{r})$$
$$\text{mAP} = \frac{1}{C} \sum_{c=1}^C AP_c$$
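A minimal sketch of the 101-point interpolation follows. It assumes the detections for one class are already sorted by descending confidence and labeled TP (1) or FP (0), and that `n_gt` is the number of ground-truth boxes for that class. The function name `average_precision_101` and the class names in the example are illustrative only.

```python
def average_precision_101(tp_flags, n_gt):
    """AP from 1/0 TP flags (sorted by descending confidence) and the GT count."""
    if n_gt == 0:
        return 0.0  # assumed convention when a class has no ground truth

    # Running precision and recall after each detection
    precisions, recalls = [], []
    tp = 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / k)
        recalls.append(tp / n_gt)

    # Precision envelope: P_interp(r) = max precision at any recall >= r
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sample the envelope at r = 0.00, 0.01, ..., 1.00
    total = 0.0
    j = 0
    for step in range(101):
        r = step / 100
        while j < len(recalls) and recalls[j] < r:
            j += 1
        # Recall levels the detector never reaches contribute precision 0
        total += precisions[j] if j < len(recalls) else 0.0
    return total / 101


# Hypothetical per-class results
ap_by_class = {
    "car": average_precision_101([1, 0, 1], n_gt=2),
    "person": average_precision_101([1, 1], n_gt=2),
}
mean_ap = sum(ap_by_class.values()) / len(ap_by_class)
print(ap_by_class, mean_ap)
```

For the "car" example, the envelope is 1.0 for recall up to 0.5 (51 sample points) and 2/3 beyond that (50 sample points), giving an AP of about 0.835.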
mAP@0.5: threshold = 0.5. Classic VOC metric.
mAP@0.5:0.95: average over IoU thresholds [0.5, 0.55, ..., 0.95]. Stricter COCO metric.
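To see why the stricter thresholds are harder, the sketch below labels detections as TP or FP at each IoU threshold using a simplified greedy matching for one class in one image. It is an assumption-laden illustration: the official COCO evaluation also handles crowd regions, per-image detection limits, and area ranges, none of which appear here, and the names `match_detections` and `flags_by_threshold` are hypothetical.

```python
def iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, gt_boxes, iou_threshold):
    """Return 1/0 TP flags in descending-confidence order.

    detections: list of (confidence, box). Each GT box can be matched at most once.
    """
    matched = set()
    flags = []
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        # Find the best-overlapping GT box that is still unmatched
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(gt_boxes):
            if idx in matched:
                continue
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx >= 0 and best_iou >= iou_threshold:
            matched.add(best_idx)
            flags.append(1)
        else:
            flags.append(0)
    return flags


# IoU thresholds 0.50, 0.55, ..., 0.95
thresholds = [round(0.5 + 0.05 * i, 2) for i in range(10)]

gt_boxes = [[0, 0, 10, 10]]
detections = [
    (0.9, [0, 0, 10, 8]),   # confident but loose box, IoU 0.8 with the GT
    (0.8, [0, 0, 10, 10]),  # less confident but exact box, IoU 1.0
]
flags_by_threshold = {t: match_detections(detections, gt_boxes, t) for t in thresholds}
print(flags_by_threshold[0.5], flags_by_threshold[0.85])
```

At a threshold of 0.5 the loose box claims the GT and the exact box becomes a duplicate FP. At 0.85 the loose box fails and the exact box takes the match instead. The AP for each threshold would then be computed from these flags and averaged.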
What the Lab Covers
| Function | Concept | Interview Frequency |
|---|---|---|
| confusion_matrix() | From-scratch implementation | ★★★★★ |
| precision_recall_f1() | Macro/micro averaging | ★★★★★ |
| roc_auc_from_scratch() | Threshold sweep | ★★★★ |
| iou() | Vectorized box IoU | ★★★★★ |
| compute_ap() | 101-point interpolation | ★★★★★ |
| map_by_class() | Full mAP computation | ★★★★ |
| calibration_curve() | Reliability diagram | ★★★ |
Pandas in Practice
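import pandas as pd
# Typical evaluation workflow with pandas: one row per detection.
# ids, class_names, scores, tp_flags, fp_flags are equal-length sequences
# produced by the detection-to-GT matching step.
results_df = pd.DataFrame({
    'image_id': ids,
    'class': class_names,
    'confidence': scores,
    'tp': tp_flags,
    'fp': fp_flags,
})
# Per-class breakdown
per_class = results_df.groupby('class').agg(
    precision=('tp', 'mean'),  # TP / (TP + FP): every detection is either TP or FP
    n_tp=('tp', 'sum'),
    n_detections=('tp', 'count'),
)
# Recall is TP / (number of GT boxes), which the detection table alone cannot supply.
# Assumption: n_gt_per_class is a Series of GT box counts indexed by class name.
per_class['recall'] = per_class['n_tp'] / n_gt_per_class
print(per_class)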
Interview Questions
Q: Your model has 99% accuracy on a medical dataset. Is it good? A: Probably not. If 99% of samples are negative (healthy), a model that always predicts negative achieves 99% accuracy. Use recall (sensitivity) and precision, or AUC-PR.
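The accuracy trap is easy to reproduce. The sketch below uses made-up data (10 sick patients among 1,000) and an "always predict healthy" baseline.

```python
# Synthetic labels: 1 = sick, 0 = healthy (1% prevalence)
y_true = [1] * 10 + [0] * 990
# Baseline that predicts "healthy" for everyone
y_pred = [0] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```

The baseline scores 99% accuracy while finding none of the sick patients, which is why recall is the number to look at first here.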
Q: Explain the precision-recall tradeoff. A: Lowering the confidence threshold increases recall (fewer FN) but decreases precision (more FP). The tradeoff is governed by the score distribution overlap between positives and negatives.
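A small sweep over hand-picked thresholds makes the tradeoff concrete. The labels and scores below are invented for illustration, and reporting precision as 1.0 when nothing is predicted positive is an assumed convention.

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


labels = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.85, 0.7, 0.6, 0.5, 0.4, 0.3, 0.1]

results = {t: precision_recall_at(labels, scores, t) for t in (0.8, 0.55, 0.35)}
for t, (p, r) in results.items():
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

As the threshold drops from 0.8 to 0.35, recall climbs from 0.5 to 1.0 while precision falls from 1.0 to about 0.67. Recall can only rise as the threshold is lowered. Precision usually falls but is not guaranteed to do so at every step.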
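Q: mAP@0.5:0.95 vs mAP@0.5, which should you optimize? A: mAP@0.5:0.95 is the primary COCO metric and is harder because it rewards tight localization. mAP@0.5 is the classic VOC metric. In production, mAP@0.5 is often the more meaningful number when approximate boxes are good enough (e.g., counting objects or detecting presence); prefer mAP@0.5:0.95 when downstream steps depend on box accuracy. Report both.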
Q: How do you handle class imbalance in multi-class classification? A: (1) Use macro-averaged F1 (treats all classes equally). (2) Use weighted loss (inverse frequency or focal loss). (3) Oversample rare classes (SMOTE for tabular, copy-paste augmentation for detection).
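The difference between macro and micro averaging shows up clearly on a toy imbalanced problem. The sketch below uses hypothetical class labels "a" and "b" and the identity $F_1 = 2TP / (2TP + FP + FN)$, which is algebraically equivalent to $2PR / (P + R)$.

```python
def per_class_counts(y_true, y_pred):
    """Return {class: {'tp', 'fp', 'fn'}} for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted class gains a false positive
            counts[t]["fn"] += 1  # true class gains a false negative
    return counts


def f1_from_counts(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# 8 samples of majority class "a", 2 of minority class "b"
y_true = ["a"] * 8 + ["b"] * 2
y_pred = ["a"] * 8 + ["a", "b"]

counts = per_class_counts(y_true, y_pred)

# Macro: average the per-class F1 scores, so every class weighs the same
macro_f1 = sum(f1_from_counts(**c) for c in counts.values()) / len(counts)

# Micro: pool the counts first, so every sample weighs the same
micro_f1 = f1_from_counts(
    sum(c["tp"] for c in counts.values()),
    sum(c["fp"] for c in counts.values()),
    sum(c["fn"] for c in counts.values()),
)
print(f"macro={macro_f1:.3f} micro={micro_f1:.3f}")
```

Micro F1 comes out at 0.9, dominated by the majority class, while macro F1 is about 0.80 because the minority class (F1 of 2/3) counts as much as the majority class.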
Run
pip install -r requirements.txt
python solution.py
# Outputs saved to outputs/