Lab 03 — Model Evaluation Metrics

Phase 2: ML Fundamentals | Week 5-6

Building a model is easy. Knowing if it actually works — and where it fails — is the job. Master these metrics and you'll catch problems that loss curves will never show you.


Learning Objectives

  • Implement confusion matrix, Precision, Recall, F1 from scratch
  • Build ROC curves and understand AUC interpretation
  • Compute IoU and mAP (COCO 101-point interpolation) from scratch
  • Know when to use each metric and how to defend your choices in interviews

Theory

Classification Metrics

Given a confusion matrix for class $c$:

|                   | Predicted Positive | Predicted Negative |
|-------------------|--------------------|--------------------|
| Actually Positive | TP                 | FN                 |
| Actually Negative | FP                 | TN                 |
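
A minimal sketch of how the table generalizes to several classes, assuming integer labels in `0..num_classes-1`. The function names and the toy labels are illustrative and are not taken from the lab's `solution.py`. Rows are true classes and columns are predicted classes, matching the layout above; `per_class_counts` collapses the matrix to one-vs-rest counts for a single class $c$.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """Return a num_classes x num_classes list of lists.

    Rows index the true class, columns index the predicted class.
    """
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix


def per_class_counts(matrix, c):
    """Collapse the matrix to one-vs-rest (TP, FP, FN, TN) for class c."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[c][c]
    fn = sum(matrix[c]) - tp                 # true class c, predicted something else
    fp = sum(row[c] for row in matrix) - tp  # predicted c, true class is something else
    tn = total - tp - fn - fp
    return tp, fp, fn, tn


# Toy labels, made up for illustration.
y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)                       # [[1, 1, 0], [0, 2, 0], [1, 0, 2]]
print(per_class_counts(cm, 2))  # (2, 0, 1, 4)
```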

$$\text{Precision} = \frac{TP}{TP + FP} \quad \text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot P \cdot R}{P + R} \quad F_\beta = \frac{(1+\beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$

  • $\beta > 1$: weight recall more (e.g., cancer detection — missing a case is costly)
  • $\beta < 1$: weight precision more (e.g., spam filter — false positives destroy trust)
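
The formulas above can be sketched directly from the confusion-matrix counts. This is an illustrative helper, not the lab's `precision_recall_f1()`; returning `0.0` when a denominator is zero is an assumed convention, and the counts in the example are made up.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) from raw counts.

    Assumed convention: any metric with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = beta ** 2 * precision + recall
    fbeta = (1 + beta ** 2) * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, fbeta


# Made-up counts: precision 0.8, recall 0.5.
for beta in (0.5, 1.0, 2.0):
    p, r, f = precision_recall_fbeta(tp=8, fp=2, fn=8, beta=beta)
    print(f"beta={beta}: P={p:.3f} R={r:.3f} F={f:.3f}")
# beta=0.5: P=0.800 R=0.500 F=0.714
# beta=1.0: P=0.800 R=0.500 F=0.615
# beta=2.0: P=0.800 R=0.500 F=0.541
```

Because recall is the weaker of the two numbers here, the score drops as $\beta$ grows and recall gets more weight.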

ROC Curve & AUC

Sweep the decision threshold $t$ from 1 down to 0 (predict positive when score ≥ $t$) and compute TPR and FPR at each value:

$$TPR = \frac{TP}{TP + FN} \quad FPR = \frac{FP}{FP + TN}$$

  • AUC = 1.0: perfect separation
  • AUC = 0.5: random classifier
  • AUC < 0.5: worse than random (check label encoding!)

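When to use ROC vs PR curve: Use the PR curve when classes are heavily imbalanced. ROC can look optimistic on imbalanced data because TN is large, which keeps FPR small even when false positives outnumber true positives. Precision has no TN term, so it exposes that failure.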
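
A minimal sketch of the threshold sweep with trapezoidal AUC, in pure Python. The function names are hypothetical (the lab's `roc_auc_from_scratch()` may be structured differently), the scores are made up, and the code assumes binary 0/1 labels with at least one positive and one negative sample. Tied scores are grouped so that they produce a single ROC point.

```python
def roc_curve_points(y_true, scores):
    """Return (fpr, tpr) lists from lowering the threshold past each distinct score."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fpr, tpr = [0.0], [0.0]   # threshold above every score: nothing predicted positive
    tp = fp = 0
    for k, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        is_last = k == len(order) - 1
        if is_last or scores[order[k + 1]] != scores[i]:
            tpr.append(tp / n_pos)
            fpr.append(fp / n_neg)
    return fpr, tpr


def auc_trapezoid(fpr, tpr):
    """Area under the piecewise-linear ROC curve."""
    area = 0.0
    for k in range(1, len(fpr)):
        area += (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
    return area


# Made-up scores: one negative (0.7) outranks one positive (0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
fpr, tpr = roc_curve_points(y_true, scores)
print(round(auc_trapezoid(fpr, tpr), 4))  # 0.8889
```

The result equals 8/9: of the 9 positive-negative pairs, 8 are ranked correctly, which is the pairwise-ranking reading of AUC.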

IoU (Intersection over Union)

$$\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

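For axis-aligned boxes $A = [x_1^A, y_1^A, x_2^A, y_2^A]$ and $B = [x_1^B, y_1^B, x_2^B, y_2^B]$ with $x_1 < x_2$ and $y_1 < y_2$:

$$x_1^I = \max(x_1^A, x_1^B), \quad x_2^I = \min(x_2^A, x_2^B)$$

$$y_1^I = \max(y_1^A, y_1^B), \quad y_2^I = \min(y_2^A, y_2^B)$$

$$\text{inter} = \max(0, x_2^I - x_1^I) \cdot \max(0, y_2^I - y_1^I)$$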
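
The box formulas translate almost line for line into code. The sketch below handles a single pair of boxes in pure Python (the lab's `iou()` is described as vectorized, so this is not that implementation). It assumes continuous `[x1, y1, x2, y2]` coordinates without the "+1 pixel" convention some datasets use, and returns `0.0` for a degenerate zero-area union.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as [x1, y1, x2, y2]."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Corners of the intersection rectangle.
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)

    # Clamping at 0 handles boxes that do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou([0, 0, 2, 2], [1, 1, 3, 3]))  # 1/7, about 0.1429
print(iou([0, 0, 1, 1], [2, 2, 3, 3]))  # 0.0 (disjoint)
print(iou([0, 0, 2, 2], [0, 0, 2, 2]))  # 1.0 (identical)
```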

mAP — Mean Average Precision

For each class $c$:

  1. Sort all detections by confidence score (descending)
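  2. For each detection: TP if its IoU with a not-yet-matched GT box of the same class is ≥ threshold, else FP (each GT box can be matched at most once, so duplicate detections count as FP)
  3. Compute the precision/recall curve from the cumulative TP and FP counts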
  4. Compute AP using 101-point COCO interpolation:

$$AP = \frac{1}{101} \sum_{r \in \{0, 0.01, \ldots, 1.0\}} \max_{\tilde{r} \geq r} P(\tilde{r})$$

$$\text{mAP} = \frac{1}{C} \sum_{c=1}^C AP_c$$

mAP@0.5: threshold = 0.5. Classic VOC metric.
mAP@0.5:0.95: average over IoU thresholds [0.5, 0.55, ..., 0.95]. Stricter COCO metric.
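
The four steps can be sketched for a single class as below. This is a simplified illustration, not the lab's `compute_ap()` or `map_by_class()`: the function names, the `(image_id, confidence, box)` tuple layout, and the toy boxes are assumptions, and COCO details such as crowd regions, area ranges, and the per-image detection cap are left out. Matching is greedy against the best still-unmatched GT box in the same image.

```python
def box_iou(a, b):
    """IoU of two [x1, y1, x2, y2] boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def match_detections(detections, gt_boxes, iou_thr=0.5):
    """Steps 1-2: return 1/0 TP flags ordered by descending confidence.

    detections: list of (image_id, confidence, box) for one class.
    gt_boxes:   {image_id: [box, ...]} for the same class.
    """
    used = {img: [False] * len(boxes) for img, boxes in gt_boxes.items()}
    flags = []
    for img, _, box in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt in enumerate(gt_boxes.get(img, [])):
            if used[img][idx]:
                continue  # each GT box can be matched only once
            overlap = box_iou(box, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx >= 0 and best_iou >= iou_thr:
            used[img][best_idx] = True
            flags.append(1)
        else:
            flags.append(0)
    return flags


def compute_ap(tp_flags, n_gt):
    """Steps 3-4: 101-point interpolated AP from confidence-sorted TP flags."""
    if n_gt == 0:
        return 0.0
    precisions, recalls = [], []
    tp = 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / k)
        recalls.append(tp / n_gt)
    # Precision envelope: at each point keep the best precision to its right.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    # Sample the envelope at recall levels 0.00, 0.01, ..., 1.00.
    total, j = 0.0, 0
    for step in range(101):
        r = step / 100
        while j < len(recalls) and recalls[j] < r:
            j += 1
        total += precisions[j] if j < len(recalls) else 0.0
    return total / 101


# Toy data: two GT boxes; the second detection duplicates the first.
gts = {"img1": [[0, 0, 10, 10]], "img2": [[0, 0, 10, 10]]}
dets = [("img1", 0.9, [0, 0, 10, 10]),
        ("img1", 0.8, [1, 1, 10, 10]),
        ("img2", 0.7, [0, 0, 10, 9])]
flags = match_detections(dets, gts, iou_thr=0.5)
ap = compute_ap(flags, n_gt=2)
print(flags, round(ap, 4))  # [1, 0, 1] 0.835
```

mAP is then the plain mean of the per-class AP values, and mAP@0.5:0.95 repeats the whole procedure for each IoU threshold and averages the results.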


What the Lab Covers

| Function               | Concept                     | Interview Frequency |
|------------------------|-----------------------------|---------------------|
| confusion_matrix()     | From-scratch implementation | ★★★★★               |
| precision_recall_f1()  | Macro/micro averaging       | ★★★★★               |
| roc_auc_from_scratch() | Threshold sweep             | ★★★★                |
| iou()                  | Vectorized box IoU          | ★★★★★               |
| compute_ap()           | 101-point interpolation     | ★★★★★               |
| map_by_class()         | Full mAP computation        | ★★★★                |
| calibration_curve()    | Reliability diagram         | ★★★                 |

Pandas in Practice

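import pandas as pd

# Typical evaluation workflow with pandas.
# ids, class_names, scores, tp_flags, fp_flags are equal-length, per-detection
# sequences from the matching step (tp_flags and fp_flags hold 0/1).
results_df = pd.DataFrame({
    'image_id': ids,
    'class': class_names,
    'confidence': scores,
    'tp': tp_flags,
    'fp': fp_flags,
})

# Per-class breakdown
per_class = results_df.groupby('class').agg(
    n_tp=('tp', 'sum'),
    precision=('tp', 'mean'),      # TP / (TP + FP): every detection is TP or FP
    n_detections=('tp', 'count'),
)

# Recall needs the number of ground-truth boxes per class, which the detection
# table does not contain. gt_counts is an assumed Series indexed by class name.
per_class['recall'] = per_class['n_tp'] / gt_counts.reindex(per_class.index)
print(per_class)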

Interview Questions

Q: Your model has 99% accuracy on a medical dataset. Is it good? A: Probably not. If 99% of samples are negative (healthy), a model that always predicts negative achieves 99% accuracy. Use recall (sensitivity) and precision, or AUC-PR.
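
The accuracy trap in this answer is easy to reproduce. The sketch below uses a made-up dataset of 1,000 samples with a 1% positive rate and a "model" that always predicts negative.

```python
# Made-up dataset: 10 positives (sick), 990 negatives (healthy).
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # degenerate model: always predicts "healthy"

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)

accuracy = correct / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.99 recall=0.00
```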

Q: Explain the precision-recall tradeoff. A: Lowering the confidence threshold increases recall (fewer FN) but decreases precision (more FP). The tradeoff is governed by the score distribution overlap between positives and negatives.
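
A small sketch of the tradeoff on made-up scores, where one negative (0.7) outranks one positive (0.6). The helper name is hypothetical, and reporting `0.0` for an undefined precision or recall is an assumed convention.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return precision, recall


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for t in (0.75, 0.5, 0.1):
    p, r = precision_recall_at(y_true, scores, t)
    print(f"t={t}: P={p:.2f} R={r:.2f}")
# t=0.75: P=1.00 R=0.67
# t=0.5: P=0.75 R=1.00
# t=0.1: P=0.50 R=1.00
```

Lowering the threshold from 0.75 to 0.5 recovers the missed positive at the cost of one false positive; lowering it further only adds false positives.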

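Q: mAP@0.5:0.95 vs mAP@0.5: which should you optimize? A: mAP@0.5:0.95 is the primary COCO metric and is harder because it rewards tight localization. mAP@0.5 is the classic VOC metric and only requires loose overlap. If the application only needs to know roughly where an object is, mAP@0.5 is often the more practically meaningful number; if downstream steps depend on accurate boxes, optimize mAP@0.5:0.95. Report both.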

Q: How do you handle class imbalance in multi-class classification? A: (1) Use macro-averaged F1 (treats all classes equally). (2) Use weighted loss (inverse frequency or focal loss). (3) Oversample rare classes (SMOTE for tabular, copy-paste augmentation for detection).
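
Point (1) can be illustrated by comparing macro and micro F1 on a made-up imbalanced dataset where the model ignores the rare class. The function name is hypothetical, and a class with no TP, FP, or FN is scored `0.0` by assumption. The code uses the identity $F_1 = 2TP / (2TP + FP + FN)$, which is equivalent to the precision/recall form.

```python
def macro_micro_f1(y_true, y_pred, classes):
    """Return (macro_f1, micro_f1, per_class_f1) for single-label predictions."""
    per_class = {}
    tp_sum = fp_sum = fn_sum = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        per_class[c] = 2 * tp / denom if denom > 0 else 0.0
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    # Macro: average the per-class scores, so every class counts equally.
    macro = sum(per_class.values()) / len(classes)
    # Micro: pool the counts first, so frequent classes dominate.
    micro_denom = 2 * tp_sum + fp_sum + fn_sum
    micro = 2 * tp_sum / micro_denom if micro_denom > 0 else 0.0
    return macro, micro, per_class


# Made-up data: 90 samples of class 0, 10 of class 1, model always predicts 0.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
macro, micro, per_class = macro_micro_f1(y_true, y_pred, classes=[0, 1])
print(f"macro={macro:.3f} micro={micro:.3f}")  # macro=0.474 micro=0.900
```

Micro F1 looks healthy because the majority class dominates the pooled counts, while macro F1 is pulled down by the rare class the model never predicts.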


Run

pip install -r requirements.txt
python solution.py
# Outputs saved to outputs/