Lab 03 — Model Evaluation Metrics

Phase 2: ML Fundamentals | Week 5-6

Building a model is easy. Knowing if it actually works — and where it fails — is the job. Master these metrics and you'll catch problems that loss curves will never show you.

Learning Objectives

Implement the confusion matrix, precision, recall, and F1 from scratch
Build ROC curves and interpret AUC
Compute IoU and mAP (COCO 101-point interpolation) from scratch
Know when to use each metric and how to defend your choices in interviews

Theory

Classification Metrics

Given a confusion matrix for class $c$:

	Predicted Positive	Predicted Negative
Actually Positive	TP	FN
Actually Negative	FP	TN

$$\text{Precision} = \frac{TP}{TP + FP} \quad \text{Recall} = \frac{TP}{TP + FN}$$

$$F_1 = \frac{2 \cdot P \cdot R}{P + R} \quad F_\beta = \frac{(1+\beta^2) \cdot P \cdot R}{\beta^2 \cdot P + R}$$

$\beta > 1$: weight recall more (e.g., cancer detection — missing a case is costly)
$\beta < 1$: weight precision more (e.g., spam filter — false positives destroy trust)
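The formulas above can be sketched in a few lines of plain Python. This is a minimal illustration, not the lab's `solution.py`: the function names `binary_counts` and `precision_recall_fbeta` are hypothetical, and reporting a ratio with a zero denominator as 0.0 is an assumed convention.

```python
def binary_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN), treating `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Precision, recall, and F-beta from raw counts.

    Assumed convention: a ratio with a zero denominator is reported as 0.0.
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    b2 = beta * beta
    denom = b2 * precision + recall
    fbeta = (1 + b2) * precision * recall / denom if denom else 0.0
    return precision, recall, fbeta


# Toy labels: 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = binary_counts(y_true, y_pred)      # 3, 2, 1, 4
p, r, f1 = precision_recall_fbeta(tp, fp, fn)       # P = 0.60, R = 0.75
_, _, f2 = precision_recall_fbeta(tp, fp, fn, beta=2.0)
_, _, f05 = precision_recall_fbeta(tp, fp, fn, beta=0.5)
print(p, r, f1, f2, f05)
```

Because recall (0.75) is higher than precision (0.60) here, $F_2$ comes out above $F_1$ and $F_{0.5}$ below it, which is the weighting behavior described by the two $\beta$ cases.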

ROC Curve & AUC

Sweep the threshold $t$ from 1 down to 0, predicting positive when the score is ≥ $t$, and compute TPR and FPR at each value:

$$TPR = \frac{TP}{TP + FN} \quad FPR = \frac{FP}{FP + TN}$$

AUC = 1.0: perfect separation
AUC = 0.5: random classifier
AUC < 0.5: worse than random (check label encoding!)

When to use ROC vs PR curve: Use the PR curve when classes are heavily imbalanced. ROC can look optimistic on imbalanced data because the large TN count keeps FPR small even when false positives outnumber true positives.
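A threshold sweep and the area under the resulting curve can be sketched as follows. The names `roc_points` and `auc_trapezoid` are hypothetical, labels are assumed to be 0/1 with both classes present, and the trapezoidal rule is one common way to integrate the curve.

```python
def roc_points(y_true, scores):
    """Return [(FPR, TPR), ...] as the threshold sweeps from high to low.

    Assumes labels are 0/1 and that both classes are present.
    """
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    # Visit samples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    tp = fp = 0
    i = 0
    while i < len(order):
        current = scores[order[i]]
        # Samples with tied scores cross the threshold together.
        while i < len(order) and scores[order[i]] == current:
            if y_true[order[i]] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc_trapezoid(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_trapezoid(roc_points(y_true, scores))
print(round(auc, 4))  # 0.8889: 8 of the 9 (positive, negative) pairs are ranked correctly
```

Negating the scores of the same data gives an AUC of 1/9, the "worse than random" case that usually points to flipped labels or scores.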

IoU (Intersection over Union)

$$\text{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}$$

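For axis-aligned boxes $A = [x_1^A, y_1^A, x_2^A, y_2^A]$ and $B = [x_1^B, y_1^B, x_2^B, y_2^B]$ with $x_1 < x_2$ and $y_1 < y_2$:

$$x_1^I = \max(x_1^A, x_1^B), \quad x_2^I = \min(x_2^A, x_2^B)$$

$$y_1^I = \max(y_1^A, y_1^B), \quad y_2^I = \min(y_2^A, y_2^B)$$

$$\text{inter} = \max(0, x_2^I - x_1^I) \cdot \max(0, y_2^I - y_1^I)$$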
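For a single pair of boxes, the computation above might look like the sketch below. The name `box_iou` is hypothetical (the lab's own `iou()` is described as vectorized, which this is not), and returning 0.0 for degenerate zero-area inputs is an assumed convention.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; max(0, ...) handles disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Assumed convention: two zero-area boxes have IoU 0.0.
    return inter / union if union > 0 else 0.0


print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 4 + 4 - 1 = 7
print(box_iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes
```

The first call yields 1/7: two 2×2 boxes that share a 1×1 corner. Without the `max(0, ...)` clamp, disjoint boxes would produce a positive "intersection" from two negative side lengths.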

mAP — Mean Average Precision

For each class $c$:

Sort all detections by confidence score (descending)
For each detection: TP if its IoU with a not-yet-matched GT box of class $c$ is ≥ threshold, else FP (a second detection of an already-matched GT box counts as FP)
Compute precision/recall curve
Compute AP using 101-point COCO interpolation:

$$AP = \frac{1}{101} \sum_{r \in \{0, 0.01, \ldots, 1.0\}} \max_{\tilde{r} \geq r} P(\tilde{r})$$

$$\text{mAP} = \frac{1}{C} \sum_{c=1}^C AP_c$$

mAP@0.5: IoU threshold = 0.5. Classic VOC metric.
mAP@0.5:0.95: mean of mAP over the ten IoU thresholds 0.5, 0.55, ..., 0.95. Stricter COCO metric.
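The AP steps for one class at one IoU threshold can be sketched as below. This is an illustration under stated assumptions rather than the reference COCO code: `average_precision_101` and `mean_ap` are hypothetical names, the TP/FP flags are assumed to be already computed and sorted by descending confidence, and recall levels the detector never reaches contribute a precision of 0.

```python
def average_precision_101(tp_flags, n_gt):
    """101-point interpolated AP for one class at one IoU threshold.

    tp_flags: 1 (TP) or 0 (FP) per detection, sorted by descending confidence.
    n_gt:     number of ground-truth boxes of this class.
    """
    if n_gt == 0:
        return 0.0  # assumed convention for a class with no ground truth

    # Cumulative precision and recall after each detection.
    precisions, recalls = [], []
    tp = 0
    for k, flag in enumerate(tp_flags, start=1):
        tp += flag
        precisions.append(tp / k)
        recalls.append(tp / n_gt)

    # Precision envelope: precisions[i] becomes the max precision at recall >= recalls[i].
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])

    # Sample the envelope at r = 0.00, 0.01, ..., 1.00.
    total = 0.0
    j = 0
    for step in range(101):
        r = step / 100
        while j < len(recalls) and recalls[j] < r:
            j += 1
        total += precisions[j] if j < len(recalls) else 0.0
    return total / 101


def mean_ap(ap_per_class):
    """mAP as the unweighted mean of per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


ap = average_precision_101([1, 1, 0, 1, 0], n_gt=4)
print(round(ap, 4))  # 0.6906
```

In the example, the envelope is 1.0 up to recall 0.50, 0.75 up to recall 0.75, and 0 beyond that because the fourth ground-truth box is never found, so the sum is 51 × 1.0 + 25 × 0.75 = 69.75 and AP = 69.75 / 101. For mAP@0.5:0.95, the same routine would be run once per IoU threshold (the TP flags change with the threshold) and the results averaged.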

What the Lab Covers

Function	Concept	Interview Frequency
`confusion_matrix()`	From-scratch implementation	★★★★★
`precision_recall_f1()`	Macro/micro averaging	★★★★★
`roc_auc_from_scratch()`	Threshold sweep	★★★★
`iou()`	Vectorized box IoU	★★★★★
`compute_ap()`	101-point interpolation	★★★★★
`map_by_class()`	Full mAP computation	★★★★
`calibration_curve()`	Reliability diagram	★★★

Pandas in Practice

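import pandas as pd

# Typical evaluation workflow with pandas.
# ids, class_names, scores, tp_flags, fp_flags are equal-length sequences with
# one entry per detection, produced by the IoU matching step.
results_df = pd.DataFrame({
    'image_id': ids,
    'class': class_names,
    'confidence': scores,
    'tp': tp_flags,
    'fp': fp_flags,
})

# Per-class breakdown. Precision needs only the detections. Recall also needs
# the number of ground-truth boxes per class, which this table does not hold:
# gt_counts is an assumed pd.Series indexed by class name.
per_class = results_df.groupby('class').agg(
    tp=('tp', 'sum'),
    n_detections=('tp', 'count'),
)
per_class['precision'] = per_class['tp'] / per_class['n_detections']
per_class['recall'] = per_class['tp'] / gt_counts.reindex(per_class.index)
print(per_class)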

Interview Questions

Q: Your model has 99% accuracy on a medical dataset. Is it good?
A: Probably not. If 99% of samples are negative (healthy), a model that always predicts negative achieves 99% accuracy while detecting no positive cases. Report recall (sensitivity) and precision, or the area under the PR curve (AUC-PR).
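The accuracy trap is easy to reproduce with synthetic labels. The sketch below assumes a made-up dataset of 1000 patients with 10 positive cases and a degenerate model that always predicts "healthy".

```python
# Synthetic, made-up data: 10 sick patients (1) among 1000, 990 healthy (0).
y_true = [1] * 10 + [0] * 990
# A degenerate "model" that predicts healthy for everyone.
y_pred = [0] * 1000

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
recall = tp / (tp + fn)

print(accuracy)  # 0.99
print(recall)    # 0.0
```

Accuracy is 99% while every sick patient is missed, which is why recall has to be reported alongside it.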

Q: Explain the precision-recall tradeoff.
A: Lowering the confidence threshold can only raise recall (fewer FN) and usually lowers precision (more FP). How sharp the tradeoff is depends on how much the score distributions of positives and negatives overlap.
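The tradeoff shows up directly when the same scores are cut at different thresholds. In this sketch the function name `precision_recall_at` and the toy scores are made up for illustration, and precision is reported as 0.0 when nothing is predicted positive (an assumed convention).

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = fp = fn = 0
    for t, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and t == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif t == 1:
            fn += 1
    # Assumed convention: empty denominators give 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy scores where positives and negatives overlap in the middle of the range.
y_true = [1, 1, 0, 1, 0, 1, 0, 0]
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25]

for threshold in (0.8, 0.6, 0.4):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(threshold, round(p, 3), round(r, 3))
# 0.8 1.0 0.5
# 0.6 0.75 0.75
# 0.4 0.667 1.0
```

Each step down in threshold recovers another positive but also admits a negative that scores in the overlapping range.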

Q: mAP@0.5:0.95 vs mAP@0.5: which should you optimize?
A: mAP@0.5:0.95 is the primary COCO metric and is harder because it rewards tight localization. mAP@0.5 is the VOC metric. In production, mAP@0.5 is often the more meaningful number when approximate localization is good enough for the downstream task. Always report both.

Q: How do you handle class imbalance in multi-class classification?
A: (1) Use macro-averaged F1 (treats all classes equally). (2) Use weighted loss (inverse frequency or focal loss). (3) Oversample rare classes (SMOTE for tabular, copy-paste augmentation for detection).
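The difference between macro and micro averaging is clearest on a model that ignores the rare class. The helper names below (`per_class_counts`, `f1_from_counts`, `macro_f1`, `micro_f1`) and the toy labels are hypothetical; F1 is written as $2TP / (2TP + FP + FN)$, which is algebraically the same as the harmonic mean of precision and recall.

```python
def per_class_counts(y_true, y_pred, classes):
    """One-vs-rest (TP, FP, FN) for each class."""
    counts = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        counts[c] = (tp, fp, fn)
    return counts


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN); 0.0 when the class never appears (assumed convention)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(f1_from_counts(*c) for c in counts.values()) / len(counts)


def micro_f1(counts):
    """F1 over pooled counts: every sample counts equally."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1_from_counts(tp, fp, fn)


# Toy data: class 'b' is rare and the model never predicts it.
y_true = ['a'] * 8 + ['b'] * 2
y_pred = ['a'] * 10

counts = per_class_counts(y_true, y_pred, classes=['a', 'b'])
print(round(macro_f1(counts), 4))  # 0.4444
print(round(micro_f1(counts), 4))  # 0.8
```

Micro F1 stays at 0.8 because the common class dominates the pooled counts, while macro F1 drops to about 0.44 because the rare class contributes an F1 of 0.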

Run

pip install -r requirements.txt
python solution.py
# Outputs saved to outputs/
