Lab 04 — Pandas & Scikit-Learn Deep Dive
Phase 2: ML Fundamentals | Week 6
Pandas and scikit-learn sit behind much of the tabular work in a computer vision (CV) production system: data cleaning, feature pipelines, hyperparameter search, and experiment tracking all run through these libraries. They also come up frequently in ML interviews.
Learning Objectives
- Master pandas for annotation management, EDA, and experiment tracking
- Build sklearn Pipeline + ColumnTransformer for reproducible feature engineering
- Implement cross-validation strategies for imbalanced datasets
- Run GridSearchCV / RandomizedSearchCV and analyze results
- Build a full annotation analysis workflow from raw CSV to insights
Part 1: Pandas for CV Data Science
Reading & Exploring Annotation Files
import pandas as pd
df = pd.read_csv("annotations.csv")
# Shape, types, nulls
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
print(df.describe()) # stats for numeric columns
print(df["class"].value_counts())
# Column selection
bbox_cols = ["xmin", "ymin", "xmax", "ymax"]
boxes = df[bbox_cols] # DataFrame
classes = df["class"] # Series
Feature Engineering on Annotations
# Derived bbox features, all in one assign call (chainable)
df = df.assign(
    width=df["xmax"] - df["xmin"],
    height=df["ymax"] - df["ymin"],
    area=lambda d: d["width"] * d["height"],
    aspect_ratio=lambda d: d["width"] / d["height"].clip(lower=1e-6),
    cx=lambda d: (d["xmin"] + d["xmax"]) / 2,
    cy=lambda d: (d["ymin"] + d["ymax"]) / 2,
    normalized_area=lambda d: d["area"] / (d["img_w"] * d["img_h"]),
)
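As a small worked example of the assign pattern above, the sketch below runs it on two made-up annotations (the values and the name `toy` are illustrative, not from the lab data). It shows that a lambda can use a column created earlier in the same call.

```python
import pandas as pd

# Two hypothetical annotations on a 100x50 image (values invented for illustration).
toy = pd.DataFrame({
    "xmin": [10, 0], "ymin": [10, 0],
    "xmax": [30, 100], "ymax": [20, 50],
    "img_w": [100, 100], "img_h": [50, 50],
})

toy = toy.assign(
    width=toy["xmax"] - toy["xmin"],
    height=toy["ymax"] - toy["ymin"],
    # Lambdas receive the frame as built so far, so "width" and "height" exist here.
    area=lambda d: d["width"] * d["height"],
    normalized_area=lambda d: d["area"] / (d["img_w"] * d["img_h"]),
)
print(toy[["width", "height", "area", "normalized_area"]])
```

The second box covers the whole image, so its normalized area is 1.0, which makes a quick sanity check for the formula.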
GroupBy — The Workhorse Operation
# Per-image statistics
per_image = df.groupby("image_id").agg(
    n_objects=("class", "count"),
    n_classes=("class", "nunique"),
    mean_area=("area", "mean"),
    classes_list=("class", list),
).reset_index()
# Per-class statistics
per_class = df.groupby("class").agg(
    count=("image_id", "count"),
    mean_area=("area", "mean"),
    median_conf=("confidence", "median"),
    images=("image_id", "nunique"),
).sort_values("count", ascending=False)
# Pivot: image_id (rows) × class (columns), useful for co-occurrence analysis
pivot = df.pivot_table(
    index="image_id", columns="class",
    values="confidence", aggfunc="max", fill_value=0,
)
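One way to get from the pivot to an actual co-occurrence matrix is sketched below. It assumes that a confidence of exactly 0 never occurs for a real detection, so `pivot > 0` can stand in for "class present in image"; the toy data and the names `ann`, `presence`, and `cooc` are illustrative.

```python
import pandas as pd

# Toy annotations: "car" appears in three images, "person" in two of them.
ann = pd.DataFrame({
    "image_id": ["a", "a", "b", "b", "c"],
    "class": ["car", "person", "car", "person", "car"],
    "confidence": [0.9, 0.8, 0.7, 0.6, 0.95],
})

pivot = ann.pivot_table(
    index="image_id", columns="class",
    values="confidence", aggfunc="max", fill_value=0,
)

# Binary presence matrix (images x classes).
presence = (pivot > 0).astype(int)

# classes x classes: entry [i, j] counts images containing both class i and class j.
# The diagonal holds the number of images containing each class.
cooc = presence.T @ presence
print(cooc)
```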
Joining Predictions with Ground Truth
preds = pd.read_csv("predictions.csv") # image_id, class, confidence, bbox...
gt = pd.read_csv("ground_truth.csv") # image_id, class, bbox...
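# Merge on image_id: yields every prediction × ground-truth pair within an image
# (a per-image cross join), so match or filter the pairs (e.g. by IoU) before counting
merged = pd.merge(preds, gt, on="image_id", suffixes=("_pred", "_gt"))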
# Find missed classes (FN at class level)
pred_classes = set(preds["class"].unique())
gt_classes = set(gt["class"].unique())
missed = gt_classes - pred_classes
print(f"Classes never predicted: {missed}")
# Error analysis: highest-area confident false positives
# (assumes preds already carries "iou_with_gt" and "area" columns)
fp_df = preds[(preds["iou_with_gt"] < 0.5) & (preds["confidence"] > 0.7)]
print(fp_df.nlargest(20, "area"))
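The false-positive filter above relies on an `iou_with_gt` column that the lab text does not derive. A minimal sketch of one way to compute it follows, assuming boxes in `xmin, ymin, xmax, ymax` form and a hypothetical unique `pred_id` column. It takes the best IoU against any ground-truth box in the same image and ignores class labels and one-to-one matching, so it is a simplification of a proper detection matcher.

```python
import pandas as pd

preds = pd.DataFrame({
    "pred_id": [0, 1],  # hypothetical unique key per prediction
    "image_id": ["a", "a"],
    "xmin": [0, 50], "ymin": [0, 50], "xmax": [10, 60], "ymax": [10, 60],
})
gt = pd.DataFrame({
    "image_id": ["a"],
    "xmin": [0], "ymin": [0], "xmax": [10], "ymax": [5],
})

# Every prediction x ground-truth pair within the same image.
pairs = preds.merge(gt, on="image_id", suffixes=("_pred", "_gt"))

# Intersection width/height, clipped at 0 for non-overlapping boxes.
inter_w = (pairs[["xmax_pred", "xmax_gt"]].min(axis=1)
           - pairs[["xmin_pred", "xmin_gt"]].max(axis=1)).clip(lower=0)
inter_h = (pairs[["ymax_pred", "ymax_gt"]].min(axis=1)
           - pairs[["ymin_pred", "ymin_gt"]].max(axis=1)).clip(lower=0)
inter = inter_w * inter_h

area_pred = (pairs["xmax_pred"] - pairs["xmin_pred"]) * (pairs["ymax_pred"] - pairs["ymin_pred"])
area_gt = (pairs["xmax_gt"] - pairs["xmin_gt"]) * (pairs["ymax_gt"] - pairs["ymin_gt"])
pairs["iou"] = inter / (area_pred + area_gt - inter)

# Best IoU per prediction; predictions in images without any GT get 0.
best_iou = pairs.groupby("pred_id")["iou"].max().rename("iou_with_gt")
preds = preds.join(best_iou, on="pred_id")
preds["iou_with_gt"] = preds["iou_with_gt"].fillna(0.0)
print(preds[["pred_id", "iou_with_gt"]])
```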
Cleaning & Validation
# Remove out-of-bounds boxes
df = df[
    (df["xmin"] >= 0) & (df["ymin"] >= 0) &
    (df["xmax"] <= df["img_w"]) & (df["ymax"] <= df["img_h"]) &
    (df["xmin"] < df["xmax"]) & (df["ymin"] < df["ymax"])
]
# Remove tiny boxes (likely annotation noise)
df = df[df["area"] > 100]
# Handle missing confidence scores
df["confidence"] = df["confidence"].fillna(1.0) # GT has no confidence → 1
# Deduplicate (exact duplicate rows)
df = df.drop_duplicates()
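Before dropping rows it can help to see how many rows each rule rejects. The sketch below builds one boolean column per check on made-up data; the names `raw`, `checks`, `failures`, and `clean` are illustrative, and the rules mirror the filters above.

```python
import pandas as pd

raw = pd.DataFrame({
    "xmin": [10, -5, 40, 10],
    "ymin": [10, 10, 40, 10],
    "xmax": [60, 30, 30, 60],
    "ymax": [60, 30, 80, 60],
    "img_w": [100] * 4,
    "img_h": [100] * 4,
})

# One boolean column per validation rule (True means the row passes).
checks = pd.DataFrame({
    "in_bounds": (raw["xmin"] >= 0) & (raw["ymin"] >= 0)
                 & (raw["xmax"] <= raw["img_w"]) & (raw["ymax"] <= raw["img_h"]),
    "positive_size": (raw["xmin"] < raw["xmax"]) & (raw["ymin"] < raw["ymax"]),
    "not_duplicate": ~raw.duplicated(),
})

failures = (~checks).sum()          # rows rejected by each rule
clean = raw[checks.all(axis=1)]     # rows passing every rule
print(failures)
print(f"kept {len(clean)} of {len(raw)} rows")
```

Keeping the per-rule counts makes it easier to notice when a single rule, such as a wrong image size, is responsible for most of the removed rows.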
apply / transform for Custom Logic
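# apply: runs an arbitrary function per group. Returning a Series yields several
# values per group (a MultiIndex result); unstack() turns them into columns.
# Note: these are fractions of boxes above each IoU threshold, not true AP.
iou_stats = df.groupby("class")["iou"].apply(
    lambda x: pd.Series({
        "frac_iou_gt_50": (x > 0.5).mean(),
        "frac_iou_gt_75": (x > 0.75).mean(),
    })
).unstack()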
# transform: returns same-length Series (useful for adding group stats to rows)
df["class_mean_area"] = df.groupby("class")["area"].transform("mean")
df["area_vs_class_mean"] = df["area"] / df["class_mean_area"]
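The difference in output shape between an aggregation and `transform` is easiest to see on a tiny frame. This is an illustrative sketch with made-up values; `toy`, `agg_result`, and `tr_result` are names introduced here.

```python
import pandas as pd

toy = pd.DataFrame({
    "class": ["car", "car", "person"],
    "area": [100.0, 300.0, 50.0],
})

# Aggregation: one value per group, indexed by the group key.
agg_result = toy.groupby("class")["area"].agg("mean")

# transform: one value per input row, aligned with the original index,
# so it can be assigned straight back as a new column.
tr_result = toy.groupby("class")["area"].transform("mean")

print(len(agg_result), len(tr_result))
```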
Part 2: Scikit-Learn Pipelines
Why Pipelines?
A Pipeline chains preprocessing + model into one object. Benefits:
- Prevents data leakage (the scaler is fit ONLY on train, then applied to test)
- One fit() / predict() call
- Fully compatible with GridSearchCV
raw CSV
    │
    ▼
ColumnTransformer
    ├── numeric:     [impute → scale]
    └── categorical: [impute → one-hot]
    │
    ▼
Classifier / Regressor
Building a Pipeline
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
numeric_features = ["area", "aspect_ratio", "cx", "cy", "width", "height"]
categoric_features = ["dataset_split", "scene_type"]
numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # sparse_output requires scikit-learn >= 1.2 (older releases call it `sparse`)
    ("onehot", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])
preprocessor = ColumnTransformer([
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categoric_features),
])
pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42)),
])
# X_train / X_test are DataFrames containing the columns listed above
pipeline.fit(X_train, y_train)
preds = pipeline.predict(X_test)
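The snippet above assumes `X_train`, `X_test`, and `y_train` already exist. A self-contained sketch on synthetic data is given below so the whole flow can be run end to end; the data, the reduced column set, and the label rule (`aspect_ratio > 1`) are invented purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "area": rng.uniform(100, 10_000, n),
    "aspect_ratio": rng.uniform(0.2, 5.0, n),
    "scene_type": rng.choice(["indoor", "outdoor"], n),
})
X.loc[::25, "area"] = np.nan                    # inject a few missing values
y = (X["aspect_ratio"] > 1.0).astype(int)       # synthetic label, illustration only

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

numeric = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
pre = ColumnTransformer([
    ("num", numeric, ["area", "aspect_ratio"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["scene_type"]),
])
model = Pipeline([
    ("preprocessor", pre),
    ("classifier", RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Imputer, scaler, and encoder statistics are learned from X_train only.
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```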
Custom Sklearn Transformer
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class BBoxFeatureExtractor(BaseEstimator, TransformerMixin):
    """Extracts geometric features from raw bounding box columns."""

    def __init__(self, img_w=1920, img_h=1080):
        self.img_w = img_w
        self.img_h = img_h

    def fit(self, X, y=None):
        return self  # stateless: nothing is learned from the data

    def transform(self, X):
        # Expects exactly four columns in xmin, ymin, xmax, ymax order
        df = pd.DataFrame(X, columns=["xmin", "ymin", "xmax", "ymax"])
        w = df["xmax"] - df["xmin"]
        h = df["ymax"] - df["ymin"]
        return pd.DataFrame({
            "area": w * h,
            "aspect_ratio": w / h.clip(lower=1e-6),
            "cx": (df["xmin"] + df["xmax"]) / 2 / self.img_w,
            "cy": (df["ymin"] + df["ymax"]) / 2 / self.img_h,
            "normalized_area": (w * h) / (self.img_w * self.img_h),
        }).to_numpy()
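For a stateless transform with no constructor parameters, `FunctionTransformer` is a lighter alternative to writing a class. The sketch below is not the lab's extractor; `bbox_geometry` is a hypothetical helper that computes only area and aspect ratio from an array in `xmin, ymin, xmax, ymax` column order.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def bbox_geometry(X):
    """Return [area, aspect_ratio] per row for boxes given as xmin, ymin, xmax, ymax."""
    X = np.asarray(X, dtype=float)
    w = X[:, 2] - X[:, 0]
    h = X[:, 3] - X[:, 1]
    # Clip height to avoid division by zero on degenerate boxes.
    return np.column_stack([w * h, w / np.clip(h, 1e-6, None)])

geom = FunctionTransformer(bbox_geometry)
boxes = np.array([[0, 0, 20, 10], [5, 5, 15, 25]])
features = geom.fit_transform(boxes)
print(features)
```

A class remains the better fit once the transformer needs tunable parameters (such as `img_w` and `img_h`) that a hyperparameter search should be able to set.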
Cross-Validation Strategies
from sklearn.model_selection import (
StratifiedKFold, GroupKFold, StratifiedGroupKFold, cross_val_score
)
# Standard: stratified to preserve class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=skf, scoring="f1_macro")
# Group: no image appears in both train and val (critical for computer vision data, where boxes from one image are correlated)
gkf = GroupKFold(n_splits=5)
groups = df["image_id"].to_numpy() # each bbox belongs to an image
scores = cross_val_score(pipeline, X, y, cv=gkf, groups=groups, scoring="f1_macro")
# StratifiedGroupKFold: both stratified + group-aware
sgkf = StratifiedGroupKFold(n_splits=5)
scores = cross_val_score(pipeline, X, y, cv=sgkf, groups=groups, scoring="f1_macro")
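A quick way to confirm that a splitter really keeps images apart is to count the groups shared between train and validation indices. The sketch below does this on synthetic data (20 hypothetical images with 5 boxes each); `shared_groups` is a helper introduced here, not a scikit-learn function.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

rng = np.random.default_rng(0)
groups = np.repeat(np.arange(20), 5)      # 20 hypothetical images, 5 boxes each
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

def shared_groups(cv, **split_kwargs):
    """Sum, over all folds, the number of groups present in both train and validation."""
    total = 0
    for train_idx, val_idx in cv.split(X, y, **split_kwargs):
        total += len(set(groups[train_idx]) & set(groups[val_idx]))
    return total

# StratifiedKFold ignores image membership, so boxes from one image get split up.
leaky = shared_groups(StratifiedKFold(n_splits=5, shuffle=True, random_state=42))
# GroupKFold keeps every image entirely on one side of each split.
safe = shared_groups(GroupKFold(n_splits=5), groups=groups)
print(f"shared groups: stratified={leaky}, grouped={safe}")
```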
GridSearchCV / RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint
param_dist = {
    "classifier__n_estimators": randint(50, 300),
    "classifier__max_depth": [None, 5, 10, 20],
    "classifier__min_samples_leaf": randint(1, 20),
    "preprocessor__num__imputer__strategy": ["mean", "median"],
}
# Note: cv=5 means StratifiedKFold for classifiers and ignores image groups.
# With several boxes per image, pass a group-aware splitter as cv and
# supply groups=... to search.fit().
search = RandomizedSearchCV(
    pipeline, param_dist,
    n_iter=30, cv=5, scoring="f1_macro",
    n_jobs=-1, random_state=42, verbose=1,
)
search.fit(X_train, y_train)
# Results as DataFrame for analysis
results_df = pd.DataFrame(search.cv_results_)
results_df.sort_values("mean_test_score", ascending=False).head(10)
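The heading also names GridSearchCV, which tries every combination in a fixed grid instead of sampling. A minimal sketch on synthetic data follows; the grid values, the two-step pipeline, and the dataset from `make_classification` are chosen only to keep the example small.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=6, random_state=0)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=25, random_state=42)),
])

# 2 x 2 = 4 candidates; step-name prefixes route each parameter to its step.
param_grid = {
    "classifier__max_depth": [3, None],
    "classifier__min_samples_leaf": [1, 5],
}

grid = GridSearchCV(pipe, param_grid, cv=3, scoring="f1_macro")
grid.fit(X, y)

results = (
    pd.DataFrame(grid.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
)
print(results)
print(grid.best_params_)
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the top-ranked candidate is meaningfully better than the runner-up or within fold-to-fold noise.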
Feature Importance + SHAP
# Get feature names after pipeline transforms
feature_names = (
    numeric_features
    + pipeline["preprocessor"].transformers_[1][1]["onehot"]
    .get_feature_names_out(categoric_features).tolist()
)
importances = pipeline["classifier"].feature_importances_
feat_df = (
    pd.DataFrame({"feature": feature_names, "importance": importances})
    .sort_values("importance", ascending=False)
    .head(15)
)
Interview Questions
Q: What is data leakage? Give a concrete example in a CV context.
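A: Data leakage is when information from the test set influences training. Example: if you fit a StandardScaler on the full table of bbox features and then split, the scaler's mean/std were computed with test data, so the test distribution influenced the preprocessing. Fix: always fit preprocessing ONLY on training data. An sklearn Pipeline enforces this when the whole pipeline is fit on the training split or passed to cross_val_score / GridSearchCV, which refit it inside each fold. It cannot help if the data was transformed before the split.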
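The effect is easy to see numerically. In the sketch below (made-up values, one feature), a single extreme test sample shifts the scaler's learned mean when it is included in the fit.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[100.0]])   # an extreme held-out sample

# Leaky: statistics computed on train + test together.
leaky_scaler = StandardScaler().fit(np.vstack([train, test]))

# Correct: statistics computed on the training split only.
clean_scaler = StandardScaler().fit(train)

print(leaky_scaler.mean_, clean_scaler.mean_)
```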
Q: Why use GroupKFold instead of StratifiedKFold for object detection datasets?
A: Object detection datasets have multiple bounding boxes per image. If the same image appears in both train and val folds, the model has effectively "seen" those images during training (because features extracted from the same image are highly correlated). GroupKFold groups by image_id, ensuring all boxes from one image are in the same fold.
Q: Write a pandas operation to find the top-5 most confused class pairs from a prediction DataFrame.
confused = (
    df[df["pred_class"] != df["true_class"]]
    .groupby(["true_class", "pred_class"])
    .size()
    .sort_values(ascending=False)
    .head(5)
    .reset_index(name="count")
)
Q: A Pipeline has steps [('scaler', StandardScaler()), ('model', SVC())]. How do you access the scaler's mean_ after fitting?
pipeline.fit(X_train, y_train)
means = pipeline.named_steps["scaler"].mean_
# or
means = pipeline[0].mean_
Q: What is the difference between fit_transform() and transform() in sklearn?
A: fit_transform(X) is equivalent to fit(X).transform(X) — it learns parameters from X AND applies the transformation. transform(X) only applies previously learned parameters. Never call fit_transform on test data — always transform only.
Q: How does ColumnTransformer handle columns not listed in any transformer?
A: By default, unlisted columns are dropped (remainder='drop'). Set remainder='passthrough' to keep them as-is. You can also set remainder=SomeTransformer() to apply a specific transformation.
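A small sketch of the two `remainder` settings on a made-up frame: only `area` is listed, so `cx` is dropped by default and appended unchanged with `remainder="passthrough"`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({"area": [1.0, 2.0, 3.0], "cx": [0.1, 0.5, 0.9]})

# Default remainder="drop": the unlisted "cx" column disappears.
dropped = ColumnTransformer([("num", StandardScaler(), ["area"])]).fit_transform(X)

# remainder="passthrough": "cx" is appended after the transformed columns, untouched.
kept = ColumnTransformer(
    [("num", StandardScaler(), ["area"])], remainder="passthrough"
).fit_transform(X)

print(dropped.shape, kept.shape)
```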
Run
pip install -r requirements.txt
python solution.py
# Outputs saved to outputs/