Lab 04 — Pandas & Scikit-Learn Deep Dive

Phase 2: ML Fundamentals | Week 6

Pandas and scikit-learn underpin much of the tabular work around a computer vision (CV) production system: annotation cleaning, feature pipelines, hyperparameter search, and experiment tracking all run through these libraries. They also come up regularly in ML interviews.

Learning Objectives

Master pandas for annotation management, EDA, and experiment tracking
Build sklearn Pipeline + ColumnTransformer for reproducible feature engineering
Implement cross-validation strategies for imbalanced datasets
Run GridSearchCV / RandomizedSearchCV and analyze results
Build a full annotation analysis workflow from raw CSV to insights

Part 1: Pandas for CV Data Science

Reading & Exploring Annotation Files

import pandas as pd

df = pd.read_csv("annotations.csv")

# Shape, types, nulls
print(df.shape)
print(df.dtypes)
print(df.isnull().sum())
print(df.describe())          # stats for numeric columns
print(df["class"].value_counts())

# Column selection
bbox_cols = ["xmin", "ymin", "xmax", "ymax"]
boxes = df[bbox_cols]         # DataFrame
classes = df["class"]         # Series

Feature Engineering on Annotations

# Derived bbox features — all in one assign call (chainable)
df = df.assign(
    width      = df["xmax"] - df["xmin"],
    height     = df["ymax"] - df["ymin"],
    area       = lambda d: d["width"] * d["height"],
    aspect_ratio = lambda d: d["width"] / d["height"].clip(lower=1e-6),
    cx         = lambda d: (d["xmin"] + d["xmax"]) / 2,
    cy         = lambda d: (d["ymin"] + d["ymax"]) / 2,
    normalized_area = lambda d: d["area"] / (d["img_w"] * d["img_h"]),
)

GroupBy — The Workhorse Operation

# Per-image statistics
per_image = df.groupby("image_id").agg(
    n_objects    = ("class", "count"),
    n_classes    = ("class", "nunique"),
    mean_area    = ("area", "mean"),
    classes_list = ("class", list),
).reset_index()

# Per-class statistics
per_class = df.groupby("class").agg(
    count        = ("image_id", "count"),
    mean_area    = ("area", "mean"),
    median_conf  = ("confidence", "median"),
    images       = ("image_id", "nunique"),
).sort_values("count", ascending=False)

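# Pivot: image_id × class matrix of max confidence (0 where the class is absent);
# binarize it, e.g. (pivot > 0), as a starting point for co-occurrence analysis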
pivot = df.pivot_table(
    index="image_id", columns="class",
    values="confidence", aggfunc="max", fill_value=0
)

Joining Predictions with Ground Truth

preds = pd.read_csv("predictions.csv")   # image_id, class, confidence, bbox...
gt    = pd.read_csv("ground_truth.csv")  # image_id, class, bbox...

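# Merge on image_id only: every prediction is paired with every GT row of the
# same image (a per-image cross product, which is what IoU matching needs)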
merged = pd.merge(preds, gt, on="image_id", suffixes=("_pred", "_gt"))

# Find missed classes (FN at class level)
pred_classes = set(preds["class"].unique())
gt_classes   = set(gt["class"].unique())
missed = gt_classes - pred_classes
print(f"Classes never predicted: {missed}")

# Error analysis: largest confident false positives
# (assumes preds already carries "iou_with_gt" and "area" columns)
fp_df = preds[(preds["iou_with_gt"] < 0.5) & (preds["confidence"] > 0.7)]
fp_df.nlargest(20, "area")
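
The merge on image_id above produces one row per (prediction, ground-truth) pair within an image, which is not the same as a class-level comparison. The sketch below uses small made-up frames (the data is invented for illustration) to show the row count of that pairing, and then merges on both image_id and class with how="outer" and indicator=True so that classes present on only one side show up as left_only or right_only for each image.

```python
import pandas as pd

# Toy data, invented for illustration only
preds = pd.DataFrame({
    "image_id":   ["a", "a", "b"],
    "class":      ["car", "dog", "car"],
    "confidence": [0.9, 0.6, 0.8],
})
gt = pd.DataFrame({
    "image_id": ["a", "b", "b"],
    "class":    ["car", "car", "person"],
})

# Merging on image_id only pairs every pred row with every GT row of that image:
# image "a": 2 preds x 1 GT = 2 rows, image "b": 1 pred x 2 GT = 2 rows
pairs = pd.merge(preds, gt, on="image_id", suffixes=("_pred", "_gt"))
print(len(pairs))  # 4

# Class-level comparison per image: outer merge on (image_id, class)
pred_keys = preds[["image_id", "class"]].drop_duplicates()
gt_keys = gt[["image_id", "class"]].drop_duplicates()
status = pd.merge(pred_keys, gt_keys, on=["image_id", "class"],
                  how="outer", indicator=True)

# left_only: predicted but not in GT; right_only: in GT but never predicted
fp_classes = status[status["_merge"] == "left_only"]
fn_classes = status[status["_merge"] == "right_only"]
print(fp_classes[["image_id", "class"]])
print(fn_classes[["image_id", "class"]])
```

This only compares which classes occur in each image. Box-level true/false positives still need IoU matching on the paired rows.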

Cleaning & Validation

# Remove out-of-bounds boxes
df = df[
    (df["xmin"] >= 0) & (df["ymin"] >= 0) &
    (df["xmax"] <= df["img_w"]) & (df["ymax"] <= df["img_h"]) &
    (df["xmin"] < df["xmax"]) & (df["ymin"] < df["ymax"])
]

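# Remove tiny boxes (likely annotation noise); area is in squared pixels.
# .copy() makes df an independent frame, so the column assignment below
# does not trigger SettingWithCopyWarning.
df = df[df["area"] > 100].copy()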

# Handle missing confidence scores
df["confidence"] = df["confidence"].fillna(1.0)  # GT has no confidence → 1

# Deduplicate (exact duplicate rows)
df = df.drop_duplicates()

apply / transform for Custom Logic

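# apply: runs an arbitrary function on each group. Returning a Series, as here,
# yields a result indexed by (class, statistic); call .unstack() to get one
# column per statistic. Note: these are fractions of boxes above each IoU
# threshold, not average precision in the detection-metric sense.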
iou_stats = df.groupby("class")["iou"].apply(
    lambda x: pd.Series({
        "ap50": (x > 0.5).mean(),
        "ap75": (x > 0.75).mean(),
    })
)

# transform: returns same-length Series (useful for adding group stats to rows)
df["class_mean_area"] = df.groupby("class")["area"].transform("mean")
df["area_vs_class_mean"] = df["area"] / df["class_mean_area"]
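
To make the difference in output shape concrete, here is a minimal sketch on a toy frame (values invented for illustration; the statistic names frac_iou_50 and frac_iou_75 are illustrative, not from the lab data). apply collapses each group to a small set of statistics, while transform returns one value per original row.

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "class": ["car", "car", "dog", "dog", "dog"],
    "iou":   [0.9, 0.4, 0.8, 0.6, 0.3],
    "area":  [100.0, 300.0, 50.0, 150.0, 100.0],
})

# apply with a Series-returning function: result is indexed by (class, statistic)
stats = df.groupby("class")["iou"].apply(
    lambda x: pd.Series({
        "frac_iou_50": (x > 0.5).mean(),
        "frac_iou_75": (x > 0.75).mean(),
    })
)
stats_wide = stats.unstack()   # one row per class, one column per statistic
print(stats_wide)

# transform: same length as df, so it can be assigned straight back as a column
df["class_mean_area"] = df.groupby("class")["area"].transform("mean")
df["area_vs_class_mean"] = df["area"] / df["class_mean_area"]
print(df)
```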

Part 2: Scikit-Learn Pipelines

Why Pipelines?

A Pipeline chains preprocessing + model into one object. Benefits:

Prevents data leakage (the scaler is fit on the training data only, then applied unchanged to the test data)
One fit() / predict() call
Fully compatible with GridSearchCV

raw CSV
   │
   ▼
ColumnTransformer
   ├── numeric: [impute → scale]
   └── categorical: [impute → one-hot]
   │
   ▼
Classifier / Regressor

Building a Pipeline

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

numeric_features  = ["area", "aspect_ratio", "cx", "cy", "width", "height"]
categoric_features = ["dataset_split", "scene_type"]

numeric_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler",  StandardScaler()),
])

categorical_transformer = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    # sparse_output requires scikit-learn >= 1.2 (older releases use sparse=False)
    ("onehot",  OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])

preprocessor = ColumnTransformer([
    ("num",  numeric_transformer,  numeric_features),
    ("cat",  categorical_transformer, categoric_features),
])

pipeline = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier",   RandomForestClassifier(n_estimators=100, random_state=42)),
])

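# X_train / X_test: DataFrames containing the numeric and categorical columns above
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)   # named y_pred to avoid shadowing the preds DataFrame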

Custom Sklearn Transformer

from sklearn.base import BaseEstimator, TransformerMixin

class BBoxFeatureExtractor(BaseEstimator, TransformerMixin):
    """Extracts geometric features from raw bounding box columns."""
    def __init__(self, img_w=1920, img_h=1080):
        self.img_w = img_w
        self.img_h = img_h

    def fit(self, X, y=None):
        return self   # stateless

    def transform(self, X):
        df = pd.DataFrame(X, columns=["xmin", "ymin", "xmax", "ymax"])
        w = df["xmax"] - df["xmin"]
        h = df["ymax"] - df["ymin"]
        return pd.DataFrame({
            "area":             w * h,
            "aspect_ratio":     w / h.clip(lower=1e-6),
            "cx":               (df["xmin"] + df["xmax"]) / 2 / self.img_w,
            "cy":               (df["ymin"] + df["ymax"]) / 2 / self.img_h,
            "normalized_area":  (w * h) / (self.img_w * self.img_h),
        }).to_numpy()

Cross-Validation Strategies

from sklearn.model_selection import (
    StratifiedKFold, GroupKFold, StratifiedGroupKFold, cross_val_score
)

# Standard: stratified to preserve class proportions
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X, y, cv=skf, scoring="f1_macro")

# Group: no image appears in both train and val. Critical for computer vision
# data, where boxes from the same image are correlated (prevents leakage).
gkf = GroupKFold(n_splits=5)
groups = df["image_id"].to_numpy()   # each bbox belongs to an image
scores = cross_val_score(pipeline, X, y, cv=gkf, groups=groups, scoring="f1_macro")

# StratifiedGroupKFold: both stratified + group-aware (scikit-learn >= 1.0)
sgkf = StratifiedGroupKFold(n_splits=5)
scores = cross_val_score(pipeline, X, y, cv=sgkf, groups=groups, scoring="f1_macro")

GridSearchCV / RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint

param_dist = {
    "classifier__n_estimators":     randint(50, 300),   # integers in [50, 300)
    "classifier__max_depth":        [None, 5, 10, 20],
    "classifier__min_samples_leaf": randint(1, 20),
    "preprocessor__num__imputer__strategy": ["mean", "median"],
}

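# cv=5 uses StratifiedKFold for classifiers, which is not group-aware. For
# box-level data, pass a group-aware splitter (e.g. cv=sgkf) and call
# search.fit(X_train, y_train, groups=...) with the matching image ids.
search = RandomizedSearchCV(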
    pipeline, param_dist,
    n_iter=30, cv=5, scoring="f1_macro",
    n_jobs=-1, random_state=42, verbose=1,
)
search.fit(X_train, y_train)

# Results as DataFrame for analysis
results_df = pd.DataFrame(search.cv_results_)
results_df.sort_values("mean_test_score", ascending=False).head(10)
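
The section heading names GridSearchCV, while the example above uses RandomizedSearchCV. As a minimal sketch of the exhaustive counterpart, the block below runs a small grid on synthetic data. The data, the groups array standing in for image_id, and the reduced pipeline are assumptions for illustration, not the lab's dataset. GridSearchCV evaluates every combination in param_grid, so the number of candidates is the product of the list lengths.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, GroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for box features; groups play the role of image_id
X, y = make_classification(n_samples=120, n_features=6, random_state=0)
groups = np.repeat(np.arange(30), 4)   # 30 hypothetical images, 4 boxes each

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# 2 x 2 = 4 candidates, all of them evaluated
param_grid = {
    "classifier__max_depth": [None, 5],
    "classifier__min_samples_leaf": [1, 5],
}

search = GridSearchCV(pipe, param_grid, cv=GroupKFold(n_splits=3), scoring="f1_macro")
search.fit(X, y, groups=groups)   # groups is forwarded to the splitter

# Compare candidates by mean and spread across folds, not by the best score alone
results_df = pd.DataFrame(search.cv_results_)
cols = ["param_classifier__max_depth", "param_classifier__min_samples_leaf",
        "mean_test_score", "std_test_score", "rank_test_score"]
print(results_df[cols].sort_values("rank_test_score"))
print(search.best_params_)
```

A grid is reasonable when there are few parameters with a handful of values each. With larger or continuous search spaces, the randomized search shown earlier usually covers more ground for the same budget.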

Feature Importance + SHAP

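# Get feature names after pipeline transforms. This relies on the numeric branch
# keeping its columns one-to-one and in order, with the one-hot columns appended.
# Recent scikit-learn versions can also build the list directly with
# pipeline["preprocessor"].get_feature_names_out() (names get "num__"/"cat__" prefixes).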
feature_names = (
    numeric_features
    + pipeline["preprocessor"].transformers_[1][1]["onehot"]
               .get_feature_names_out(categoric_features).tolist()
)
importances = pipeline["classifier"].feature_importances_

feat_df = (
    pd.DataFrame({"feature": feature_names, "importance": importances})
      .sort_values("importance", ascending=False)
      .head(15)
)

Q: What is data leakage? Give a concrete example in a CV context.
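A: Data leakage occurs when information that would not be available at prediction time, typically from the validation or test set, influences training. Example: if you fit a StandardScaler on the full dataset and then split, the scaler's mean/std were computed with test data, so the test distribution influenced the preprocessing. Fix: fit preprocessing ONLY on training data. An sklearn Pipeline enforces this as long as the whole pipeline is what gets fit: on the training split, or inside cross-validation, where it is refit on each training fold.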
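
A minimal sketch of the scaler example, on random data generated purely for illustration: the scaler inside a fitted Pipeline holds statistics from the training split only, whereas a scaler fit before splitting has absorbed the test rows as well.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Random data, invented for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 3))
y = (X[:, 0] > 5.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Leaky: statistics computed on train + test together
leaky_scaler = StandardScaler().fit(X)

# Safe: the pipeline's scaler only ever sees what is passed to fit()
pipe = Pipeline([("scaler", StandardScaler()), ("model", LogisticRegression())])
pipe.fit(X_train, y_train)
safe_scaler = pipe.named_steps["scaler"]

print("train-only mean:", safe_scaler.mean_)
print("full-data mean: ", leaky_scaler.mean_)
```

On data this simple the effect on accuracy is small. The point is that the two sets of statistics differ, and only the first can be computed without access to the test set.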

Q: Why use GroupKFold instead of StratifiedKFold for object detection datasets?
A: Object detection datasets have multiple bounding boxes per image. If the same image appears in both train and val folds, the model has effectively "seen" those images during training (because features extracted from the same image are highly correlated). GroupKFold groups by image_id, ensuring all boxes from one image are in the same fold.
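
The claim can be checked directly by counting how many image ids end up on both sides of a split. The sketch below assumes a hypothetical layout of 10 images with 5 boxes each; shared_groups is an illustrative helper, not an sklearn function.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, StratifiedKFold

# Hypothetical layout: 10 images with 5 boxes each
groups = np.repeat(np.arange(10), 5)
X = np.zeros((50, 2))                  # features are irrelevant for the split itself
y = np.tile([0, 1, 0, 1, 1], 10)

def shared_groups(splitter, **split_kwargs):
    """Count image ids that land in both train and val, summed over folds."""
    total = 0
    for train_idx, val_idx in splitter.split(X, y, **split_kwargs):
        total += len(set(groups[train_idx]) & set(groups[val_idx]))
    return total

leaky = shared_groups(StratifiedKFold(n_splits=5, shuffle=True, random_state=0))
clean = shared_groups(GroupKFold(n_splits=5), groups=groups)

print("StratifiedKFold, images shared between train and val:", leaky)
print("GroupKFold, images shared between train and val:", clean)
```

GroupKFold gives zero overlap by construction. The shuffled stratified split scatters boxes from the same image across folds.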

Q: Write a pandas operation to find the top-5 most confused class pairs from a prediction DataFrame.

confused = (
    df[df["pred_class"] != df["true_class"]]
      .groupby(["true_class", "pred_class"])
      .size()
      .sort_values(ascending=False)
      .head(5)
      .reset_index(name="count")
)

Q: A Pipeline has steps [('scaler', StandardScaler()), ('model', SVC())]. How do you access the scaler's mean_ after fitting?

pipeline.fit(X_train, y_train)
means = pipeline.named_steps["scaler"].mean_
# or
means = pipeline[0].mean_

Q: What is the difference between fit_transform() and transform() in sklearn?
A: fit_transform(X) is equivalent to fit(X).transform(X) — it learns parameters from X AND applies the transformation. transform(X) only applies previously learned parameters. Never call fit_transform on test data — always transform only.
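
A minimal numeric sketch of the difference, using a two-row training set chosen so the statistics are easy to verify by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [10.0]])   # mean 5, population std 5
X_test = np.array([[20.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(X_train)   # learns mean_/scale_ from train, then applies them
test_scaled = scaler.transform(X_test)         # reuses the train statistics: (20 - 5) / 5

print(train_scaled.ravel())   # [-1.  1.]
print(test_scaled.ravel())    # [3.]

# Wrong: refitting on the test set discards the training statistics.
# A single test row becomes its own mean, so it maps to 0.
refit_scaled = StandardScaler().fit_transform(X_test)
print(refit_scaled.ravel())   # [0.]
```

The test point is three training standard deviations above the training mean, which is exactly the information the model needs. Refitting on the test set erases it.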

Q: How does ColumnTransformer handle columns not listed in any transformer?
A: By default, unlisted columns are dropped (remainder='drop'). Set remainder='passthrough' to keep them as-is. You can also set remainder=SomeTransformer() to apply a specific transformation.
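
A short sketch of both behaviours on a toy frame. The track_id column is a hypothetical example of a column that no transformer lists.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy data, invented for illustration only
X = pd.DataFrame({
    "area":     [100.0, 200.0, 300.0],
    "cx":       [0.1, 0.5, 0.9],
    "track_id": [7, 8, 9],   # hypothetical column not listed in any transformer
})

drop_ct = ColumnTransformer([("num", StandardScaler(), ["area", "cx"])])
keep_ct = ColumnTransformer(
    [("num", StandardScaler(), ["area", "cx"])],
    remainder="passthrough",
)

dropped = drop_ct.fit_transform(X)   # default remainder="drop": track_id is gone
kept = keep_ct.fit_transform(X)      # track_id is appended after the transformed columns

print(dropped.shape)   # (3, 2)
print(kept.shape)      # (3, 3)
print(kept[:, -1])     # [7. 8. 9.]
```

Passed-through columns are not scaled or encoded, so an id-like column such as this one would usually be dropped rather than fed to a model.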

Run

pip install -r requirements.txt
python solution.py
# Outputs saved to outputs/
