Lab 01: Scikit-learn Pipelines & Classical ML

What You'll Learn

Build production-quality ML pipelines with sklearn.pipeline.Pipeline
Understand SVMs deeply (kernel trick, RBF kernel, support vectors)
Random Forest: bagging, feature importance, out-of-bag error
Hyperparameter tuning: GridSearchCV vs RandomizedSearchCV
Proper cross-validation to avoid data leakage

SVM Theory

Linear SVM

Find the hyperplane $\mathbf{w}^T \mathbf{x} + b = 0$ that maximizes the margin:

$$\text{margin} = \frac{2}{\|\mathbf{w}\|}$$

Subject to: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i$

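Maximizing the margin is equivalent to minimizing $\frac{1}{2}\|\mathbf{w}\|^2$, a convex quadratic program, so the optimum found is guaranteed to be global. The dual form:

$$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j$$

Subject to: $\alpha_i \geq 0 \quad \forall i$ and $\sum_i \alpha_i y_i = 0$

Only the support vectors end up with $\alpha_i > 0$. Both the dual objective and the prediction $f(\mathbf{x}) = \text{sign}\left(\sum_i \alpha_i y_i \mathbf{x}_i^T \mathbf{x} + b\right)$ depend on the data only through dot products. This is the key insight behind the kernel trick.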
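The dependence on dot products can be checked numerically. The following is a minimal sketch, assuming scikit-learn's SVC with a linear kernel and a synthetic dataset from make_classification (the dataset and variable names are illustrative, not part of the lab). It rebuilds the decision values from the fitted support vectors alone.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic binary problem (illustrative assumption, not the lab's dataset)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# dual_coef_ stores alpha_i * y_i for each support vector (shape: 1 x n_SV).
# f(x) = sum_i alpha_i y_i <x_i, x> + b, so only dot products are needed.
gram = clf.support_vectors_ @ X.T  # dot products between support vectors and inputs
manual_decision = (clf.dual_coef_ @ gram).ravel() + clf.intercept_[0]

print(np.allclose(manual_decision, clf.decision_function(X)))
print("support vectors:", clf.support_vectors_.shape[0], "of", X.shape[0])
```

Points that are not support vectors never enter the sum, which is why the fitted model only needs to store the support vectors.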

Kernel Trick

Replace the dot product with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ that implicitly computes a dot product in a high-dimensional feature space:

$$K_{\text{RBF}}(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right)$$

Writing $\gamma = \frac{1}{2\sigma^2}$ gives the equivalent form used by scikit-learn's gamma parameter: $K(\mathbf{x}, \mathbf{z}) = e^{-\gamma \|\mathbf{x}-\mathbf{z}\|^2}$

$\gamma$ large → narrow Gaussians → complex decision boundary (overfitting risk)
$\gamma$ small → smooth decision boundary (underfitting risk)
$C$ large → approaches a hard margin (margin violations are penalized heavily, overfitting risk)
$C$ small → soft margin (more violations are tolerated, which acts as stronger regularization)
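The effect of $\gamma$ can be seen by comparing train and test accuracy. This is a rough sketch, assuming a noisy two-moons dataset from make_moons and three arbitrary gamma values chosen for illustration; exact numbers will vary with the data and the scikit-learn version.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-moons data (illustrative assumption)
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

results = {}
for gamma in (0.01, 1.0, 1000.0):
    clf = SVC(kernel="rbf", gamma=gamma, C=1.0).fit(X_tr, y_tr)
    # Store (train accuracy, test accuracy) for each gamma
    results[gamma] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"gamma={gamma:<7} train={results[gamma][0]:.2f} test={results[gamma][1]:.2f}")
```

A very large gamma should fit the training set almost perfectly while the gap to test accuracy widens; a very small gamma gives a boundary that is close to linear. The same loop can be repeated over C with gamma fixed.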

Interview: Why is the kernel trick efficient?

The explicit feature map for RBF is infinite-dimensional — you can't compute it directly. The kernel function computes the dot product in that space in $O(d)$ time (where $d$ is input dimensionality), instead of computing the infinite vector and taking a dot product.
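Because the RBF feature map cannot be written out, the sketch below swaps in the homogeneous degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T \mathbf{z})^2$, whose feature map is finite, to illustrate the same idea. The helper phi and the dimension d = 50 are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
x, z = rng.normal(size=d), rng.normal(size=d)

def phi(v):
    """Explicit feature map for K(x, z) = (x . z)^2: all d*d pairwise products."""
    return np.outer(v, v).ravel()

explicit = phi(x) @ phi(z)  # O(d^2) work and memory
kernel = (x @ z) ** 2       # O(d) work, no feature vector is ever built

print(phi(x).shape)         # (2500,)
print(np.isclose(explicit, kernel))
```

The explicit map already needs $d^2$ entries at degree 2; for RBF the corresponding vector has no finite length, so the kernel evaluation is the only practical route.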

Random Forest Theory

Random Forests build $T$ decision trees, each trained on:

A bootstrap sample (random sample with replacement) of the training set
At each split: only a random subset of features is considered (typically $\sqrt{d}$ for classification)

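Out-of-bag (OOB) error: each bootstrap sample leaves out about 37% of the training samples, since a given sample is missed with probability $(1 - 1/n)^n \approx e^{-1}$. These left-out samples form a natural validation set for that tree. Predicting each sample using only the trees that never saw it, then scoring those predictions, gives a free estimate of test error that is close to unbiased (it tends to be slightly pessimistic, because each sample is scored by only part of the forest).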
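A minimal sketch of reading the OOB estimate from scikit-learn, assuming a synthetic dataset from make_classification and an arbitrary forest size of 300 trees. It also checks the 37% figure directly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Probability that a given sample is missing from a bootstrap of size n
n = 1000
p_out = (1 - 1 / n) ** n
print(f"left out of one bootstrap: {p_out:.3f}")

# Synthetic data (illustrative assumption)
X, y = make_classification(n_samples=n, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# oob_score=True scores each training sample with the trees that did not see it
rf = RandomForestClassifier(
    n_estimators=300, max_features="sqrt", oob_score=True, random_state=0
)
rf.fit(X_tr, y_tr)

print(f"OOB accuracy:  {rf.oob_score_:.3f}")
print(f"test accuracy: {rf.score(X_te, y_te):.3f}")
```

Note that oob_score_ is an accuracy, so the OOB error is 1 - oob_score_. The two printed accuracies should be close, without any data having been held out for the OOB figure.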

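Feature importance: For feature $j$, average the decrease in Gini impurity over all splits on $j$ across all trees. This impurity-based measure is computed on the training data and tends to favor high-cardinality features. More reliable: permutation importance (shuffle feature $j$ on held-out data, measure the performance drop).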
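Both measures can be compared side by side. This is a sketch under stated assumptions: a synthetic dataset in which only the first three columns are informative (make_classification with shuffle=False), and scikit-learn's permutation_importance evaluated on a held-out split.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2; the rest are noise
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)

gini_imp = rf.feature_importances_  # impurity-based, derived from training data
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

for j in range(X.shape[1]):
    print(f"feature {j}: gini={gini_imp[j]:.3f}  permutation={perm.importances_mean[j]:.3f}")
```

Impurity-based importances always sum to 1, so noise features still receive some share. Permutation importances for noise features should sit near zero and can be slightly negative.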

Data Leakage (Critical Concept)

Wrong (leakage):

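from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler

# A plain train/test split is fine: the scaler is fit on the training set only
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# But scaling BEFORE cross-validation is WRONG:
X_scaled = StandardScaler().fit_transform(X)  # fit on ALL of X
scores = cross_val_score(svm, X_scaled, y, cv=5)
# ^ every held-out fold contributed to the mean/std used to scale its training folds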

Correct (Pipeline prevents leakage):

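from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ('scaler', StandardScaler()),  # re-fit on the training folds inside each CV split
    ('svm', SVC(kernel='rbf'))
])
scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(5))
# The pipeline fits the scaler on the training folds only, then applies it to the held-out fold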
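The same pipeline can be passed to a hyperparameter search, so scaling stays inside every fold during tuning as well. The sketch below compares GridSearchCV and RandomizedSearchCV under assumptions that are not from the lab: a synthetic dataset, an arbitrary 3x3 grid, and log-uniform sampling ranges from scipy.stats.loguniform.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data (illustrative assumption)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC(kernel="rbf"))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Step parameters are addressed as <step name>__<parameter>
grid = GridSearchCV(
    pipe, {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]}, cv=cv
)
grid.fit(X, y)

# Same budget (9 candidates), but sampled from continuous log-uniform ranges
rand = RandomizedSearchCV(
    pipe,
    {"svm__C": loguniform(1e-2, 1e2), "svm__gamma": loguniform(1e-3, 1e1)},
    n_iter=9, cv=cv, random_state=0,
)
rand.fit(X, y)

print("grid:  ", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

Grid search evaluates every combination, so its cost grows multiplicatively with each added parameter. Randomized search keeps the number of candidates fixed at n_iter and explores each parameter's range more finely, which often makes it the more practical choice when there are many hyperparameters.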

Interview Questions

Q: SVM vs Logistic Regression — when to use which?

A: SVMs are preferred when: (1) the dataset is small-medium with complex non-linear boundaries (use RBF kernel), (2) you need a maximum margin classifier, (3) the data is high-dimensional and sparse (text, gene expression; a linear SVM works well). Logistic Regression is preferred when: (1) you need calibrated probabilities, (2) the dataset is very large (SGD-based training scales roughly linearly in the number of samples, while kernel SVM training scales at least quadratically), (3) interpretability matters (weights are directly interpretable), (4) you need online learning.

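Q: How does cross-validation help against overfitting compared to a single train/test split?

A: A single split gives a high-variance estimate: you might get "lucky" or "unlucky" with which samples end up in test. K-fold CV uses K different splits and averages the metric, which reduces the variance of the estimate (by less than a factor of K, because the folds share training data and their scores are correlated). CV does not prevent overfitting by itself; it gives a more reliable estimate, so an overfit model is less likely to be selected by chance. Stratified K-fold ensures each fold has the same class proportions as the full dataset, which matters for imbalanced classes.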
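The difference in stability can be simulated by repeating both procedures with different random seeds. This is a sketch, assuming a synthetic imbalanced dataset and a LogisticRegression model picked only for speed; neither comes from the lab.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Imbalanced synthetic data: roughly 80% / 20% class split (illustrative assumption)
X, y = make_classification(n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0)
model = LogisticRegression(max_iter=1000)

single_scores, cv_means = [], []
for seed in range(20):
    # One 80/20 split: the estimate depends on which 60 samples land in test
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    single_scores.append(model.fit(X_tr, y_tr).score(X_te, y_te))

    # Stratified 5-fold: every sample is used for testing exactly once
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    cv_means.append(cross_val_score(model, X, y, cv=cv).mean())

print(f"single split: mean={np.mean(single_scores):.3f} std={np.std(single_scores):.3f}")
print(f"5-fold mean:  mean={np.mean(cv_means):.3f} std={np.std(cv_means):.3f}")
```

The two means should be similar, while the spread across seeds should be noticeably smaller for the 5-fold average.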

Q: Explain the bias-variance tradeoff in Random Forests.

A: Individual decision trees have high variance (they overfit to their training data) but low bias. Random Forest reduces variance through averaging: for $T$ independent trees, Var(mean of $T$) = Var(single)/$T$. Trees trained on bootstrap samples of the same data are correlated rather than independent, and the random feature selection at each split decorrelates them, which brings the forest closer to that ideal. Bias stays roughly the same (averages of low-bias models are still low-bias). As $T \to \infty$, the generalization error converges to a limit, so adding more trees does not cause overfitting; it only gives diminishing returns.
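The role of correlation can be made explicit with a short calculation. As a sketch under a simplifying assumption not stated above (every tree's prediction $f_t(\mathbf{x})$ has the same variance $\sigma^2$, and every pair of trees has the same correlation $\rho$), the variance of the forest average is:

$$\text{Var}\left(\frac{1}{T}\sum_{t=1}^{T} f_t(\mathbf{x})\right) = \frac{1}{T^2}\left[T\sigma^2 + T(T-1)\rho\sigma^2\right] = \rho\sigma^2 + \frac{1-\rho}{T}\sigma^2$$

For example, with $\sigma^2 = 1$, $\rho = 0.3$ and $T = 100$ this gives $0.3 + 0.007 = 0.307$, compared with $0.01$ for independent trees ($\rho = 0$). The first term does not shrink as $T$ grows, which is why lowering $\rho$ through random feature selection matters more than adding trees once $T$ is large.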
