Lab 01: Scikit-learn Pipelines & Classical ML
What You'll Learn
- Build production-quality ML pipelines with sklearn.pipeline.Pipeline
- Understand SVMs deeply (kernel trick, RBF kernel, support vectors)
- Random Forest: bagging, feature importance, out-of-bag error
- Hyperparameter tuning: GridSearchCV vs RandomizedSearchCV
- Proper cross-validation to avoid data leakage
SVM Theory
Linear SVM
Find the hyperplane $\mathbf{w}^T \mathbf{x} + b = 0$ that maximizes the margin:
$$\text{margin} = \frac{2}{\|\mathbf{w}\|}$$
Subject to: $y_i(\mathbf{w}^T \mathbf{x}_i + b) \geq 1 \quad \forall i$
This is a convex quadratic program, so any optimum the solver finds is the global optimum. The dual form:
$$\max_\alpha \sum_i \alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j \mathbf{x}_i^T \mathbf{x}_j \quad \text{subject to } \alpha_i \geq 0, \; \sum_i \alpha_i y_i = 0$$
The prediction only depends on dot products $\mathbf{x}_i^T \mathbf{x}_j$ — this is the key insight for the kernel trick.
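As a minimal sketch of the points above, the snippet below fits a linear SVM on a hand-made, linearly separable toy set (the data and the very large `C` used to approximate a hard margin are assumptions for illustration). It recovers the margin $2/\|\mathbf{w}\|$ and rebuilds the decision values from dot products with the support vectors alone.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (illustrative only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([-1, -1, -1, 1, 1, 1])

# A very large C approximates the hard-margin problem.
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w = clf.coef_[0]
b = clf.intercept_[0]
margin = 2.0 / np.linalg.norm(w)

# Constraint check: y_i (w^T x_i + b) >= 1, with equality at the support vectors.
functional = y * (X @ w + b)

# Dual view: dual_coef_ holds alpha_i * y_i for the support vectors, so the
# decision value needs only dot products between support vectors and x.
manual = (clf.dual_coef_ @ (clf.support_vectors_ @ X.T)).ravel() + b

print("margin:", round(margin, 3))
print("support vectors:", clf.support_vectors_.tolist())
```

Only the support vectors carry nonzero $\alpha_i$, which is why `manual` matches `clf.decision_function(X)` without touching the other training points.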
Kernel Trick
Replace the dot product with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j)$ that implicitly computes a dot product in a high-dimensional feature space:
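$$K_{\text{RBF}}(\mathbf{x}, \mathbf{z}) = \exp\left(-\frac{\|\mathbf{x} - \mathbf{z}\|^2}{2\sigma^2}\right)$$
Equivalently, with $\gamma = \frac{1}{2\sigma^2}$ (the parameterization scikit-learn exposes as gamma), the RBF kernel is $K(\mathbf{x}, \mathbf{z}) = e^{-\gamma \|\mathbf{x}-\mathbf{z}\|^2}$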
- $\gamma$ large → narrow Gaussians → complex decision boundary (overfitting risk)
- $\gamma$ small → smooth decision boundary (underfitting risk)
- $C$ large → approaches a hard margin (misclassifications are penalized more, overfitting risk)
- $C$ small → soft margin (allow more violations for better generalization)
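The effect of $\gamma$ described above can be seen by sweeping it on a small synthetic dataset. This is a sketch only: the `make_moons` data, the noise level, the fixed `C=1.0`, and the three `gamma` values are assumptions picked for illustration, not tuned settings.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Noisy two-class toy data (illustrative assumption).
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

results = {}
for gamma in (0.01, 1.0, 100.0):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
    # Store (train accuracy, test accuracy) for each gamma.
    results[gamma] = (clf.score(X_train, y_train), clf.score(X_test, y_test))

for gamma, (train_acc, test_acc) in results.items():
    print(f"gamma={gamma:<6} train={train_acc:.3f} test={test_acc:.3f}")
```

A widening gap between train and test accuracy at large `gamma` is the overfitting signature; a low score on both at small `gamma` points to underfitting.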
Interview: Why is the kernel trick efficient?
The explicit feature map for RBF is infinite-dimensional — you can't compute it directly. The kernel function computes the dot product in that space in $O(d)$ time (where $d$ is input dimensionality), instead of computing the infinite vector and taking a dot product.
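The RBF feature map cannot be written out, so the sketch below swaps in a degree-2 polynomial kernel $K(\mathbf{x}, \mathbf{z}) = (\mathbf{x}^T\mathbf{z})^2$, whose explicit map is finite, to show the same idea. The helper names `phi` and `poly2_kernel` are hypothetical, chosen for this example.

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map: all d*d pairwise products x_i * x_j.
    return np.outer(x, x).ravel()

def poly2_kernel(x, z):
    # Same inner product, computed in O(d) without ever building phi.
    return np.dot(x, z) ** 2

rng = np.random.default_rng(0)
x, z = rng.normal(size=5), rng.normal(size=5)

explicit = np.dot(phi(x), phi(z))   # O(d^2) work in feature space
implicit = poly2_kernel(x, z)       # O(d) work in input space

print(phi(x).shape, np.isclose(explicit, implicit))
```

For the polynomial kernel the saving is from $O(d^2)$ to $O(d)$; for RBF the explicit route is not available at all, so the kernel evaluation is the only practical option.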
Random Forest Theory
Random Forests build $T$ decision trees, each trained on:
- A bootstrap sample (random sample with replacement) of the training set
- At each split: only a random subset of the features is considered (commonly $\sqrt{d}$ for classification)
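Out-of-bag (OOB) error: each bootstrap sample leaves out about 37% of the training samples, because the probability that a given sample is never drawn in $n$ draws with replacement is $(1 - 1/n)^n \approx e^{-1} \approx 0.368$. These left-out samples form a natural validation set for that tree. Predicting each sample using only the trees that never saw it, then averaging the error over all samples, gives a free estimate of test error that is close to unbiased (it can be slightly pessimistic, since each prediction uses only part of the forest).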
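A minimal sketch of OOB estimation with scikit-learn follows, assuming a synthetic `make_classification` dataset and 200 trees (both illustrative choices). It also checks the ~37% figure numerically and compares the OOB accuracy with accuracy on a separate held-out set.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

n = 1000
# Probability that a given sample is never drawn in n draws with replacement.
p_never_drawn = (1 - 1 / n) ** n

X, y = make_classification(
    n_samples=n, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

rf = RandomForestClassifier(
    n_estimators=200, max_features="sqrt", oob_score=True, random_state=0
)
rf.fit(X_train, y_train)

oob_acc = rf.oob_score_            # accuracy from out-of-bag predictions
test_acc = rf.score(X_test, y_test)

print(f"P(never drawn) = {p_never_drawn:.4f}, exp(-1) = {np.exp(-1):.4f}")
print(f"OOB accuracy = {oob_acc:.3f}, held-out accuracy = {test_acc:.3f}")
```

The two accuracies are usually close, which is why the OOB score is often used when holding out extra data is expensive.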
Feature importance: For feature $j$, average the decrease in Gini impurity across all splits on $j$ across all trees. More reliable: permutation importance (shuffle feature $j$, measure performance drop).
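Both importance measures can be compared side by side. The sketch below assumes a synthetic dataset in which, because of `shuffle=False`, the first three columns are the informative ones and the rest are noise; the dataset and hyperparameters are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# shuffle=False keeps the 3 informative features in columns 0-2.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=3, n_redundant=0,
    shuffle=False, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Impurity-based (Gini) importance, computed from the training data.
gini = rf.feature_importances_

# Permutation importance, measured on held-out data:
# shuffle one column at a time and record the drop in accuracy.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)

for j in range(X.shape[1]):
    print(f"feature {j}: gini={gini[j]:.3f}  permutation={perm.importances_mean[j]:.3f}")
```

Computing permutation importance on held-out data ties the score to generalization performance rather than to how often a feature was used for splitting during training.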
Data Leakage (Critical Concept)
Wrong (leakage):
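from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# With a single train/test split, fitting the scaler on train only is fine:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# But scaling ALL of X before cross-validation is WRONG:
X_scaled = StandardScaler().fit_transform(X)
scores = cross_val_score(SVC(kernel='rbf'), X_scaled, y, cv=5)
# ^ the scaler was fit on ALL of X, including every held-out fold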
Correct (Pipeline prevents leakage):
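from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([
    ('scaler', StandardScaler()),  # fit on the training fold only inside CV
    ('svm', SVC(kernel='rbf')),
])
scores = cross_val_score(pipe, X, y, cv=StratifiedKFold(5))
# The pipeline refits the scaler on each training fold, so held-out folds stay unseen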
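The same pipeline can be passed to a hyperparameter search, so scaling stays inside each CV fold during tuning as well. The sketch below contrasts GridSearchCV with RandomizedSearchCV; the synthetic dataset, the parameter ranges, and the budget of nine candidates are assumptions for illustration, not recommended values.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

pipe = Pipeline([("scaler", StandardScaler()), ("svm", SVC(kernel="rbf"))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Grid search: evaluates every combination (3 x 3 = 9 candidates).
# Step parameters are addressed as "<step name>__<parameter>".
grid = GridSearchCV(
    pipe,
    {"svm__C": [0.1, 1, 10], "svm__gamma": [0.01, 0.1, 1]},
    cv=cv,
)
grid.fit(X, y)

# Randomized search: samples 9 candidates from continuous log-uniform ranges.
rand = RandomizedSearchCV(
    pipe,
    {"svm__C": loguniform(1e-2, 1e2), "svm__gamma": loguniform(1e-3, 1e1)},
    n_iter=9,
    cv=cv,
    random_state=0,
)
rand.fit(X, y)

print("grid:  ", grid.best_params_, round(grid.best_score_, 3))
print("random:", rand.best_params_, round(rand.best_score_, 3))
```

With the same budget, the grid covers only three distinct values per parameter, while random sampling tries nine distinct values of each, which tends to help when only one or two parameters matter.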
Interview Questions
Q: SVM vs Logistic Regression — when to use which?
A: SVMs are preferred when: (1) the dataset is small to medium with complex non-linear boundaries (use the RBF kernel), (2) you need a maximum-margin classifier, (3) the data is high-dimensional and sparse (text, gene expression), where a linear SVM works well. Logistic Regression is preferred when: (1) you need calibrated probabilities, (2) the dataset is very large (SGD-based training scales far better than kernel SVM training, whose cost grows at least quadratically with the number of samples), (3) interpretability matters (weights are directly interpretable), (4) you need online learning.
Q: How does cross-validation prevent overfitting compared to a single train/test split?
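A: Strictly, cross-validation does not prevent overfitting; it gives a more reliable estimate of generalization, which makes overfitting easier to detect. A single split has high variance: you might get "lucky" or "unlucky" with which samples end up in test. K-fold CV evaluates on K different held-out folds and averages the metric, which reduces the variance of the estimate, though by less than a factor of K because the folds share most of their training data and their scores are correlated. Stratified K-fold ensures each fold has approximately the same class proportions as the full dataset, which matters for imbalanced classes.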
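A small sketch of why stratification matters, assuming an imbalanced toy label vector (90 negatives followed by 10 positives) and unshuffled splitters; the data is invented for illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# Imbalanced toy labels, sorted by class (illustrative assumption).
y = np.array([0] * 90 + [1] * 10)
X = np.zeros((100, 1))  # features are irrelevant for the split itself

# Fraction of positives in each held-out fold.
strat = [float(y[test].mean()) for _, test in StratifiedKFold(n_splits=5).split(X, y)]
plain = [float(y[test].mean()) for _, test in KFold(n_splits=5).split(X, y)]

print("StratifiedKFold positive rate per fold:", strat)
print("KFold positive rate per fold:          ", plain)
```

With plain `KFold` on sorted data, four of the five folds contain no positives at all, so any metric computed on them says nothing about the minority class.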
Q: Explain the bias-variance tradeoff in Random Forests.
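A: Individual decision trees have low bias but high variance (they overfit their training data). Random Forest reduces variance through averaging: for $T$ independent trees, Var(mean of T) = Var(single)/T. Trees trained on overlapping bootstrap samples are not independent, so with pairwise correlation $\rho$ and per-tree variance $\sigma^2$ the variance of the average is $\rho\sigma^2 + (1-\rho)\sigma^2/T$, which levels off at $\rho\sigma^2$ however many trees are added. The random feature selection at each split decorrelates the trees (lowers $\rho$), which is what keeps averaging effective. Bias stays roughly the same (averages of low-bias models are still low-bias). As T→∞, the generalization error converges to a limit, so adding more trees does not cause overfitting.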
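The variance argument can be checked with a short simulation. This is a sketch under simplifying assumptions: each "tree prediction" is modeled as a unit-variance Gaussian with equal pairwise correlation `rho`, and the helper `var_of_mean` is a hypothetical name for this example. The simulated variance is compared with the standard formula $\rho\sigma^2 + (1-\rho)\sigma^2/T$ for $\sigma^2 = 1$.

```python
import numpy as np

def var_of_mean(rho, T, n_trials=200_000, seed=0):
    """Empirical variance of the average of T unit-variance predictions
    that share pairwise correlation rho."""
    rng = np.random.default_rng(seed)
    shared = rng.normal(size=(n_trials, 1))   # component common to all trees
    own = rng.normal(size=(n_trials, T))      # component unique to each tree
    preds = np.sqrt(rho) * shared + np.sqrt(1 - rho) * own
    return preds.mean(axis=1).var()

T = 10
for rho in (0.0, 0.5):
    theory = rho + (1 - rho) / T
    print(f"rho={rho}: simulated={var_of_mean(rho, T):.3f}  theory={theory:.3f}")
```

With `rho=0` the variance drops to $1/T$, while with `rho=0.5` it cannot fall below 0.5 no matter how large $T$ gets, which is why decorrelating the trees matters as much as adding more of them.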