Lab 03 — Math for ML: Linear Algebra, Calculus & Probability
Phase: 0 — Foundations | Difficulty: ⭐⭐⭐⭐☆
Files: lab.py, solution.py
Linear Algebra
Dot Product & Matrix Multiplication
The forward pass of every neural network is a series of matrix multiplications:
$$\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$$
For a batch of inputs $X \in \mathbb{R}^{N \times D_{in}}$ and weight matrix $W \in \mathbb{R}^{D_{in} \times D_{out}}$:
$$Y = XW + \mathbf{b} \quad \in \mathbb{R}^{N \times D_{out}}$$
Geometric interpretation: A linear layer projects inputs from $\mathbb{R}^{D_{in}}$ to $\mathbb{R}^{D_{out}}$ — a change of basis. The weight matrix $W$ encodes this transformation.
Eigendecomposition
For a square matrix $A$: $$A\mathbf{v} = \lambda \mathbf{v}$$
$\mathbf{v}$ is an eigenvector, $\lambda$ is the corresponding eigenvalue.
The full decomposition: $A = Q \Lambda Q^{-1}$ where $Q$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues.
For symmetric matrices (covariance matrices are always symmetric): $A = Q \Lambda Q^T$ (orthogonal $Q$).
CV Applications:
- PCA (Eigenfaces): Eigenvectors of the face covariance matrix are "eigenfaces". The top $k$ eigenvectors capture the most variance in face images.
- Harris corner detector: Uses eigenvalues of the structure tensor $M$:
- $\lambda_1 \approx \lambda_2 \gg 0$: corner
- $\lambda_1 \gg \lambda_2 \approx 0$: edge
- $\lambda_1 \approx \lambda_2 \approx 0$: flat region
Singular Value Decomposition (SVD)
$$M = U \Sigma V^T$$
Unlike eigendecomposition, SVD works for any matrix (not just square/symmetric).
- $U \in \mathbb{R}^{m \times m}$: left singular vectors (orthonormal)
- $\Sigma \in \mathbb{R}^{m \times n}$: diagonal, singular values $\sigma_1 \geq \sigma_2 \geq \ldots \geq 0$
- $V^T \in \mathbb{R}^{n \times n}$: right singular vectors (orthonormal)
Relationship to eigendecomposition:
- Columns of $U$ = eigenvectors of $MM^T$
- Columns of $V$ = eigenvectors of $M^TM$
- $\sigma_i = \sqrt{\lambda_i(M^TM)}$
Low-rank approximation: $M_k = U_k \Sigma_k V_k^T$ minimizes the Frobenius norm $|M - M_k|_F$ over all rank-$k$ matrices. (Eckart–Young theorem)
Norms
| Norm | Formula | Use in CV/ML |
|---|---|---|
| L1 ($\ell_1$) | $\sum_i | x_i |
| L2 ($\ell_2$) | $\sqrt{\sum_i x_i^2}$ | Ridge regression, weight decay, distance between embeddings |
| L∞ | $\max_i | x_i |
| Frobenius | $\sqrt{\sum_{i,j} A_{ij}^2}$ | Matrix regularization |
L2 normalization of feature vectors (used before cosine similarity): $$\hat{\mathbf{v}} = \frac{\mathbf{v}}{|\mathbf{v}|_2}$$
Then cosine similarity: $\cos(\theta) = \hat{\mathbf{u}} \cdot \hat{\mathbf{v}}$ — no division needed.
Covariance Matrix & PCA
For a dataset $X \in \mathbb{R}^{N \times D}$ (zero-centered):
$$\Sigma = \frac{1}{N-1} X^T X \in \mathbb{R}^{D \times D}$$
PCA: Eigendecompose $\Sigma = Q \Lambda Q^T$, project $X$ onto top-$k$ eigenvectors:
$$Z = X Q_k \in \mathbb{R}^{N \times k}$$
This is equivalent to: $X = U \Sigma V^T \Rightarrow$ top-$k$ principal components are columns of $V$ (or rows of $V^T$).
Calculus for Neural Networks
Chain Rule (The Core of Backpropagation)
For a composition $f(g(x))$: $$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$$
For multivariable functions (the chain rule that powers backprop): $$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$
Backpropagation example:
Given: $L = \text{MSE}(\mathbf{y}, \hat{\mathbf{y}})$, $\hat{\mathbf{y}} = \sigma(\mathbf{z})$, $\mathbf{z} = W\mathbf{x} + \mathbf{b}$
$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial W}$$
Each term:
- $\frac{\partial L}{\partial \hat{y}} = \frac{2}{N}(\hat{y} - y)$ (MSE gradient)
- $\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1-\sigma(z))$ (sigmoid gradient)
- $\frac{\partial z}{\partial W} = \mathbf{x}^T$ (linear layer gradient)
Gradient Descent
$$W \leftarrow W - \eta \cdot \nabla_W L$$
Why it works: The gradient $\nabla_W L$ points in the direction of steepest ascent. Subtracting it moves toward a local minimum.
Variants:
- SGD: Update with one sample (or mini-batch). Noisy but escapes local minima.
- Momentum: $v \leftarrow \beta v + (1-\beta)\nabla L$, $W \leftarrow W - \eta v$
- Adam: Adaptive learning rate per parameter. $m \leftarrow \beta_1 m + (1-\beta_1)\nabla L$, $v \leftarrow \beta_2 v + (1-\beta_2)\nabla L^2$. Corrected: $\hat{m} = m/(1-\beta_1^t)$, $W \leftarrow W - \eta \hat{m}/(\sqrt{\hat{v}}+\epsilon)$
Convolution (Mathematical Definition)
Discrete 2D convolution of image $I$ with kernel $K$: $$(I * K)[i,j] = \sum_m \sum_n I[i-m, j-n] \cdot K[m,n]$$
In deep learning, what's called "convolution" is actually cross-correlation: $$(I \star K)[i,j] = \sum_m \sum_n I[i+m, j+n] \cdot K[m,n]$$
(The kernel is not flipped, unlike true convolution. The network learns to flip weights during training if needed.)
Output size formula: For input $(H, W)$, kernel $(k, k)$, padding $p$, stride $s$: $$H_{out} = \lfloor \frac{H + 2p - k}{s} \rfloor + 1$$
Probability for ML
Bayes' Theorem
$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$
In ML terms: $$P(\text{class}|\text{image}) = \frac{P(\text{image}|\text{class}) \cdot P(\text{class})}{P(\text{image})}$$
- $P(\text{class})$: prior — class distribution in training data
- $P(\text{image}|\text{class})$: likelihood — how likely is this image from this class
- $P(\text{class}|\text{image})$: posterior — what the classifier outputs
Class imbalance = a poorly calibrated prior. Techniques to fix: class weights in loss, oversampling, focal loss.
Entropy, Cross-Entropy, KL Divergence
Entropy (information content of distribution $P$): $$H(P) = -\sum_i P(i) \log P(i)$$
Cross-entropy (how well $Q$ approximates $P$): $$H(P, Q) = -\sum_i P(i) \log Q(i)$$
Classification loss = cross-entropy between one-hot label $P$ and softmax output $Q$: $$L = -\sum_i y_i \log \hat{p}i$$ For single label (one-hot): $L = -\log \hat{p}{true}$
KL Divergence (asymmetric "distance" between distributions): $$D_{KL}(P | Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} = H(P,Q) - H(P)$$
Used in VAEs, knowledge distillation, and feature alignment.
Interview Questions
Q: Explain SVD and name 3 applications in computer vision.
A: SVD decomposes matrix $M = U\Sigma V^T$ into two orthogonal bases and a diagonal scaling. Applications: (1) Image compression — rank-$k$ approximation via top $k$ singular values. (2) PCA / Eigenfaces — SVD of the centered data matrix gives principal components. (3) Essential/Fundamental matrix computation — the 8-point algorithm solves for the Fundamental matrix $F$ via SVD; the essential matrix $E$ is further constrained by enforcing the two smallest singular values are equal and the largest is 1.
Q: How does gradient descent find a minimum? What can go wrong?
A: Gradient descent iteratively moves parameters in the direction opposite to the gradient, shrinking loss. Problems: (1) Local minima / saddle points — in high-dimensional spaces, saddle points are more common than local minima; stochastic noise helps escape them. (2) Vanishing gradients — gradients near zero prevent early layers from learning; fixed by ReLU activations and residual connections. (3) Exploding gradients — large gradients cause divergence; fixed by gradient clipping. (4) Poor learning rate — too high diverges, too low is very slow; adaptive optimizers (Adam) mitigate this.
Q: Why is cross-entropy used as the classification loss rather than MSE?
A: Cross-entropy aligns with the probabilistic interpretation (maximizing log-likelihood of correct class). For classification, MSE penalizes equally everywhere, but we want to penalize confident wrong predictions very heavily. Cross-entropy with softmax: gradient is $\hat{p} - y$ — simple and well-scaled. MSE with sigmoid produces vanishing gradients when predictions are saturated (near 0 or 1), making early training very slow.