Lab 03 — Math for ML: Linear Algebra, Calculus & Probability

Phase: 0 — Foundations | Difficulty: ⭐⭐⭐⭐☆
Files: lab.py, solution.py


Linear Algebra

Dot Product & Matrix Multiplication

The forward pass of every neural network is a series of matrix multiplications:

$$\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}$$

For a batch of inputs $X \in \mathbb{R}^{N \times D_{in}}$ and weight matrix $W \in \mathbb{R}^{D_{in} \times D_{out}}$:

$$Y = XW + \mathbf{b} \quad \in \mathbb{R}^{N \times D_{out}}$$

Geometric interpretation: A linear layer projects inputs from $\mathbb{R}^{D_{in}}$ to $\mathbb{R}^{D_{out}}$ — a change of basis. The weight matrix $W$ encodes this transformation.


Eigendecomposition

For a square matrix $A$: $$A\mathbf{v} = \lambda \mathbf{v}$$

$\mathbf{v}$ is an eigenvector, $\lambda$ is the corresponding eigenvalue.

The full decomposition: $A = Q \Lambda Q^{-1}$ where $Q$ is the matrix of eigenvectors and $\Lambda$ is the diagonal matrix of eigenvalues.

For symmetric matrices (covariance matrices are always symmetric): $A = Q \Lambda Q^T$ (orthogonal $Q$).

CV Applications:

  • PCA (Eigenfaces): Eigenvectors of the face covariance matrix are "eigenfaces". The top $k$ eigenvectors capture the most variance in face images.
  • Harris corner detector: Uses eigenvalues of the structure tensor $M$:
    • $\lambda_1 \approx \lambda_2 \gg 0$: corner
    • $\lambda_1 \gg \lambda_2 \approx 0$: edge
    • $\lambda_1 \approx \lambda_2 \approx 0$: flat region

Singular Value Decomposition (SVD)

$$M = U \Sigma V^T$$

Unlike eigendecomposition, SVD works for any matrix (not just square/symmetric).

  • $U \in \mathbb{R}^{m \times m}$: left singular vectors (orthonormal)
  • $\Sigma \in \mathbb{R}^{m \times n}$: diagonal, singular values $\sigma_1 \geq \sigma_2 \geq \ldots \geq 0$
  • $V^T \in \mathbb{R}^{n \times n}$: right singular vectors (orthonormal)

Relationship to eigendecomposition:

  • Columns of $U$ = eigenvectors of $MM^T$
  • Columns of $V$ = eigenvectors of $M^TM$
  • $\sigma_i = \sqrt{\lambda_i(M^TM)}$

Low-rank approximation: $M_k = U_k \Sigma_k V_k^T$ minimizes the Frobenius norm $|M - M_k|_F$ over all rank-$k$ matrices. (Eckart–Young theorem)


Norms

NormFormulaUse in CV/ML
L1 ($\ell_1$)$\sum_ix_i
L2 ($\ell_2$)$\sqrt{\sum_i x_i^2}$Ridge regression, weight decay, distance between embeddings
L∞$\max_ix_i
Frobenius$\sqrt{\sum_{i,j} A_{ij}^2}$Matrix regularization

L2 normalization of feature vectors (used before cosine similarity): $$\hat{\mathbf{v}} = \frac{\mathbf{v}}{|\mathbf{v}|_2}$$

Then cosine similarity: $\cos(\theta) = \hat{\mathbf{u}} \cdot \hat{\mathbf{v}}$ — no division needed.


Covariance Matrix & PCA

For a dataset $X \in \mathbb{R}^{N \times D}$ (zero-centered):

$$\Sigma = \frac{1}{N-1} X^T X \in \mathbb{R}^{D \times D}$$

PCA: Eigendecompose $\Sigma = Q \Lambda Q^T$, project $X$ onto top-$k$ eigenvectors:

$$Z = X Q_k \in \mathbb{R}^{N \times k}$$

This is equivalent to: $X = U \Sigma V^T \Rightarrow$ top-$k$ principal components are columns of $V$ (or rows of $V^T$).


Calculus for Neural Networks

Chain Rule (The Core of Backpropagation)

For a composition $f(g(x))$: $$\frac{d}{dx}f(g(x)) = f'(g(x)) \cdot g'(x)$$

For multivariable functions (the chain rule that powers backprop): $$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}$$

Backpropagation example:

Given: $L = \text{MSE}(\mathbf{y}, \hat{\mathbf{y}})$, $\hat{\mathbf{y}} = \sigma(\mathbf{z})$, $\mathbf{z} = W\mathbf{x} + \mathbf{b}$

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z} \cdot \frac{\partial z}{\partial W}$$

Each term:

  • $\frac{\partial L}{\partial \hat{y}} = \frac{2}{N}(\hat{y} - y)$ (MSE gradient)
  • $\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1-\sigma(z))$ (sigmoid gradient)
  • $\frac{\partial z}{\partial W} = \mathbf{x}^T$ (linear layer gradient)

Gradient Descent

$$W \leftarrow W - \eta \cdot \nabla_W L$$

Why it works: The gradient $\nabla_W L$ points in the direction of steepest ascent. Subtracting it moves toward a local minimum.

Variants:

  • SGD: Update with one sample (or mini-batch). Noisy but escapes local minima.
  • Momentum: $v \leftarrow \beta v + (1-\beta)\nabla L$, $W \leftarrow W - \eta v$
  • Adam: Adaptive learning rate per parameter. $m \leftarrow \beta_1 m + (1-\beta_1)\nabla L$, $v \leftarrow \beta_2 v + (1-\beta_2)\nabla L^2$. Corrected: $\hat{m} = m/(1-\beta_1^t)$, $W \leftarrow W - \eta \hat{m}/(\sqrt{\hat{v}}+\epsilon)$

Convolution (Mathematical Definition)

Discrete 2D convolution of image $I$ with kernel $K$: $$(I * K)[i,j] = \sum_m \sum_n I[i-m, j-n] \cdot K[m,n]$$

In deep learning, what's called "convolution" is actually cross-correlation: $$(I \star K)[i,j] = \sum_m \sum_n I[i+m, j+n] \cdot K[m,n]$$

(The kernel is not flipped, unlike true convolution. The network learns to flip weights during training if needed.)

Output size formula: For input $(H, W)$, kernel $(k, k)$, padding $p$, stride $s$: $$H_{out} = \lfloor \frac{H + 2p - k}{s} \rfloor + 1$$


Probability for ML

Bayes' Theorem

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

In ML terms: $$P(\text{class}|\text{image}) = \frac{P(\text{image}|\text{class}) \cdot P(\text{class})}{P(\text{image})}$$

  • $P(\text{class})$: prior — class distribution in training data
  • $P(\text{image}|\text{class})$: likelihood — how likely is this image from this class
  • $P(\text{class}|\text{image})$: posterior — what the classifier outputs

Class imbalance = a poorly calibrated prior. Techniques to fix: class weights in loss, oversampling, focal loss.


Entropy, Cross-Entropy, KL Divergence

Entropy (information content of distribution $P$): $$H(P) = -\sum_i P(i) \log P(i)$$

Cross-entropy (how well $Q$ approximates $P$): $$H(P, Q) = -\sum_i P(i) \log Q(i)$$

Classification loss = cross-entropy between one-hot label $P$ and softmax output $Q$: $$L = -\sum_i y_i \log \hat{p}i$$ For single label (one-hot): $L = -\log \hat{p}{true}$

KL Divergence (asymmetric "distance" between distributions): $$D_{KL}(P | Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)} = H(P,Q) - H(P)$$

Used in VAEs, knowledge distillation, and feature alignment.


Interview Questions

Q: Explain SVD and name 3 applications in computer vision.
A: SVD decomposes matrix $M = U\Sigma V^T$ into two orthogonal bases and a diagonal scaling. Applications: (1) Image compression — rank-$k$ approximation via top $k$ singular values. (2) PCA / Eigenfaces — SVD of the centered data matrix gives principal components. (3) Essential/Fundamental matrix computation — the 8-point algorithm solves for the Fundamental matrix $F$ via SVD; the essential matrix $E$ is further constrained by enforcing the two smallest singular values are equal and the largest is 1.

Q: How does gradient descent find a minimum? What can go wrong?
A: Gradient descent iteratively moves parameters in the direction opposite to the gradient, shrinking loss. Problems: (1) Local minima / saddle points — in high-dimensional spaces, saddle points are more common than local minima; stochastic noise helps escape them. (2) Vanishing gradients — gradients near zero prevent early layers from learning; fixed by ReLU activations and residual connections. (3) Exploding gradients — large gradients cause divergence; fixed by gradient clipping. (4) Poor learning rate — too high diverges, too low is very slow; adaptive optimizers (Adam) mitigate this.

Q: Why is cross-entropy used as the classification loss rather than MSE?
A: Cross-entropy aligns with the probabilistic interpretation (maximizing log-likelihood of correct class). For classification, MSE penalizes equally everywhere, but we want to penalize confident wrong predictions very heavily. Cross-entropy with softmax: gradient is $\hat{p} - y$ — simple and well-scaled. MSE with sigmoid produces vanishing gradients when predictions are saturated (near 0 or 1), making early training very slow.