Lab 03 — Feature Detection: Harris, SIFT, ORB & Homography
Phase: 1 — Classical CV | Difficulty: ⭐⭐⭐⭐☆
What Are Keypoints & Descriptors?
A keypoint is a distinctive location in an image (corner, blob, etc.) characterized by:
- Position $(x, y)$
- Scale (size of the neighborhood used to describe it)
- Orientation (dominant gradient direction for rotation invariance)
- Response (detection strength)
A descriptor is a compact vector representation of the local appearance around a keypoint, designed to be distinctive and invariant to certain transformations.
Harris Corner Detector
Based on the structure tensor (second-moment matrix) of local image gradients:
$$M = \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \ I_x I_y & I_y^2 \end{bmatrix}$$
The eigenvalues $\lambda_1, \lambda_2$ of $M$ reveal the local structure:
| $\lambda_1$ | $\lambda_2$ | Region |
|---|---|---|
| Both large | Both large | Corner — large variation in all directions |
| One large, one small | Edge — large variation in one direction only | |
| Both small | Both small | Flat — little variation |
Harris response function (avoids computing eigenvalues directly): $$R = \det(M) - k \cdot \text{trace}(M)^2 = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2$$
Typical $k = 0.04$–$0.06$.
Limitations: Not scale-invariant (a corner at one scale is an edge at another scale).
SIFT — Scale-Invariant Feature Transform (Lowe, 2004)
SIFT achieves scale, rotation, and partial illumination invariance. The algorithm:
1. Scale-Space Extrema Detection
Build a Gaussian pyramid: apply Gaussians with increasing $\sigma$ to the image. Then compute Difference of Gaussians (DoG) between consecutive levels: $$D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y)$$
DoG approximates the Laplacian of Gaussian (blob detector). Find local extrema (maxima and minima) across scale and space.
2. Keypoint Localization
Refine extrema positions via Taylor expansion. Discard low-contrast candidates and keypoints on edges (using Harris-like eigenvalue ratio criterion: $r = \lambda_{max}/\lambda_{min}$, discard if $r > 10$).
3. Orientation Assignment
For each keypoint, compute gradient magnitude and direction in the local region (scaled by $\sigma$). Build a histogram of gradient orientations (36 bins). The dominant orientation becomes the keypoint's orientation — this enables rotation invariance.
4. Descriptor Computation
Take a $16 \times 16$ window around the keypoint, divide into $4 \times 4$ cells. In each cell, compute an 8-bin gradient orientation histogram. Concatenate: $4 \times 4 \times 8 = 128$-dimensional descriptor. Normalize for illumination invariance.
128-dim SIFT descriptor: distinctive and relatively robust. Matching uses L2 distance.
ORB — Oriented FAST and Rotated BRIEF
ORB was designed as a free, faster alternative to SIFT/SURF (SIFT has a patent, though it expired in 2020).
FAST (Features from Accelerated Segment Test)
A keypoint is a corner if a contiguous arc of $n \geq 9$ pixels (out of 16) on a circle of radius 3 are all brighter or all darker than the center by a threshold. Very fast (no gradients computed).
BRIEF (Binary Robust Independent Elementary Features)
Computes a binary string descriptor by comparing intensity pairs in the keypoint's neighborhood. Sampling locations are predetermined from a pattern. 256-bit descriptor — matched via Hamming distance (XOR + popcount), much faster than L2.
ORB's contribution: Makes FAST orientation-aware (using intensity centroid), and makes BRIEF rotation-invariant (rotate the sampling pattern by the keypoint's orientation, "rBRIEF").
Comparison:
| Feature | SIFT | ORB | AKAZE |
|---|---|---|---|
| Descriptor size | 128 float32 = 512 bytes | 256 bits = 32 bytes | 61 bytes |
| Scale invariant | ✅ | ✅ | ✅ |
| Rotation invariant | ✅ | ✅ | ✅ |
| Affine invariant | ❌ | ❌ | Partial |
| Speed | Slow | Fast | Medium |
| Patent (2024) | Free | Free | Free |
Feature Matching
Brute Force Matcher (BFMatcher)
Compares every descriptor in set A with every descriptor in set B: O(N·M) distance computations.
- For SIFT: use
cv2.NORM_L2 - For ORB/BRIEF: use
cv2.NORM_HAMMING
FLANN (Fast Library for Approximate Nearest Neighbors)
Uses tree-based structures (KD-trees for float descriptors, LSH for binary). Much faster than brute force for large descriptor sets, at the cost of occasional missed matches.
Lowe's Ratio Test
Keep a match only if the best match is significantly better than the second best: $$\frac{d_1}{d_2} < 0.75$$
Eliminates ambiguous matches where two features look similar.
Homography
A homography is a projective transformation mapping points between two planes:
$$\begin{bmatrix} x' \ y' \ 1 \end{bmatrix} \sim H \begin{bmatrix} x \ y \ 1 \end{bmatrix}, \quad H \in \mathbb{R}^{3 \times 3}$$
Has 8 degrees of freedom (9 elements, but scale is arbitrary so divide by H[2,2]).
Applications:
- Panorama stitching: align overlapping photos
- Document scanning: "deskew" a photographed page
- AR marker tracking: map screen coordinates onto a marker plane
- Visual localization: match current view to a reference map
RANSAC (Random Sample Consensus): Required because matched features include outliers. Algorithm:
- Randomly sample 4 point pairs (minimum to solve homography)
- Compute $H$ from these 4 pairs
- Count inliers: points where $|Hx - x'| < \epsilon$ ("reprojection error")
- Keep the $H$ with the most inliers
- Refit $H$ on all inliers
H, mask = cv2.findHomography(pts_src, pts_dst, cv2.RANSAC, ransacReprojThreshold=5.0)
Interview Questions
Q: Explain how SIFT achieves scale invariance.
A: SIFT detects keypoints in scale-space — an image pyramid where each level is blurred by a larger sigma. Keypoints are found at local extrema in this 3D (x, y, scale) space. Because the same corner detected at different scales corresponds to the same physical feature, you can match across scale changes. The keypoint's scale is the sigma at which it was detected, and the descriptor is computed using a window scaled proportionally — so the descriptor always captures the same physical region regardless of image scale.
Q: Why is the Lowe ratio test used in feature matching, and what does the 0.75 threshold mean?
A: The ratio test compares the distance to the best match vs the second-best match. If the ratio is < 0.75, the best match is "clearly better" than alternatives, so the match is considered reliable. If ratio ≥ 0.75, the feature looks similar to multiple candidates — it's ambiguous. The 0.75 threshold was empirically found by Lowe to give the best tradeoff between false positives (incorrect matches) and false negatives (missed correct matches). For stricter matching (less false positives), lower the threshold; for more matches, raise it.
Q: What is RANSAC and why is it needed for homography estimation?
A: Feature matching always produces some incorrect (outlier) matches even after the ratio test. Standard least-squares for homography fitting assumes all input correspondences are correct — one outlier can dominate the solution. RANSAC is a robust fitting algorithm: it randomly samples the minimum number of points needed (4 for homography), computes a candidate model, then counts how many other points fit that model (inliers). This is repeated many times; the model with the most inliers wins. Final homography is re-fit on all inliers. RANSAC is used everywhere: SLAM, structure-from-motion, PnP pose estimation.