Lab 03 — Feature Detection: Harris, SIFT, ORB & Homography

Phase: 1 — Classical CV | Difficulty: ⭐⭐⭐⭐☆

What Are Keypoints & Descriptors?

A keypoint is a distinctive location in an image (corner, blob, etc.) characterized by:

Position $(x, y)$
Scale (size of the neighborhood used to describe it)
Orientation (dominant gradient direction for rotation invariance)
Response (detection strength)

A descriptor is a compact vector representation of the local appearance around a keypoint, designed to be distinctive and invariant to certain transformations.

Harris Corner Detector

Based on the structure tensor (second-moment matrix) of local image gradients:

$$M = \sum_{(x,y) \in W} \begin{bmatrix} I_x^2 & I_x I_y \ I_x I_y & I_y^2 \end{bmatrix}$$

The eigenvalues $\lambda_1, \lambda_2$ of $M$ reveal the local structure:

$\lambda_1$	$\lambda_2$	Region
Both large	Both large	Corner — large variation in all directions
One large, one small		Edge — large variation in one direction only
Both small	Both small	Flat — little variation

Harris response function (avoids computing eigenvalues directly): $$R = \det(M) - k \cdot \text{trace}(M)^2 = \lambda_1 \lambda_2 - k(\lambda_1 + \lambda_2)^2$$

Typical $k = 0.04$–$0.06$.

Limitations: Not scale-invariant (a corner at one scale is an edge at another scale).

SIFT — Scale-Invariant Feature Transform (Lowe, 2004)

SIFT achieves scale, rotation, and partial illumination invariance. The algorithm:

1. Scale-Space Extrema Detection

Build a Gaussian pyramid: apply Gaussians with increasing $\sigma$ to the image. Then compute Difference of Gaussians (DoG) between consecutive levels: $$D(x, y, \sigma) = (G(x, y, k\sigma) - G(x, y, \sigma)) * I(x, y)$$

DoG approximates the Laplacian of Gaussian (blob detector). Find local extrema (maxima and minima) across scale and space.

2. Keypoint Localization

Refine extrema positions via Taylor expansion. Discard low-contrast candidates and keypoints on edges (using Harris-like eigenvalue ratio criterion: $r = \lambda_{max}/\lambda_{min}$, discard if $r > 10$).

3. Orientation Assignment

For each keypoint, compute gradient magnitude and direction in the local region (scaled by $\sigma$). Build a histogram of gradient orientations (36 bins). The dominant orientation becomes the keypoint's orientation — this enables rotation invariance.

4. Descriptor Computation

Take a $16 \times 16$ window around the keypoint, divide into $4 \times 4$ cells. In each cell, compute an 8-bin gradient orientation histogram. Concatenate: $4 \times 4 \times 8 = 128$-dimensional descriptor. Normalize for illumination invariance.

128-dim SIFT descriptor: distinctive and relatively robust. Matching uses L2 distance.

ORB — Oriented FAST and Rotated BRIEF

ORB was designed as a free, faster alternative to SIFT/SURF (SIFT has a patent, though it expired in 2020).

FAST (Features from Accelerated Segment Test)

A keypoint is a corner if a contiguous arc of $n \geq 9$ pixels (out of 16) on a circle of radius 3 are all brighter or all darker than the center by a threshold. Very fast (no gradients computed).

BRIEF (Binary Robust Independent Elementary Features)

Computes a binary string descriptor by comparing intensity pairs in the keypoint's neighborhood. Sampling locations are predetermined from a pattern. 256-bit descriptor — matched via Hamming distance (XOR + popcount), much faster than L2.

ORB's contribution: Makes FAST orientation-aware (using intensity centroid), and makes BRIEF rotation-invariant (rotate the sampling pattern by the keypoint's orientation, "rBRIEF").

Comparison:

Feature	SIFT	ORB	AKAZE
Descriptor size	128 float32 = 512 bytes	256 bits = 32 bytes	61 bytes
Scale invariant	✅	✅	✅
Rotation invariant	✅	✅	✅
Affine invariant	❌	❌	Partial
Speed	Slow	Fast	Medium
Patent (2024)	Free	Free	Free

Feature Matching

Brute Force Matcher (BFMatcher)

Compares every descriptor in set A with every descriptor in set B: O(N·M) distance computations.

For SIFT: use cv2.NORM_L2
For ORB/BRIEF: use cv2.NORM_HAMMING

FLANN (Fast Library for Approximate Nearest Neighbors)

Uses tree-based structures (KD-trees for float descriptors, LSH for binary). Much faster than brute force for large descriptor sets, at the cost of occasional missed matches.

Lowe's Ratio Test

Keep a match only if the best match is significantly better than the second best: $$\frac{d_1}{d_2} < 0.75$$

Eliminates ambiguous matches where two features look similar.

Homography

A homography is a projective transformation mapping points between two planes:

$$\begin{bmatrix} x' \ y' \ 1 \end{bmatrix} \sim H \begin{bmatrix} x \ y \ 1 \end{bmatrix}, \quad H \in \mathbb{R}^{3 \times 3}$$

Has 8 degrees of freedom (9 elements, but scale is arbitrary so divide by H[2,2]).

Applications:

Panorama stitching: align overlapping photos
Document scanning: "deskew" a photographed page
AR marker tracking: map screen coordinates onto a marker plane
Visual localization: match current view to a reference map

RANSAC (Random Sample Consensus): Required because matched features include outliers. Algorithm:

Randomly sample 4 point pairs (minimum to solve homography)
Compute $H$ from these 4 pairs
Count inliers: points where $|Hx - x'| < \epsilon$ ("reprojection error")
Keep the $H$ with the most inliers
Refit $H$ on all inliers

H, mask = cv2.findHomography(pts_src, pts_dst, cv2.RANSAC, ransacReprojThreshold=5.0)

Interview Questions

Q: Explain how SIFT achieves scale invariance.
A: SIFT detects keypoints in scale-space — an image pyramid where each level is blurred by a larger sigma. Keypoints are found at local extrema in this 3D (x, y, scale) space. Because the same corner detected at different scales corresponds to the same physical feature, you can match across scale changes. The keypoint's scale is the sigma at which it was detected, and the descriptor is computed using a window scaled proportionally — so the descriptor always captures the same physical region regardless of image scale.

Q: Why is the Lowe ratio test used in feature matching, and what does the 0.75 threshold mean?
A: The ratio test compares the distance to the best match vs the second-best match. If the ratio is < 0.75, the best match is "clearly better" than alternatives, so the match is considered reliable. If ratio ≥ 0.75, the feature looks similar to multiple candidates — it's ambiguous. The 0.75 threshold was empirically found by Lowe to give the best tradeoff between false positives (incorrect matches) and false negatives (missed correct matches). For stricter matching (less false positives), lower the threshold; for more matches, raise it.

Q: What is RANSAC and why is it needed for homography estimation?
A: Feature matching always produces some incorrect (outlier) matches even after the ratio test. Standard least-squares for homography fitting assumes all input correspondences are correct — one outlier can dominate the solution. RANSAC is a robust fitting algorithm: it randomly samples the minimum number of points needed (4 for homography), computes a candidate model, then counts how many other points fit that model (inliers). This is repeated many times; the model with the most inliers wins. Final homography is re-fit on all inliers. RANSAC is used everywhere: SLAM, structure-from-motion, PnP pose estimation.

AI Engineer — Role-Based Learning Hub