Phase 06: State-of-the-Art Vision Models

Weeks 13-14 | 3 Labs

The modern CV engineer must understand the architectural innovations that power GPT-4V, SAM, CLIP, and DINO. This phase builds ViT, CLIP, and DINO from scratch to develop deep architectural intuition.

Why SOTA Models?

Vision Transformers (ViT): have replaced CNNs as the backbone in many SOTA systems
CLIP: foundation for zero-shot recognition, image search, and VLMs
Self-supervised learning (DINO/MAE): can reduce label dependency, by as much as 10-100x in favorable settings
Many top-tier roles (Google DeepMind, Meta AI, OpenAI) expect fluency here

Lab Structure

Lab	Topic	Key Concepts
lab-01-vision-transformer	ViT from scratch	Patch embedding, positional encoding, TransformerEncoder, attention maps
lab-02-clip-contrastive	CLIP-style contrastive learning	InfoNCE loss, image-text alignment, zero-shot classification
lab-03-dino-self-supervised	DINO-style self-supervised	Student-teacher with EMA, multi-crop, centering + sharpening
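
To make the patch-embedding and positional-encoding concepts of lab-01 concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the lab's implementation: the labs may use a different framework, and the names `patchify`, `W_embed`, and `pos_embed`, the 224x224 input, and the random (untrained) weights are all hypothetical. The class token is omitted for brevity.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))   # stand-in for a real image
patches = patchify(image, 16)                # 14 x 14 = 196 patches of 16*16*3 = 768 values

embed_dim = 768                              # assumed model width
# Random stand-ins for parameters that would normally be learned.
W_embed = rng.standard_normal((patches.shape[1], embed_dim)) * 0.02
pos_embed = rng.standard_normal((patches.shape[0], embed_dim)) * 0.02

# Each patch becomes one token: linear projection plus a positional embedding.
tokens = patches @ W_embed + pos_embed
print(tokens.shape)
```

The resulting `tokens` array is what a Transformer encoder would consume, one row per patch.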

Key Equations

Scaled Dot-Product Attention

$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
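
The attention formula can be sketched in a few lines of NumPy. This is a single-head version without masking or dropout; the helper names and the toy shapes are assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, attention weights) for single-head attention."""
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))    # 4 queries, d_k = 8
K = rng.standard_normal((6, 8))    # 6 keys
V = rng.standard_normal((6, 16))   # 6 values, d_v = 16
out, attn = scaled_dot_product_attention(Q, K, V)
print(out.shape, attn.shape)
```

The `attn` matrix is the quantity visualized as an attention map: row i shows how much query i attends to each key.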

InfoNCE Loss (CLIP)

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\text{sim}(z_i^I, z_i^T)/\tau)}{\sum_{j=1}^{N}\exp(\text{sim}(z_i^I, z_j^T)/\tau)}$$
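
A rough NumPy sketch of this loss follows, taking sim to be cosine similarity. The equation above is the image-to-text direction; CLIP averages it with the symmetric text-to-image term, which `clip_style_loss` does here. The function names, the fixed temperature `tau=0.07`, and the synthetic embeddings are assumptions for illustration (in CLIP the temperature is a learned parameter).

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale rows to unit length so dot products are cosine similarities."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce_image_to_text(z_img, z_txt, tau=0.07):
    """One direction of InfoNCE: row i of z_img should match row i of z_txt."""
    logits = l2_normalize(z_img) @ l2_normalize(z_txt).T / tau   # (N, N)
    log_probs = log_softmax(logits, axis=1)
    # Matching pairs sit on the diagonal.
    return -np.mean(np.diag(log_probs))

def clip_style_loss(z_img, z_txt, tau=0.07):
    """Average of the image-to-text and text-to-image directions."""
    return 0.5 * (info_nce_image_to_text(z_img, z_txt, tau)
                  + info_nce_image_to_text(z_txt, z_img, tau))

rng = np.random.default_rng(0)
z_txt = rng.standard_normal((8, 32))
z_img = z_txt + 0.1 * rng.standard_normal((8, 32))   # roughly aligned pairs
loss_aligned = clip_style_loss(z_img, z_txt)
loss_shuffled = clip_style_loss(z_img, z_txt[::-1])  # deliberately mismatched pairs
print(loss_aligned, loss_shuffled)
```

Aligned pairs should yield a much lower loss than mismatched ones, which is the signal contrastive training optimizes.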

DINO EMA Update (teacher)

$$\theta_t \leftarrow \lambda\,\theta_t + (1-\lambda)\,\theta_s$$
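
Here $\theta_t$ denotes the teacher parameters, $\theta_s$ the student parameters, and $\lambda$ the momentum. A minimal sketch of the update, assuming parameters are stored as a dict of NumPy arrays: the name `ema_update`, the dict layout, and the fixed `momentum=0.996` are illustrative assumptions (DINO schedules the momentum toward 1 during training rather than keeping it constant).

```python
import numpy as np

def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter a small step toward the student's value."""
    for name, theta_s in student_params.items():
        teacher_params[name] = (momentum * teacher_params[name]
                                + (1.0 - momentum) * theta_s)
    return teacher_params

# Toy parameters: the teacher starts at 0, the student sits at 1.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}

ema_update(teacher, student)
print(teacher["w"])   # each entry has moved 0.4% of the way toward the student
```

The teacher receives no gradients; it changes only through this update, so it behaves like a slowly moving average of past student weights.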

Architectural Comparison

Model	Backbone	Pre-training	Zero-shot?	Key Innovation
ResNet	CNN	Supervised	No	Residual connections
ViT-B/16	Transformer	Supervised (JFT-300M)	No	Patches as tokens
CLIP	ViT + Text Enc.	Contrastive (400M pairs)	Yes	Image-text alignment
DINO	ViT	Self-supervised	No (but great features)	Student-teacher + EMA
SAM	ViT-H	Supervised (SA-1B)	Yes	Promptable segmentation
