Phase 06: State-of-the-Art Vision Models
Weeks 13-14 | 3 Labs
The modern CV engineer must understand the architectural innovations that power GPT-4V, SAM, CLIP, and DINO. This phase builds ViT, CLIP, and DINO from scratch to develop deep architectural intuition.
Why SOTA Models?
- Vision Transformers (ViT): have replaced CNNs as the backbone in many SOTA systems
- CLIP: foundation for zero-shot recognition, image search, and VLMs
- Self-supervised learning (DINO/MAE): pre-trains on unlabeled images, which can reduce the labels needed downstream by roughly 10-100x depending on the task
- Top-tier research and engineering roles (Google DeepMind, Meta AI, OpenAI) expect fluency here
Lab Structure
| Lab | Topic | Key Concepts |
|---|---|---|
| lab-01-vision-transformer | ViT from scratch | Patch embedding, positional encoding, TransformerEncoder, attention maps |
| lab-02-clip-contrastive | CLIP-style contrastive learning | InfoNCE loss, image-text alignment, zero-shot classification |
| lab-03-dino-self-supervised | DINO-style self-supervised | Student-teacher with EMA, multi-crop, centering + sharpening |
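As a preview of the patch embedding step in lab-01, here is a minimal NumPy sketch of how an image becomes a sequence of tokens. The function names `patchify` and `embed_patches`, the `(H, W, C)` layout, and the random projection and positional arrays are illustrative assumptions, not the lab's actual code.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    with patches ordered row by row across the image grid.
    """
    h, w, c = image.shape
    if h % patch_size or w % patch_size:
        raise ValueError("image size must be divisible by patch_size")
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) so each patch is contiguous
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch_size * patch_size * c)

def embed_patches(image, patch_size, w_proj, pos_emb):
    """Linearly project patches to D dimensions and add positional embeddings.

    w_proj: (patch_size * patch_size * C, D), pos_emb: (num_patches, D).
    Both would be learned in a real model; here they are plain arrays.
    """
    tokens = patchify(image, patch_size) @ w_proj
    return tokens + pos_emb
```

For a 224x224 RGB image with 16x16 patches, `patchify` yields 196 patches of 768 values each, which is where the "/16" in ViT-B/16 comes from. A full ViT would also prepend a learnable class token before the encoder; that step is omitted here.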
Key Equations
Scaled Dot-Product Attention
$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
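The attention formula above can be sketched in a few lines of NumPy. This is a single-head version without masking or dropout; the function names and the choice to also return the attention weights (handy for attention maps) are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Single-head attention.

    q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v).
    Returns (output, weights) with shapes (n_q, d_v) and (n_q, n_k).
    """
    d_k = q.shape[-1]
    # Dividing by sqrt(d_k) keeps the score scale from growing with d_k
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ v, weights
```

Each row of `weights` is a probability distribution over the keys, so each output row is a weighted average of the rows of `v`.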
InfoNCE Loss (CLIP, image-to-text direction)
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\text{sim}(z_i^I, z_i^T)/\tau)}{\sum_{j=1}^{N}\exp(\text{sim}(z_i^I, z_j^T)/\tau)}$$
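A minimal NumPy sketch of this loss follows, assuming $\text{sim}$ is cosine similarity and that row $i$ of each embedding matrix forms a matched image-text pair. The equation above is the image-to-text term; CLIP averages it with the mirrored text-to-image term, shown here as `clip_symmetric_loss`. The function names and the default `tau=0.07` are illustrative choices (CLIP learns its temperature during training).

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    x = x - np.max(x, axis=axis, keepdims=True)
    return x - np.log(np.sum(np.exp(x), axis=axis, keepdims=True))

def info_nce(anchor_emb, target_emb, tau=0.07):
    """One-directional InfoNCE over a batch of N matched pairs.

    anchor_emb, target_emb: (N, D); row i of each is a positive pair.
    """
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)
    logits = a @ t.T / tau  # logits[i, j] = sim(a_i, t_j) / tau
    log_probs = log_softmax(logits, axis=1)
    # The positives sit on the diagonal
    return -np.mean(np.diag(log_probs))

def clip_symmetric_loss(image_emb, text_emb, tau=0.07):
    """Average of the image-to-text and text-to-image terms."""
    return 0.5 * (info_nce(image_emb, text_emb, tau)
                  + info_nce(text_emb, image_emb, tau))
```

The loss is low when each image embedding is closer to its own caption than to the other captions in the batch, which is why larger batches supply more negatives per example.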
DINO EMA Update (teacher)
$$\theta_t \leftarrow \lambda\,\theta_t + (1-\lambda)\,\theta_s$$
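The update rule is easy to sketch with parameters stored as a dictionary of NumPy arrays. The dictionary layout, the function name `ema_update`, and the default `lam=0.996` are assumptions for illustration; in practice the momentum is usually scheduled toward 1 over training.

```python
import numpy as np

def ema_update(teacher, student, lam=0.996):
    """Update teacher parameters in place as an EMA of student parameters.

    teacher, student: dicts mapping parameter names to same-shaped arrays.
    The teacher receives no gradients; it only tracks the student.
    """
    for name, s_param in student.items():
        teacher[name] = lam * teacher[name] + (1.0 - lam) * s_param
    return teacher

# Usage: with lam close to 1 the teacher moves slowly toward the student
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
ema_update(teacher, student, lam=0.9)  # teacher["w"] is now 0.1 everywhere
```

Because the teacher is a slow-moving average of past students, its outputs are more stable targets than the student's own predictions.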
Architectural Comparison
| Model | Backbone | Pre-training | Zero-shot? | Key Innovation |
|---|---|---|---|---|
| ResNet | CNN | Supervised | No | Residual connections |
| ViT-B/16 | Transformer | Supervised (JFT-300M) | No | Patches as tokens |
| CLIP | ViT + Text Enc. | Contrastive (400M pairs) | Yes | Image-text alignment |
| DINO | ViT | Self-supervised | No (but great features) | Student-teacher + EMA |
| SAM | ViT-H | Supervised on SA-1B (MAE-initialized encoder) | Yes | Promptable segmentation |