Phase 06: State-of-the-Art Vision Models

Weeks 13-14 | 3 Labs

The modern CV engineer must understand the architectural innovations that power GPT-4V, SAM, CLIP, and DINO. This phase builds ViT, CLIP, and DINO from scratch to develop deep architectural intuition.

Why SOTA Models?

  • Vision Transformers (ViT): have replaced CNNs as the backbone in most SOTA systems
  • CLIP: the foundation for zero-shot recognition, image search, and VLMs
  • Self-supervised learning (DINO/MAE): can reduce label dependency by 10-100x
  • Every top-tier role (Google Brain, Meta AI, OpenAI) expects fluency here

Lab Structure

| Lab | Topic | Key Concepts |
| --- | --- | --- |
| lab-01-vision-transformer | ViT from scratch | Patch embedding, positional encoding, TransformerEncoder, attention maps |
| lab-02-clip-contrastive | CLIP-style contrastive learning | InfoNCE loss, image-text alignment, zero-shot classification |
| lab-03-dino-self-supervised | DINO-style self-supervised | Student-teacher with EMA, multi-crop, centering + sharpening |

Key Equations

Scaled Dot-Product Attention

$$\text{Attention}(Q,K,V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
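
The attention formula can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the lab's implementation: the function names, the toy token matrix, and the choice of NumPy over a deep learning framework are all assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Single-head attention.

    q: (n_q, d_k), k: (n_k, d_k), v: (n_k, d_v).
    Returns the attended values (n_q, d_v) and the attention weights (n_q, n_k).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # QK^T / sqrt(d_k)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

# Hypothetical input: 5 patch tokens of width 8, used as Q, K, and V (self-attention).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))
out, attn = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `attn` is a probability distribution over the input tokens, which is what an attention map visualises for a given query patch.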

InfoNCE Loss (CLIP)

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp(\text{sim}(z_i^I, z_i^T)/\tau)}{\sum_{j=1}^{N}\exp(\text{sim}(z_i^I, z_j^T)/\tau)}$$
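
As a rough sketch, the loss above can be computed with NumPy as follows. It assumes `sim` is cosine similarity (embeddings are L2-normalised first) and uses `tau=0.07` purely as an illustrative value; the function names are hypothetical. The equation covers the image-to-text direction only, while CLIP averages it with the text-to-image direction, so a symmetric variant is included.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def info_nce(image_emb, text_emb, tau=0.07):
    """Image-to-text InfoNCE over a batch of N matching pairs, shape (N, d) each."""
    # L2-normalise so the dot product equals cosine similarity.
    z_i = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    z_t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = z_i @ z_t.T / tau           # entry [i, j] = sim(z_i^I, z_j^T) / tau
    # Matching pairs sit on the diagonal; average their negative log-probability.
    return -np.mean(np.diag(log_softmax(logits, axis=1)))

def symmetric_info_nce(image_emb, text_emb, tau=0.07):
    # Average of the image-to-text and text-to-image directions.
    return 0.5 * (info_nce(image_emb, text_emb, tau) + info_nce(text_emb, image_emb, tau))

# Toy batches: perfectly aligned pairs, mismatched pairs, and uninformative text embeddings.
aligned = info_nce(np.eye(4), np.eye(4))
shuffled = info_nce(np.eye(4), np.roll(np.eye(4), 1, axis=0))
uniform = info_nce(np.eye(4), np.ones((4, 4)))
```

When every text embedding is identical, the logits carry no information and the loss equals log N, which is a handy sanity check for a new implementation.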

DINO EMA Update (teacher)

$$\theta_t \leftarrow \lambda\,\theta_t + (1-\lambda)\,\theta_s$$
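
The teacher update can be illustrated with a small NumPy sketch. Parameters are stored here as plain dictionaries of arrays, which is an assumption for the example rather than how the lab organises its models; `ema_update` and the parameter names are hypothetical, and `lam=0.9` is chosen only so the effect is visible within a few steps.

```python
import numpy as np

def ema_update(teacher, student, lam=0.996):
    """Return new teacher parameters: lam * theta_t + (1 - lam) * theta_s.

    teacher, student: dicts mapping parameter names to arrays of matching shape.
    The teacher is never updated by gradients, only by this moving average.
    """
    return {
        name: lam * teacher[name] + (1.0 - lam) * student[name]
        for name in teacher
    }

# Toy example: the student is held fixed, so the teacher drifts towards it.
student = {"w": np.ones(3), "b": np.zeros(1)}
teacher = {"w": np.zeros(3), "b": np.zeros(1)}
for _ in range(100):
    teacher = ema_update(teacher, student, lam=0.9)
```

With a fixed student, the gap between teacher and student shrinks by a factor of `lam` at every step, so values of `lam` close to 1 give a slowly moving, smoother teacher.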

Architectural Comparison

| Model | Backbone | Pre-training | Zero-shot? | Key Innovation |
| --- | --- | --- | --- | --- |
| ResNet | CNN | Supervised | No | Residual connections |
| ViT-B/16 | Transformer | Supervised (JFT-300M) | No | Patches as tokens |
| CLIP | ViT + Text Enc. | Contrastive (400M pairs) | Yes | Image-text alignment |
| DINO | ViT | Self-supervised | No (but great features) | Student-teacher + EMA |
| SAM | ViT-H | SA-1B dataset | Yes | Promptable segmentation |