Lab 6-03: DINO-Style Self-Supervised Learning

Learning Objectives

  • Understand the student-teacher self-supervised paradigm
  • Implement EMA (Exponential Moving Average) teacher update
  • Build multi-crop strategy: global + local crops
  • Implement centering + sharpening (the stability tricks in DINO)
  • Train on synthetic data and verify embeddings cluster by class
  • Understand why DINO features have excellent k-NN classification performance

DINO Architecture

Image
 │
 ├─── Global crop 1 ─→ Student ─→ softmax(z/τ_s)      ← student, sharpened
 ├─── Global crop 2 ─→ Teacher ─→ softmax((z-c)/τ_t)   ← teacher, centered+sharpened
 ├─── Local crop 1  ─→ Student
 └─── Local crop 2  ─→ Student
        │
        ▼ Cross-entropy loss: student ← teacher (teacher's output is the "label")
        ▼ Teacher receives NO gradient — updated via EMA
        
θ_teacher ← λ·θ_teacher + (1-λ)·θ_student
center c  ← m·c + (1-m)·mean(teacher_output)  ← centering prevents collapse
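
The two update rules above can be sketched in a few lines. This is a minimal NumPy illustration, not the lab's reference implementation: the parameter dictionaries, the function names `ema_update` and `update_center`, and the default values `lam=0.996` and `m=0.9` are assumptions chosen for the example.

```python
import numpy as np

def ema_update(teacher, student, lam=0.996):
    """Update the teacher dict: teacher <- lam * teacher + (1 - lam) * student."""
    for name, w_s in student.items():
        teacher[name] = lam * teacher[name] + (1.0 - lam) * w_s
    return teacher

def update_center(center, teacher_logits, m=0.9):
    """center <- m * center + (1 - m) * batch mean of the raw teacher outputs."""
    batch_mean = teacher_logits.mean(axis=0)
    return m * center + (1.0 - m) * batch_mean

# Toy parameters: name -> array (hypothetical stand-in for real model weights).
student = {"w": np.ones((2, 2)), "b": np.zeros(2)}
teacher = {"w": np.zeros((2, 2)), "b": np.zeros(2)}
ema_update(teacher, student, lam=0.9)
# teacher["w"] moved 10% of the way toward the student: about 0.1 everywhere.

center = np.zeros(3)
teacher_logits = np.array([[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]])
center = update_center(center, teacher_logits, m=0.9)
# Batch mean is [2, 2, 2], so the center becomes about [0.2, 0.2, 0.2].
```

No gradient is involved in either function: the teacher and the center change only through these moving averages, which is what the stop-gradient in the diagram amounts to.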

Why These Tricks Are Necessary

| Trick | Problem Solved |
|---|---|
| EMA teacher | Provides stable targets; gradients only flow through student |
| Centering | Prevents mode collapse (all outputs → same prototype) |
| Sharpening (low τ) | Prevents uniform distribution collapse |
| Multi-crop | More views per image → better representations, lower cost |
| Stop-gradient on teacher | Teacher is never directly optimized; momentum update only |
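
As an illustration of the multi-crop row, the sketch below produces a few large "global" views and a few small "local" views of one image. It is a NumPy-only approximation: the helper names, the nearest-neighbour resize, the output sizes (32 and 16 pixels), and the area ranges `(0.4, 1.0)` and `(0.05, 0.4)` are all assumptions for this example, and a real pipeline would add flips, colour jitter, and a proper resize.

```python
import numpy as np

def random_crop(img, scale_range, rng):
    """Crop a random square covering a fraction of the image area in scale_range."""
    h, w = img.shape[:2]
    area_frac = rng.uniform(*scale_range)
    side = max(1, int(round(np.sqrt(area_frac * h * w))))
    side = min(side, h, w)
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    return img[top:top + side, left:left + side]

def resize_nearest(img, size):
    """Nearest-neighbour resize to (size, size); a stand-in for a real resize op."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows][:, cols]

def multi_crop(img, rng, n_global=2, n_local=2,
               global_size=32, local_size=16,
               global_scale=(0.4, 1.0), local_scale=(0.05, 0.4)):
    """Return (global_views, local_views) for a single image."""
    global_views = [resize_nearest(random_crop(img, global_scale, rng), global_size)
                    for _ in range(n_global)]
    local_views = [resize_nearest(random_crop(img, local_scale, rng), local_size)
                   for _ in range(n_local)]
    return global_views, local_views

rng = np.random.default_rng(0)
img = rng.random((64, 64, 3))          # synthetic H x W x C image
global_views, local_views = multi_crop(img, rng)
```

With these sizes each local view has a quarter of the pixels of a global view, which is where the "lower cost" in the table comes from: extra views are added at low resolution.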

DINO Loss

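$$\mathcal{L} = -\sum_{x \in \text{teacher crops}} \; \sum_{\substack{x' \in \text{student crops} \\ x' \neq x}} p_t(x) \cdot \log p_s(x')$$

where:

  • $p_t(x) = \text{softmax}((z_t(x) - c) / \tau_t)$ (centered + sharpened)
  • $p_s(x') = \text{softmax}(z_s(x') / \tau_s)$ (sharpened, no centering)
  • the dot product runs over the output dimensions, and the pair $x' = x$ (both networks seeing the same view) is skipped; in the original DINO setup the teacher sees only the global crops while the student sees every crop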
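
The loss can be sketched in NumPy as follows. The function name `dino_loss`, the list-of-arrays layout (global crops first), and the temperatures `tau_s=0.1` and `tau_t=0.04` are assumptions for this example. The sketch also skips pairs in which both networks see the same view and averages over pairs and batch instead of summing; both are conventions assumed here rather than something the formula above requires.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)      # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def dino_loss(student_out, teacher_out, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student distributions over crop pairs.

    student_out: list of (B, K) arrays, one per crop, global crops first.
    teacher_out: list of (B, K) arrays, one per global crop, same order.
    """
    total, n_terms = 0.0, 0
    for t_idx, z_t in enumerate(teacher_out):
        p_t = softmax((z_t - center) / tau_t)    # centered + sharpened target
        for s_idx, z_s in enumerate(student_out):
            if s_idx == t_idx:
                continue                         # same view on both sides: skip
            log_p_s = log_softmax(z_s / tau_s)   # sharpened, no centering
            total += -(p_t * log_p_s).sum(axis=-1).mean()
            n_terms += 1
    return total / n_terms

rng = np.random.default_rng(0)
B, K = 3, 4
teacher_out = [rng.normal(size=(B, K)) for _ in range(2)]   # 2 global crops
student_out = [rng.normal(size=(B, K)) for _ in range(4)]   # 2 global + 2 local
center = np.zeros(K)
loss = dino_loss(student_out, teacher_out, center)

# Sanity check: a student that outputs all-equal logits predicts a uniform
# distribution, so the loss equals log(K) whatever the teacher says.
uniform_loss = dino_loss([np.zeros((B, K))] * 4, teacher_out, center)
```

In a real training loop the teacher outputs would be treated as constants (no gradient), and only the student term would be differentiated.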

Interview Questions

Q: Why does DINO use a stop-gradient on the teacher and EMA updates instead of simply sharing weights?
A: If the teacher simply shared the student's weights, both branches could drift together toward a trivial solution, such as a constant output for every input, that still minimizes the loss. EMA makes the teacher a slowly moving average of past student snapshots, similar to an ensemble, which provides more stable and higher-quality targets. Stop-gradient means the teacher receives no loss gradients; it changes only through the EMA update.

Q: What is the centering operation and why is it necessary?
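A: Centering subtracts a running mean c (an EMA of the teacher's batch-averaged outputs) from the teacher's output before the softmax. Without it, one output dimension can come to dominate, so the teacher emits the same near one-hot distribution regardless of input. Subtracting c removes that shared bias, so no single dimension wins for every input. Centering alone pushes the output toward a uniform distribution, which is why it is paired with sharpening (a low teacher temperature): the two effects balance each other.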
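
A small numerical experiment can make this answer concrete. The sketch below uses synthetic logits in which dimension 0 is biased upward for every sample; the bias of 5.0, the batch size, `K = 8`, and `tau_t = 0.04` are assumptions for the demonstration, and the batch mean stands in for the converged running center c.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entropy(p):
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

rng = np.random.default_rng(0)
K = 8
logits = rng.normal(size=(256, K))
logits[:, 0] += 5.0                  # one dimension dominates every sample

tau_t = 0.04
center = logits.mean(axis=0)         # stand-in for the running mean c

p_raw = softmax(logits / tau_t)                  # sharpened, not centered
p_centered = softmax((logits - center) / tau_t)  # centered + sharpened

# Fraction of samples whose most likely output is dimension 0.
frac_dim0_raw = (p_raw.argmax(axis=1) == 0).mean()
frac_dim0_centered = (p_centered.argmax(axis=1) == 0).mean()

# Effect of sharpening on centered logits: lower temperature, lower entropy.
h_soft = entropy(softmax(logits - center)).mean()   # temperature 1
h_sharp = entropy(p_centered).mean()                # temperature tau_t

print(f"dim-0 wins without centering: {frac_dim0_raw:.2f}")
print(f"dim-0 wins with centering:    {frac_dim0_centered:.2f}")
print(f"mean entropy at tau=1: {h_soft:.2f}, at tau={tau_t}: {h_sharp:.2f}")
```

Without centering, nearly every sample selects the biased dimension; with centering the winners spread across dimensions, and the low temperature keeps each individual distribution peaked instead of uniform.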

Q: DINO vs MAE vs SimCLR — when would you use each?
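A: SimCLR/MoCo: contrastive methods that require negative pairs; SimCLR works best with large batch sizes, while MoCo draws negatives from a queue instead; strong linear probe accuracy. DINO: strong k-NN accuracy on frozen features without fine-tuning; learns semantically meaningful patch-level features; no negative pairs needed. MAE: masked autoencoding; better fine-tuning accuracy; faster pre-training because the encoder processes only the visible patches; but weaker k-NN. Choose DINO when you need strong frozen features (k-NN or linear probing); MAE when you plan to fine-tune on the downstream task; SimCLR/MoCo when you want a contrastive baseline and can afford large batches or a negative queue.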
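
The k-NN evaluation referred to in this answer can be sketched as below: features from a frozen encoder are L2-normalised and each test point takes the majority label of its k most similar training points. The function name `knn_predict`, the plain majority vote (rather than a similarity-weighted vote), `k=5`, and the synthetic three-class embeddings are assumptions for this example.

```python
import numpy as np

def knn_predict(train_feats, train_labels, test_feats, k=5):
    """Majority-vote k-NN using cosine similarity on L2-normalised features."""
    train = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    sims = test @ train.T                        # (n_test, n_train) cosine sims
    nn_idx = np.argsort(-sims, axis=1)[:, :k]    # k most similar training points
    nn_labels = train_labels[nn_idx]             # (n_test, k)
    n_classes = int(train_labels.max()) + 1
    votes = np.stack([(nn_labels == c).sum(axis=1) for c in range(n_classes)],
                     axis=1)                     # (n_test, n_classes)
    return votes.argmax(axis=1)

# Synthetic "embeddings": three well-separated class means plus small noise.
rng = np.random.default_rng(0)
means = rng.normal(size=(3, 16)) * 3.0

def sample(n):
    labels = rng.integers(0, 3, size=n)
    feats = means[labels] + 0.3 * rng.normal(size=(n, 16))
    return feats, labels

train_feats, train_labels = sample(300)
test_feats, test_labels = sample(100)
pred = knn_predict(train_feats, train_labels, test_feats, k=5)
accuracy = (pred == test_labels).mean()
print(f"k-NN accuracy on synthetic embeddings: {accuracy:.2f}")
```

Because no parameters are trained, the accuracy reflects only how well the embedding space already groups samples by class, which is the property the k-NN comparison between methods is measuring.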