Lab 6-03: DINO-Style Self-Supervised Learning
Learning Objectives
- Understand the student-teacher self-supervised paradigm
- Implement EMA (Exponential Moving Average) teacher update
- Build multi-crop strategy: global + local crops
- Implement centering + sharpening (the stability tricks in DINO)
- Train on synthetic data and verify embeddings cluster by class
- Understand why DINO features have excellent k-NN classification performance
DINO Architecture
Image
│
├─── Global crop 1 ─→ Student + Teacher
├─── Global crop 2 ─→ Student + Teacher
├─── Local crop 1 ─→ Student only
└─── Local crop 2 ─→ Student only
│
│ Student output: softmax(z/τ_s) ← student, sharpened
│ Teacher output: softmax((z-c)/τ_t) ← teacher, centered+sharpened
│
▼ Cross-entropy loss: student ← teacher (teacher's output is the "label")
▼ Teacher receives NO gradient — updated via EMA
θ_teacher ← λ·θ_teacher + (1-λ)·θ_student
center c ← m·c + (1-m)·mean(teacher_output) ← centering prevents collapse
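The two update rules at the bottom of the diagram can be sketched in a few lines. This is a minimal NumPy illustration, not the lab's reference implementation: the function names `ema_update` and `update_center`, the dict-of-arrays parameter layout, and the default values of `lam` and `m` are assumptions made for the example.

```python
import numpy as np

def ema_update(teacher_params, student_params, lam=0.996):
    """Return new teacher parameters: lam * teacher + (1 - lam) * student.

    Both arguments are dicts mapping a parameter name to an array
    (a hypothetical layout chosen for this sketch).
    """
    return {
        name: lam * teacher_params[name] + (1.0 - lam) * student_params[name]
        for name in teacher_params
    }

def update_center(center, teacher_logits, m=0.9):
    """Move the center toward the batch mean of the teacher outputs.

    teacher_logits has shape (batch, K); center has shape (K,).
    """
    batch_mean = teacher_logits.mean(axis=0)
    return m * center + (1.0 - m) * batch_mean

# Tiny usage example with hand-checkable numbers.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
teacher = ema_update(teacher, student, lam=0.9)   # each entry becomes 0.1

center = np.zeros(4)
logits = np.array([[1.0, 2.0, 3.0, 4.0],
                   [3.0, 2.0, 1.0, 0.0]])         # batch mean is 2.0 everywhere
center = update_center(center, logits, m=0.9)     # each entry becomes 0.2
```

Neither function involves gradients: the teacher and the center are updated purely from the student's weights and the teacher's own outputs.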
Why These Tricks Are Necessary
| Trick | Problem Solved |
|---|---|
| EMA teacher | Provides stable targets; gradients only flow through student |
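| Centering | Prevents collapse onto one dominant output dimension (all outputs → same prototype); on its own it pushes the teacher toward a uniform output |
| Sharpening (low τ) | Prevents collapse to the uniform distribution; counterbalances centering |
| Multi-crop | More views per image at low extra cost, since local crops are small; the student must match local views to the teacher's global views |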
| Stop-gradient on teacher | Teacher is never directly optimized — momentum update only |
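As a rough illustration of the multi-crop row, the sketch below draws global and local crops from one image array. It assumes square crops and omits the resizing and colour augmentations a real pipeline would apply; the names `random_crop` and `multi_crop` and the area-fraction ranges are choices made for this example.

```python
import numpy as np

def random_crop(image, scale_range, rng):
    """Crop a random square whose area is a fraction of the image area.

    The fraction is drawn uniformly from scale_range = (low, high).
    """
    h, w = image.shape[:2]
    area_frac = rng.uniform(*scale_range)
    side = int(round(np.sqrt(area_frac * h * w)))
    side = max(1, min(side, h, w))            # keep the crop inside the image
    top = rng.integers(0, h - side + 1)       # upper bound is exclusive
    left = rng.integers(0, w - side + 1)
    return image[top:top + side, left:left + side]

def multi_crop(image, n_global=2, n_local=4, rng=None):
    """Return (global_crops, local_crops) as two lists of arrays."""
    rng = np.random.default_rng() if rng is None else rng
    # Assumed ranges: global crops cover a large part of the image,
    # local crops only a small part.
    global_crops = [random_crop(image, (0.4, 1.0), rng) for _ in range(n_global)]
    local_crops = [random_crop(image, (0.05, 0.4), rng) for _ in range(n_local)]
    return global_crops, local_crops

image = np.zeros((64, 64, 3))
global_crops, local_crops = multi_crop(image, rng=np.random.default_rng(0))
```

Because the local crops are much smaller than the global ones, adding several of them increases the number of views without a proportional increase in compute.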
DINO Loss
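$$\mathcal{L} = -\sum_{x \in \{x^g_1, x^g_2\}} \; \sum_{\substack{x' \in V \\ x' \neq x}} p_t(x) \cdot \log p_s(x')$$
where:
- $V$ is the set of all crops (global and local); the teacher sees only the global crops $x^g_1, x^g_2$, and pairs in which student and teacher see the same crop are skipped
- $p_t(x) = \text{softmax}((z_t(x) - c) / \tau_t)$ (centered + sharpened)
- $p_s(x') = \text{softmax}(z_s(x') / \tau_s)$ (sharpened, no centering)
- the product $p_t \cdot \log p_s$ is a dot product over the output dimensions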
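The loss can be sketched directly from these definitions. The NumPy version below is an illustration under a few assumptions: the student logits are passed as a list with the global crops first, student and teacher pairs that see the same crop are skipped, and the result is averaged over pairs and over the batch rather than summed. The function name `dino_loss` and the temperature defaults are choices made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def log_softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def dino_loss(student_logits, teacher_logits, center, tau_s=0.1, tau_t=0.04):
    """Cross-entropy between teacher and student distributions.

    student_logits: list of (B, K) arrays, one per crop, global crops first.
    teacher_logits: list of (B, K) arrays, one per global crop, same order.
    center:         (K,) array subtracted from the teacher logits.
    """
    log_p_s = [log_softmax(z / tau_s) for z in student_logits]
    total, n_terms = 0.0, 0
    for t_idx, z_t in enumerate(teacher_logits):
        p_t = softmax((z_t - center) / tau_t)  # centered + sharpened target
        for s_idx, lp in enumerate(log_p_s):
            if s_idx == t_idx:
                continue                       # same crop on both sides: skip
            total += -(p_t * lp).sum(axis=-1).mean()
            n_terms += 1
    return total / n_terms

rng = np.random.default_rng(0)
crops = [rng.normal(size=(8, 16)) for _ in range(4)]   # 2 global + 2 local
loss = dino_loss(crops, crops[:2], center=np.zeros(16))
```

In a real training loop the teacher term would be treated as a constant (no gradient), and only the student's log-probabilities would be differentiated.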
Interview Questions
Q: Why does DINO use a stop-gradient on the teacher and EMA updates instead of simply sharing weights?
A: If the teacher simply shared the student's weights, a collapsed solution (every input mapped to the same output) would trivially satisfy the loss. The EMA teacher is a slowly moving "ensemble" of past student snapshots, which provides more stable and higher-quality targets. The stop-gradient keeps the loss from updating the teacher directly, so it evolves only through the EMA.
Q: What is the centering operation and why is it necessary?
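A: Centering subtracts a running mean c (an EMA of the batch-mean teacher output) from the teacher's output before the softmax. Without it, a single output dimension can come to dominate, so that every input maps to the same output regardless of content. Subtracting c removes the offset shared across the batch, so no dimension wins simply because it is large on average. On its own, centering pushes the teacher toward a uniform output, which is why it is paired with sharpening (a low τ_t).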
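A small synthetic experiment can make the effect of centering concrete. In this sketch the logits are random, one dimension is artificially inflated for every sample, and the batch mean stands in for the running center c; all of these are assumptions made for the demonstration, not properties of a trained model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(256, 8))
logits[:, 3] += 5.0            # dimension 3 is large for every input

tau_t = 0.04
center = logits.mean(axis=0)   # stand-in for the running mean c

p_raw = softmax(logits / tau_t)
p_centered = softmax((logits - center) / tau_t)

# Fraction of samples whose most likely output is the inflated dimension.
frac_raw = np.mean(p_raw.argmax(axis=1) == 3)
frac_centered = np.mean(p_centered.argmax(axis=1) == 3)
print(f"dimension 3 wins without centering: {frac_raw:.2f}")
print(f"dimension 3 wins with centering:    {frac_centered:.2f}")
```

Without centering, nearly every sample should pick the inflated dimension; after centering, the winning dimension depends on the sample, which is the behaviour the teacher needs in order to provide informative targets.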
Q: DINO vs MAE vs SimCLR — when would you use each?
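A: SimCLR/MoCo: contrastive methods that require negative pairs (SimCLR relies on large batches, MoCo on a queue of negatives); strong linear probe accuracy. DINO: strong k-NN accuracy without fine-tuning; learns semantically meaningful patch-level features; no negative pairs needed. MAE: masked autoencoding; better fine-tuning accuracy; cheaper pre-training because the encoder sees only the unmasked patches; but weaker k-NN accuracy. Choose DINO when the features will be used frozen (k-NN or linear probing); choose MAE when you will fine-tune the whole model on the downstream task.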
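Since k-NN accuracy on frozen features comes up repeatedly, here is a minimal sketch of that evaluation. The embeddings are synthetic Gaussian clusters standing in for encoder outputs, and `knn_predict` with cosine similarity and majority voting is one simple variant chosen for illustration, not necessarily the protocol used in the lab.

```python
import numpy as np

def knn_predict(train_emb, train_labels, test_emb, n_classes, k=5):
    """Classify test embeddings by majority vote among the k nearest
    training embeddings under cosine similarity."""
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
    sims = test @ train.T                            # (n_test, n_train)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]        # k most similar per row
    nn_labels = train_labels[nn_idx]                 # (n_test, k)
    votes = np.stack(
        [(nn_labels == c).sum(axis=1) for c in range(n_classes)], axis=1
    )
    return votes.argmax(axis=1)

# Synthetic "embeddings": one well-separated Gaussian cluster per class.
rng = np.random.default_rng(0)
n_classes, dim = 3, 16
class_means = rng.normal(size=(n_classes, dim)) * 5.0
train_labels = rng.integers(0, n_classes, size=300)
test_labels = rng.integers(0, n_classes, size=60)
train_emb = class_means[train_labels] + rng.normal(size=(300, dim))
test_emb = class_means[test_labels] + rng.normal(size=(60, dim))

pred = knn_predict(train_emb, train_labels, test_emb, n_classes, k=5)
accuracy = (pred == test_labels).mean()
print(f"k-NN accuracy on synthetic clusters: {accuracy:.2f}")
```

The same function can be applied to embeddings from a frozen encoder: high k-NN accuracy then indicates that samples of the same class already sit close together in feature space, with no fine-tuning involved.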