Lab 6-02: CLIP-Style Contrastive Learning
Learning Objectives
- Understand contrastive learning and why it aligns image and text embeddings
- Implement InfoNCE (NT-Xent) loss from scratch
- Build an image encoder + text encoder and train jointly
- Demonstrate zero-shot classification
- Understand the temperature parameter τ and its effect on the loss
CLIP Architecture
Images → Image Encoder (ViT or CNN)  → L2-normalized embedding zᴵ ∈ ℝᵈ
Texts  → Text Encoder (Transformer)  → L2-normalized embedding zᵀ ∈ ℝᵈ

          ┌───────────────────────┐
  Compute │ Similarity matrix S   │
          │ S = zᴵ · (zᵀ)ᵀ / τ    │
          └───────────────────────┘
                      ↓
               InfoNCE Loss:
                 • Row-wise CE (image→text)
                 • Col-wise CE (text→image)
                 • Average both directions
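The similarity step in the diagram can be sketched in a few lines of NumPy. This is a minimal illustration rather than the lab's reference implementation: the random arrays stand in for encoder outputs, and the names `l2_normalize` and `similarity_matrix` are hypothetical helpers introduced here.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit Euclidean length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def similarity_matrix(image_feats, text_feats, tau=0.07):
    """Return S with S[i, j] = cos(image i, text j) / tau."""
    z_img = l2_normalize(image_feats)   # (N, d)
    z_txt = l2_normalize(text_feats)    # (N, d)
    return z_img @ z_txt.T / tau        # (N, N)

rng = np.random.default_rng(0)
image_feats = rng.normal(size=(4, 8))   # stand-in for image encoder outputs (assumption)
text_feats = rng.normal(size=(4, 8))    # stand-in for text encoder outputs (assumption)

S = similarity_matrix(image_feats, text_feats)
print(S.shape)                          # (4, 4)
```

Because both embeddings are unit length, every entry of S is a cosine similarity divided by τ, so its magnitude is bounded by 1/τ.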
InfoNCE Loss
For a batch of N image-text pairs:
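$$\mathcal{L}_{\text{img}\to\text{txt}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(S_{ii})}{\sum_{j=1}^{N}\exp(S_{ij})}$$
$$\mathcal{L}_{\text{txt}\to\text{img}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(S_{ii})}{\sum_{j=1}^{N}\exp(S_{ji})}$$
$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{\text{img}\to\text{txt}} + \mathcal{L}_{\text{txt}\to\text{img}}\right)$$
Here S is the temperature-scaled similarity matrix defined above, so the factor 1/τ is already inside S_ij and does not appear again. The diagonal entries S_ii score the matched (positive) pairs; every off-diagonal entry S_ij with i ≠ j scores a mismatched pair and serves as a negative.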
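The loss can be written from scratch as two cross-entropies over the same matrix, one along rows and one along columns. The sketch below uses NumPy and assumes its input is the (N, N) matrix of similarities already divided by τ; `log_softmax` and `info_nce` are illustrative names, not part of any library.

```python
import numpy as np

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def info_nce(S):
    """Symmetric InfoNCE loss for an (N, N) matrix of temperature-scaled similarities."""
    img_to_txt = -np.diag(log_softmax(S, axis=1)).mean()  # row i: image i vs. all texts
    txt_to_img = -np.diag(log_softmax(S, axis=0)).mean()  # column j: text j vs. all images
    return 0.5 * (img_to_txt + txt_to_img)

N = 4
uninformative = np.zeros((N, N))   # every pair equally similar
aligned = 20.0 * np.eye(N)         # positives score far above negatives

print(info_nce(uninformative))     # equals log(N), about 1.386 for N = 4
print(info_nce(aligned))           # very close to 0
```

The two sanity checks bracket the loss: a model that cannot tell pairs apart scores log N, and a model that separates positives from negatives by a wide margin approaches 0.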
Zero-Shot Classification
# Build one text prompt per class and embed it.
# encode_text / encode_image are assumed to return L2-normalized torch tensors.
prompts = [f"a photo of a {cls}" for cls in class_names]
text_embs = encode_text(prompts)  # (C, D)
# Embed the images, then pick the class whose prompt embedding is most similar.
image_embs = encode_image(images)  # (N, D)
similarities = image_embs @ text_embs.T  # (N, C) cosine similarities
predictions = similarities.argmax(dim=-1)  # (N,) class indices; no fine-tuning needed
Interview Questions
Q: Why is temperature τ critical in InfoNCE loss?
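A: τ divides the similarities before the softmax, so it controls how sharply the loss separates the positive from the negatives. Low τ → peaked distribution: the hardest negatives dominate the gradient and the loss focuses on the most confusable examples. High τ → flat distribution: all negatives are weighted almost equally, so the gradient says little about which ones matter. CLIP learns τ rather than fixing it (initialized to 0.07, with the logit scale 1/τ clipped at 100). If τ is too low, the logits become very large and training can become unstable; if τ is too high, convergence is slow.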
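A small numerical experiment makes the effect of τ concrete. The similarities below are made-up values for one image against four captions, and `match_probs` is a hypothetical helper; only the softmax arithmetic is being illustrated.

```python
import numpy as np

def match_probs(cos_sims, tau):
    """Softmax over cosine similarities divided by temperature tau."""
    logits = np.asarray(cos_sims, dtype=float) / tau
    exp = np.exp(logits - logits.max())   # subtract max for numerical stability
    return exp / exp.sum()

# Hypothetical similarities of one image to four captions:
# index 0 = positive, index 1 = hard negative, indices 2-3 = easy negatives.
cos_sims = [0.9, 0.8, 0.2, 0.1]

for tau in (0.01, 0.07, 1.0):
    print(f"tau={tau:<4} probs={np.round(match_probs(cos_sims, tau), 3)}")
```

With these numbers, τ = 1.0 leaves every caption with a probability between roughly 0.15 and 0.35, while τ = 0.07 puts about 0.8 on the positive and nearly all of the remainder on the hard negative, which is why low temperatures concentrate the gradient on confusable examples.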
Q: What is the alignment-uniformity framework for understanding contrastive loss?
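A: Good embeddings need two properties: (1) Alignment: the two embeddings of a positive pair should be close (small distance). (2) Uniformity: embeddings should be spread evenly over the unit hypersphere, which rules out representation collapse (all inputs mapping to nearly the same point). InfoNCE optimizes both: the diagonal (numerator) terms pull positives together → alignment, and the off-diagonal (denominator) terms push non-matching pairs apart → uniformity.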
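Both properties can be measured directly on a set of embeddings. The sketch below follows the metrics commonly associated with this framework (Wang and Isola, 2020): mean squared distance between positives for alignment, and the log of the mean Gaussian potential over all pairs for uniformity. The data are synthetic and the function names are illustrative.

```python
import numpy as np

def alignment(z_a, z_b):
    """Mean squared distance between positive pairs (lower is better)."""
    return np.mean(np.sum((z_a - z_b) ** 2, axis=1))

def uniformity(z, t=2.0):
    """Log of the mean Gaussian potential over distinct pairs (lower is more uniform)."""
    sq_dists = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    i, j = np.triu_indices(len(z), k=1)   # each unordered pair once
    return np.log(np.mean(np.exp(-t * sq_dists[i, j])))

def normalize(x):
    """Project rows onto the unit hypersphere."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
spread = normalize(rng.normal(size=(256, 16)))   # roughly uniform on the sphere
collapsed = normalize(np.ones((256, 16)) + 0.01 * rng.normal(size=(256, 16)))  # nearly one point
noisy_views = normalize(spread + 0.05 * rng.normal(size=spread.shape))  # synthetic positives

print("uniformity (spread):   ", uniformity(spread))
print("uniformity (collapsed):", uniformity(collapsed))
print("alignment (spread vs. noisy views):", alignment(spread, noisy_views))
```

A collapsed encoder scores a uniformity near 0 (the worst possible value), while well-spread embeddings score clearly negative, so the metric flags collapse even when alignment looks perfect.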
Q: CLIP uses 400M image-text pairs. How can you apply CLIP with limited data?
A: (1) Zero-shot baseline: use pre-trained CLIP as is, with no training at all. (2) Linear probe: freeze CLIP and train a linear classifier on top of its image features. (3) Prompt tuning (CoOp): learn continuous prompt embeddings while keeping the vision and text encoders frozen. (4) CLIP-Adapter: add lightweight adapter layers on top of the frozen encoders. Option (1) needs no labeled data, and options (2)-(4) train only a small number of parameters, so they need far less data than full fine-tuning.
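Option (2), the linear probe, amounts to softmax regression on fixed features. The sketch below is a toy version under loud assumptions: the features are synthetic stand-ins for frozen CLIP outputs (no real model is loaded), and `train_linear_probe`, `predict`, and `fake_frozen_features` are hypothetical names introduced for illustration.

```python
import numpy as np

def train_linear_probe(feats, labels, num_classes, lr=0.5, steps=200):
    """Fit softmax regression on fixed features with plain gradient descent."""
    n, d = feats.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        logits = feats @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n   # gradient of mean cross-entropy w.r.t. logits
        W -= lr * feats.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(feats, W, b):
    """Return the highest-scoring class index for each row of feats."""
    return (feats @ W + b).argmax(axis=1)

rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 32))   # one synthetic center per class

def fake_frozen_features(labels):
    """Synthetic stand-in for frozen encoder outputs: noisy, L2-normalized class centers."""
    x = centers[labels] + 0.3 * rng.normal(size=(len(labels), 32))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

train_labels = np.repeat(np.arange(3), 20)   # 20 labeled examples per class
test_labels = np.repeat(np.arange(3), 10)
W, b = train_linear_probe(fake_frozen_features(train_labels), train_labels, num_classes=3)
test_acc = np.mean(predict(fake_frozen_features(test_labels), W, b) == test_labels)
print(f"held-out accuracy: {test_acc:.2f}")
```

Only W and b are trained here (32 × 3 weights plus 3 biases), which is the reason a probe can work with a few dozen labeled examples when the frozen features already separate the classes.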