Lab 6-02: CLIP-Style Contrastive Learning

Learning Objectives

  • Understand contrastive learning and why it aligns image and text embeddings in a shared space
  • Implement the symmetric InfoNCE loss (the NT-Xent family) from scratch
  • Build an image encoder and a text encoder and train them jointly
  • Demonstrate zero-shot classification with text prompts
  • Understand the temperature parameter τ and its effect on training

CLIP Architecture

Images → Image Encoder (ViT or CNN) → L2-normalized embedding zᴵ ∈ ℝᵈ
Texts  → Text Encoder  (Transformer) → L2-normalized embedding zᵀ ∈ ℝᵈ

                                    ┌──────────────────────┐
                           Compute  │ Similarity matrix S  │
                           S = zᴵ · (zᵀ)ᵀ / τ             │
                                    └──────────────────────┘
                                           ↓
                                    InfoNCE Loss:
                               • Row-wise CE (image→text)
                               • Col-wise CE (text→image)
                               • Average both directions
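
A minimal PyTorch sketch of the two-tower layout above. The class name `TinyCLIP`, the toy encoders (an MLP over flattened pixels and a mean-pooled embedding table), and all dimensions are illustrative assumptions standing in for the ViT/CNN and Transformer encoders. Only the L2 normalization and the S = zᴵ · (zᵀ)ᵀ / τ step follow the diagram.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyCLIP(nn.Module):
    """Toy two-tower model; the encoders are placeholders, not a real ViT/Transformer."""

    def __init__(self, image_dim=3 * 32 * 32, vocab_size=1000, embed_dim=64):
        super().__init__()
        # Placeholder image encoder: MLP over flattened pixels (assumption).
        self.image_encoder = nn.Sequential(
            nn.Flatten(), nn.Linear(image_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim)
        )
        # Placeholder text encoder: mean-pooled token embeddings + projection (assumption).
        self.token_embedding = nn.Embedding(vocab_size, 128)
        self.text_proj = nn.Linear(128, embed_dim)
        # Learnable log(1/τ), initialized so that τ = 0.07.
        self.logit_scale = nn.Parameter(torch.tensor(math.log(1 / 0.07)))

    def encode_image(self, images):
        return F.normalize(self.image_encoder(images), dim=-1)

    def encode_text(self, token_ids):
        pooled = self.token_embedding(token_ids).mean(dim=1)
        return F.normalize(self.text_proj(pooled), dim=-1)

    def forward(self, images, token_ids):
        z_img = self.encode_image(images)    # (N, D), unit norm
        z_txt = self.encode_text(token_ids)  # (N, D), unit norm
        # S = z_img @ z_txt.T / τ, written as multiplication by exp(log(1/τ)).
        return self.logit_scale.exp() * (z_img @ z_txt.T)  # (N, N)
```

Calling `TinyCLIP()(torch.randn(4, 3, 32, 32), torch.randint(0, 1000, (4, 8)))` returns a 4 × 4 matrix whose diagonal holds the scores of the matched pairs. Storing log(1/τ) rather than τ keeps the temperature positive without an explicit constraint.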

InfoNCE Loss

For a batch of N image-text pairs:

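$$\mathcal{L}_{\text{img}\to\text{txt}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(S_{ii})}{\sum_{j=1}^{N}\exp(S_{ij})}$$

$$\mathcal{L}_{\text{txt}\to\text{img}} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp(S_{ii})}{\sum_{j=1}^{N}\exp(S_{ji})}$$

$$\mathcal{L} = \frac{1}{2}\left(\mathcal{L}_{\text{img}\to\text{txt}} + \mathcal{L}_{\text{txt}\to\text{img}}\right)$$

S already includes the 1/τ scaling (S = zᴵ · (zᵀ)ᵀ / τ), so the temperature is applied exactly once and does not appear again inside the exponentials.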

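The diagonal entries S_ii score the matched (positive) pairs; every off-diagonal entry S_ij with i ≠ j scores a mismatched pair and acts as a negative. Each image therefore has 1 positive and N − 1 in-batch negatives, and likewise for each text.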
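
The loss can be sketched in a few lines of PyTorch. The function name `info_nce_loss` and the fixed default `temperature=0.07` are assumptions for illustration; in a full model τ would typically be a learned parameter. Row-wise cross-entropy gives the image→text term, cross-entropy on the transposed matrix gives the text→image term, and the targets are simply the diagonal indices.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(z_img, z_txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of N matched (image, text) embeddings."""
    z_img = F.normalize(z_img, dim=-1)
    z_txt = F.normalize(z_txt, dim=-1)
    logits = z_img @ z_txt.T / temperature  # S, shape (N, N); τ applied once
    # Positive pairs sit on the diagonal, so row i's target class is i.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)    # softmax over each row
    loss_t2i = F.cross_entropy(logits.T, targets)  # softmax over each column
    return 0.5 * (loss_i2t + loss_t2i)
```

Two sanity checks are useful when testing an implementation like this: if every embedding in the batch is identical, the loss equals log N, and if matched pairs are identical while mismatched pairs are orthogonal, the loss is close to zero.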

Zero-Shot Classification

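import torch

# Build one text prompt per class from a simple template.
prompts = [f"a photo of a {cls}" for cls in class_names]

# encode_text / encode_image are the trained encoders; both are assumed to
# return L2-normalized embeddings, so the dot product is cosine similarity.
with torch.no_grad():
    text_embs = encode_text(prompts)     # (C, D)
    image_embs = encode_image(images)    # (N, D)

# Assign each image to the class whose prompt embedding is most similar.
similarities = image_embs @ text_embs.T      # (N, C)
predictions = similarities.argmax(dim=-1)    # (N,) class indices, no fine-tuning needed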

Interview Questions

Q: Why is temperature τ critical in InfoNCE loss?
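A: τ divides the similarities before the softmax. Low τ gives a peaked distribution: the hardest negatives receive most of the probability mass and therefore most of the gradient, so the loss focuses on confusing examples. High τ gives a flat distribution that weights all negatives almost equally, which provides a weaker training signal. CLIP learns τ rather than fixing it: it is parameterized as a log-scale logit multiplier initialized to the equivalent of τ = 0.07 and clipped so the logits are never scaled by more than 100. If τ is too low, training can become unstable; if it is too high, convergence is slow.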
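
A small standard-library sketch of this effect. The similarity values are made up for illustration (one positive, one hard negative, two easy negatives), and the helper name `softmax` is an assumption. Because the gradient of the cross-entropy with respect to a negative's logit is proportional to that negative's softmax probability, the printed probabilities show how much each negative contributes.

```python
import math


def softmax(scores, temperature):
    """Softmax of scores / temperature, with max-subtraction for stability."""
    scaled = [s / temperature for s in scores]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities: positive, hard negative, two easy negatives.
sims = [0.9, 0.8, 0.1, -0.2]

for tau in (0.07, 1.0):
    probs = softmax(sims, tau)
    print(f"tau={tau}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At τ = 0.07 the positive and the hard negative share almost all of the mass (roughly 0.81 and 0.19) while the easy negatives are effectively ignored; at τ = 1.0 the four probabilities are much closer together (roughly 0.37, 0.34, 0.17, 0.12).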

Q: What is the alignment-uniformity framework for understanding contrastive loss?
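A: The framework (Wang and Isola, 2020) identifies two properties of good embeddings: (1) Alignment: positive pairs should be close together (small distance). (2) Uniformity: embeddings should be spread evenly over the unit hypersphere, which prevents representation collapse. InfoNCE optimizes both: the diagonal (positive) terms pull matched pairs together, giving alignment, and the off-diagonal (negative) terms push mismatched pairs apart, giving uniformity.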
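
Both properties can be measured directly on a batch of embeddings. The sketch below follows the commonly used definitions of the two metrics (mean squared distance between positives, and the log of the mean Gaussian potential over all pairs); the function names and the defaults `alpha=2` and `t=2` are assumptions chosen for illustration. Lower is better for both.

```python
import torch


def align_loss(x, y, alpha=2):
    """Mean distance^alpha between positive pairs (x[i], y[i]); 0 means perfectly aligned."""
    return (x - y).norm(p=2, dim=1).pow(alpha).mean()


def uniform_loss(x, t=2):
    """Log of the mean Gaussian potential over all pairs in x; more negative means more spread out."""
    return torch.pdist(x, p=2).pow(2).mul(-t).exp().mean().log()
```

Inputs are expected to be L2-normalized with shape (N, D). A fully collapsed batch, where every embedding is the same point, gives `uniform_loss` equal to 0, its worst value, whereas mutually orthogonal unit vectors give −4 with t = 2.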

Q: CLIP uses 400M image-text pairs. How can you apply CLIP with limited data?
A: (1) Zero-shot: use pre-trained CLIP as-is with text prompts; this needs no labeled data and serves as the baseline. (2) Linear probe: freeze CLIP and train a linear classifier on its image features. (3) Prompt tuning (CoOp): learn continuous prompt embeddings while keeping the vision and text encoders frozen. (4) CLIP-Adapter: add lightweight adapter layers on top of the frozen encoders. Options (2) to (4) train only a small number of parameters, so they need far less data than full fine-tuning.
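
As an illustration of option (2), here is a minimal linear-probe sketch. It assumes the image features have already been extracted from a frozen encoder (for example under `torch.no_grad()`); the function name `train_linear_probe`, the Adam optimizer, and the hyperparameters are illustrative choices, not a prescribed recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def train_linear_probe(features, labels, num_classes, epochs=200, lr=0.05):
    """Fit a linear classifier on fixed features; the encoder is never updated."""
    features = features.detach()  # treat encoder outputs as constants
    probe = nn.Linear(features.size(1), num_classes)
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = F.cross_entropy(probe(features), labels)  # full-batch step
        loss.backward()
        optimizer.step()
    return probe
```

Only the `features.size(1) × num_classes` weights and the biases are trained, which is why a probe like this can work with a few labeled examples per class.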