CV Engineer Interview Prep — Concepts Cheatsheet
Quick reference for every topic that comes up in CV engineer interviews. Use this the week before your interview for rapid review.
1. Convolutional Neural Networks
Core Operations
| Operation | Formula | Purpose |
|---|---|---|
| Convolution | $y_{i,j} = \sum_{m,n} x_{i+m,j+n} \cdot k_{m,n}$ | Feature extraction |
| Max pooling | $y = \max_{w \times w \text{ region}} x$ | Spatial invariance, dimensionality reduction |
| Global avg pool | $y = \frac{1}{HW} \sum_{i,j} x_{i,j}$ | Replace FC layers, parameter reduction |
| Depthwise sep conv | DW conv + pointwise conv | MobileNet: 8-9× fewer FLOPs than regular conv |
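
The 8-9× figure in the last row can be checked by counting multiply-accumulates. The sketch below assumes stride 1 and "same" padding; the layer shape (56×56 map, 128→256 channels) is a hypothetical example, not a layer from any specific network.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a standard k x k conv (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel) plus a 1 x 1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 56 x 56 feature map, 128 -> 256 channels, 3 x 3 kernel.
standard = conv_macs(56, 56, 128, 256, 3)
separable = depthwise_separable_macs(56, 56, 128, 256, 3)
ratio = standard / separable  # equals 1 / (1/c_out + 1/k^2)
print(f"standard / separable = {ratio:.2f}x")
```

The ratio simplifies to $1 / (1/C_{out} + 1/k^2)$, so for 3×3 kernels it approaches 9× as the output channel count grows.
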
Receptive Field
$$RF_k = RF_{k-1} + (f_k - 1) \cdot \prod_{i<k} s_i$$
where $f_k$ = kernel size at layer $k$, $s_i$ = stride at layer $i$.
Dilation: enlarges the receptive field without reducing spatial resolution or adding parameters. A dilated conv with rate $d$ leaves gaps of $d-1$ between kernel elements, so a 3×3 kernel covers an effective $(2d+1) \times (2d+1)$ window.
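
A minimal sketch of the recurrence above, extended to handle dilation by replacing $f_k$ with the effective kernel size $d(f_k - 1) + 1$. The function name and the `(kernel, stride, dilation)` tuple layout are choices made for this example.

```python
def receptive_field(layers):
    """Receptive field of a stack of conv/pool layers.

    layers: list of (kernel_size, stride, dilation) tuples, ordered input to output.
    """
    rf = 1
    jump = 1  # product of the strides of all earlier layers
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


three_3x3 = receptive_field([(3, 1, 1)] * 3)           # stacked 3x3 convs
stem = receptive_field([(7, 2, 1), (3, 2, 1)])         # 7x7 stride-2 conv, then 3x3 stride-2 pool
dilated = receptive_field([(3, 1, 2)])                 # single 3x3 conv with rate d = 2
print(three_3x3, stem, dilated)
```

Three stacked 3×3 convs see the same 7×7 window as one 7×7 conv, which is the usual argument for small kernels.
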
2. Optimization
Gradient Descent Variants
| Method | Update | When to use |
|---|---|---|
| SGD | $\theta \leftarrow \theta - \alpha g$ | Large datasets, sparse updates |
| SGD+Momentum | $v \leftarrow \beta v + g$; $\theta \leftarrow \theta - \alpha v$ | Faster convergence, escapes local minima |
| RMSprop | $v \leftarrow \beta v + (1-\beta)g^2$; $\theta \leftarrow \theta - \alpha g/\sqrt{v+\epsilon}$ | Non-stationary, RNNs |
| Adam | 1st + 2nd moment with bias correction | Default choice for most tasks |
| AdamW | Adam + decoupled weight decay | Transformers, large models |
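
To make the "1st + 2nd moment with bias correction" row concrete, here is a scalar sketch of one Adam step and of the decoupled weight decay used by AdamW. The default hyperparameters are the commonly used ones; the function names are made up for this example.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter. t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # 1st moment (mean of gradients)
    v = beta2 * v + (1 - beta2) * grad * grad   # 2nd moment (mean of squared gradients)
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


def adamw_step(theta, grad, m, v, t, lr=1e-3, weight_decay=0.01, **kwargs):
    """AdamW: shrink the weight directly, then apply the plain Adam update."""
    theta = theta * (1 - lr * weight_decay)     # decay is not mixed into the gradient
    return adam_step(theta, grad, m, v, t, lr=lr, **kwargs)


big, _, _ = adam_step(1.0, 5.0, 0.0, 0.0, t=1)
small, _, _ = adam_step(1.0, 0.01, 0.0, 0.0, t=1)
print(big, small)
```

Because of the bias correction, the first step has magnitude close to `lr` whatever the gradient scale, which is why both calls above land at nearly the same value.
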
Learning Rate Schedules
- Warmup + cosine decay: standard for transformers. Prevents early instability.
- OneCycleLR: fast training, often best for CNNs.
- Linear scaling rule: multiply LR by $k$ when batch size is $k \times$ baseline.
- Gradient clipping: clip norm to 1.0 — prevents exploding gradients in RNNs/transformers.
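
The schedule and clipping bullets can be sketched in a few lines of plain Python. This assumes linear warmup, a single cosine half-period, and clipping by the global L2 norm; the function names and the small `1e-6` stabilizer are choices made for this example.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads]


schedule = [warmup_cosine_lr(s, 1000, 100, 1e-3) for s in (0, 100, 550, 1000)]
clipped = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(schedule, clipped)
```
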
3. Regularization
| Technique | How it works | When to use |
|---|---|---|
| L2 (weight decay) | Penalize $\lVert\theta\rVert_2^2$, shrinking weights toward 0 | Default for most models |
| Dropout | Zero activations with prob $p$, scale by $1/(1-p)$ | FC layers, transformers |
| Batch Norm | Normalize activations within batch | CNNs, stabilizes training |
| Data augmentation | Artificially expand training set | All tasks |
| Label smoothing | Replace hard 0/1 with $\epsilon/(C-1)$ / $1-\epsilon$ | Classification, large datasets |
| Mixup | Blend two images: $\tilde{x} = \lambda x_i + (1-\lambda)x_j$ | Classification, detection |
| CutMix | Cut a patch from one image and paste it into another; mix labels by patch area | Classification, localization |
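
A small sketch of two rows from the table, using plain Python lists in place of tensors. Label smoothing follows the $\epsilon/(C-1)$ convention given above. For Mixup, the labels are blended with the same $\lambda$ as the inputs, as in the standard formulation; drawing $\lambda$ from a Beta distribution is omitted here and the value is passed in directly.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Soft target: 1 - eps on the true class, eps / (C - 1) on every other class."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if c == target else off_value for c in range(num_classes)]


def mixup(x_i, x_j, y_i, y_j, lam):
    """Blend two flattened inputs and their (one-hot or soft) labels with weight lam."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


soft = smooth_labels(target=2, num_classes=5, eps=0.1)
mixed_x, mixed_y = mixup([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam=0.7)
print(soft, mixed_x, mixed_y)
```
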
4. Architectures
ResNet — Skip Connections
$$\mathcal{F}(x) = H(x) - x \rightarrow \text{learn residual, not full mapping}$$
Key insight: gradient can flow directly through skip connection. Solves vanishing gradient for 100+ layer networks.
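Bottleneck block: 1×1→3×3→1×1 convolutions. The first 1×1 reduces channels (typically by 4×) before the 3×3 and the last 1×1 expands them back, so the expensive 3×3 runs at reduced width. For example, with 256 input channels and 64 bottleneck channels the block needs about 70k multiply-accumulates per spatial position, versus about 1.18M for two 3×3 convs at 256 channels.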
EfficientNet — Compound Scaling
Scale depth $d = \alpha^\phi$, width $w = \beta^\phi$, resolution $r = \gamma^\phi$ such that $\alpha\beta^2\gamma^2 \approx 2$ (FLOPs double per step).
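
As a quick check of the constraint, the sketch below plugs in $\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$, the coefficients usually quoted from the EfficientNet paper for the B0 baseline (treat them as an assumption of this example, since the text above does not list them).

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # assumed depth / width / resolution bases


def compound_scale(phi):
    """Return (depth, width, resolution, flops) multipliers for compound coefficient phi."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these values $\alpha\beta^2\gamma^2 \approx 1.92$, so each increment of $\phi$ roughly doubles the FLOPs.
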
Vision Transformer (ViT)
- Split image into $P \times P$ patches (typically 16×16)
- Linear projection → sequence of tokens
- Add [CLS] token + positional embeddings
- Stack Transformer encoder layers
- Classify using [CLS] output
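Limitation: requires more data than CNNs because it has weaker inductive biases (no built-in locality or translation equivariance). Pretrain on a large dataset such as JFT-300M, or use DeiT-style data augmentation and distillation.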
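
The patching step determines the sequence length the encoder sees. A minimal sketch, assuming a square RGB image whose side is divisible by the patch size (224 and 16 are the common defaults; the function name is made up for this example):

```python
def vit_sequence_shape(image_size=224, patch_size=16, channels=3):
    """Return (num_patches, patch_dim, seq_len) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    patch_dim = patch_size * patch_size * channels  # flattened pixels fed to the linear projection
    seq_len = num_patches + 1                       # +1 for the [CLS] token
    return num_patches, patch_dim, seq_len


num_patches, patch_dim, seq_len = vit_sequence_shape()
print(num_patches, patch_dim, seq_len)
```

Self-attention cost grows quadratically with `seq_len`, so halving the patch size quadruples the token count and raises attention cost by roughly 16×.
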
5. Object Detection
Single-Stage vs Two-Stage
| | Single-Stage (YOLO, SSD) | Two-Stage (Faster R-CNN) |
|---|---|---|
| Speed | Fast (30-100+ FPS) | Slow (5-15 FPS) |
| Accuracy | Good | Better (especially small objects) |
| Anchors | Yes (YOLO v3-v5) or no (v8) | Yes (RPN) |
| Use case | Real-time | High-accuracy offline |
Key Metrics
- mAP@0.5: IoU threshold = 0.5 for TP/FP determination
- mAP@0.5:0.95: COCO metric, average over [0.5, 0.55, ..., 0.95]
- AP50 > 0.7: a rough rule of thumb for production readiness; the acceptable threshold depends on the application and the cost of errors
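
All of these metrics rest on IoU between a predicted and a ground-truth box. A minimal sketch, assuming boxes in `(x1, y1, x2, y2)` corner format; the helper names are made up for this example, and a full mAP computation (score sorting, one-to-one matching, precision-recall integration) is not shown.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction counts as TP when its IoU with the matched GT box meets the threshold."""
    return iou(pred_box, gt_box) >= threshold


# The ten thresholds averaged by the COCO mAP@0.5:0.95 metric.
COCO_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
print(overlap, is_true_positive((0, 0, 2, 2), (1, 1, 3, 3)))
```
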
6. Loss Functions Summary
| Loss | Formula | Use case |
|---|---|---|
| MSE | $\frac{1}{N}\sum(y-\hat{y})^2$ | Regression (sensitive to outliers) |
| Smooth L1 | $0.5e^2$ if $\lvert e \rvert < 1$, else $\lvert e \rvert - 0.5$ | Box regression (less sensitive to outliers than MSE) |
| BCE | $-[y\log p + (1-y)\log(1-p)]$ | Binary classification |
| Cross-entropy | $-\sum y_c \log p_c$ | Multi-class classification |
| Focal | $-(1-p_t)^\gamma \log(p_t)$ | Class-imbalanced detection |
| Dice | $1 - \frac{2\lvert A\cap B\rvert}{\lvert A\rvert + \lvert B\rvert}$ | Segmentation (class-imbalanced masks) |
| CIoU | $1 - \text{IoU} + \text{distance} + \text{aspect ratio}$ | Box regression |
| Triplet | $\max(d(a,p) - d(a,n) + \text{margin}, 0)$ | Metric learning, face recognition |
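
Scalar sketches of three rows from the table, written for a single prediction so the arithmetic is easy to follow. Focal loss is shown without the optional $\alpha$ class weight; Smooth L1 uses a transition point `beta` that defaults to 1. Function names are choices made for this example.

```python
import math


def binary_cross_entropy(p, y):
    """BCE for one predicted probability p in (0, 1) and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def focal_loss(p, y, gamma=2.0):
    """Focal loss: BCE down-weighted by (1 - p_t)^gamma for well-classified examples."""
    p_t = p if y == 1 else 1 - p
    return -((1 - p_t) ** gamma) * math.log(p_t)


def smooth_l1(error, beta=1.0):
    """Quadratic for |error| < beta, linear beyond it."""
    abs_e = abs(error)
    if abs_e < beta:
        return 0.5 * abs_e ** 2 / beta
    return abs_e - 0.5 * beta


easy = focal_loss(0.9, 1)   # confident and correct: heavily down-weighted
hard = focal_loss(0.1, 1)   # confident and wrong: close to plain BCE
print(easy, hard, smooth_l1(0.5), smooth_l1(2.0))
```

Setting `gamma=0` recovers plain BCE, which is a handy sanity check when implementing focal loss.
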
7. Normalization Layers
| Layer | Normalized over | Use case |
|---|---|---|
| Batch Norm | Per-channel, over batch+spatial | CNNs (batch ≥ 4) |
| Layer Norm | Per-sample, over all features | Transformers, NLP |
| Instance Norm | Per-channel, per-sample | Style transfer |
| Group Norm | Per-channel group, per-sample | Detection (small batch) |
| Sync BN | Like BN but sync across DDP ranks | Distributed training |
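Why BN fails with batch=1: each feature sees a single value, so the batch mean equals that value and the variance is 0; the normalized activation collapses to 0 and carries no signal. In conv layers the statistics come from a single sample's spatial positions, which makes the running stats used at inference unreliable. Use GN or IN instead.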
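
To make the batch=1 case concrete, the sketch below normalizes one feature across the batch dimension, as BN does in training mode for a fully connected layer. The affine scale and shift and the running statistics are left out for brevity.

```python
import math


def batch_norm_feature(values, eps=1e-5):
    """Normalize one feature across the batch (training mode, no affine parameters)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n  # biased variance, as used for normalization
    return [(v - mean) / math.sqrt(var + eps) for v in values]


normal_batch = batch_norm_feature([1.0, 2.0, 3.0, 4.0])
single_sample = batch_norm_feature([3.7])
print(normal_batch, single_sample)
```

With one sample the value equals the batch mean, so the output is 0 whatever the input was.
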
8. GPU/Hardware
Memory Breakdown for Training (ResNet-50, batch=64)
- Parameters (FP32): 25M × 4 = 100 MB
- Gradients: same as params = 100 MB
- Optimizer state (Adam): 2× params = 200 MB
- Activations (for backprop): ~1-5 GB (dominant cost)
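
The first three bullets are simple arithmetic on the parameter count. A sketch using the 25M figure above and 1 MB = 10^6 bytes; activation memory is not modeled because it depends on batch size, resolution, and architecture.

```python
def static_training_memory_mb(num_params, bytes_per_param=4, optimizer_states=2):
    """Memory for weights, gradients, and optimizer state, in MB (1 MB = 1e6 bytes)."""
    params = num_params * bytes_per_param / 1e6
    grads = params                         # one gradient value per parameter
    optimizer = optimizer_states * params  # Adam keeps two moment buffers per parameter
    return {
        "params": params,
        "grads": grads,
        "optimizer": optimizer,
        "total_static": params + grads + optimizer,
    }


resnet50 = static_training_memory_mb(25_000_000)
print(resnet50)
```
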
Reducing Memory
- Mixed precision (FP16/BF16): roughly halves activation memory; parameter and gradient memory halve only if no FP32 master copy of the weights is kept
- Gradient checkpointing: recompute activations on backward, save only checkpoints
- FSDP: shard model+optimizer across GPUs
- Reduce batch size: decrease activation memory
Throughput Bottlenecks
- Kernel launch overhead: use larger batches
- Memory bandwidth: fuse elementwise ops and use lower precision to move fewer bytes; keep matmul/conv dims multiples of 8 so tensor cores are used
- Data loading: use pin_memory=True, num_workers=4-8
- PCIe bandwidth: use CUDA streams, async transfers
9. Common Interview Pitfalls
"What's the difference between overfitting and high variance?"
They're the same thing. Overfitting = high variance = model memorizes training noise, fails to generalize.
"When does batch norm hurt?"
- Very small batches (< 4): batch statistics are too noisy to be reliable
- Gradient checkpointing: the forward pass is recomputed during backward, so BN running stats can be updated twice per step
- Online fine-tuning on a shifted distribution: BN running stats no longer match the data
"How do you debug a model that won't train?"
- Check loss on a single batch first — should decrease with enough capacity
- Verify data loading (visualize a batch)
- Check for NaN/Inf in outputs (exploding gradients or bad initialization)
- Monitor gradient norms per layer
- Reduce to simplest possible model, add complexity incrementally
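
The first check in the list (overfit a single batch) can be sketched without any framework. Here a two-parameter linear model and hand-written gradient descent stand in for a real network and optimizer; the data, learning rate, and step count are made up for this example.

```python
def overfit_single_batch(xs, ys, lr=0.05, steps=500):
    """Train y = w * x + b on one fixed batch and return the MSE loss at every step."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(steps):
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        losses.append(sum(e * e for e in errors) / n)
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return losses


batch_x = [1.0, 2.0, 3.0, 4.0]
batch_y = [2 * x + 1 for x in batch_x]  # noiseless targets the model can fit exactly
losses = overfit_single_batch(batch_x, batch_y)
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.2e}")
```

If the loss on one fixed batch does not fall toward zero in a setup like this, the bug is more likely in the data, loss, or update step than in model capacity.
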
10. System Design Quick Reference
5-Step Framework
- Clarify: Latency? Accuracy? Scale? Online or batch?
- Estimate: Data size, compute, bandwidth needed
- Design: Pipeline stages, data flow
- Scale: Bottlenecks, horizontal scaling, caching
- Monitor: Metrics, alerts, drift detection
Common Tradeoffs
- Latency vs Throughput: larger batches raise throughput but also raise per-request latency
- Accuracy vs Speed: smaller models, quantization, and pruning trade accuracy for speed
- Real-time vs Batch: streaming (Kafka + GPU workers) vs offline batch jobs (MapReduce)
- Consistency vs Availability (CAP): cached detection results may be stale
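
The latency vs throughput tradeoff can be illustrated with a toy cost model: each batch pays a fixed overhead plus a per-item cost. The 5 ms and 1 ms figures are hypothetical, not measurements from any GPU.

```python
def batch_latency_ms(batch_size, fixed_overhead_ms=5.0, per_item_ms=1.0):
    """Time to process one batch under a simple linear cost model."""
    return fixed_overhead_ms + per_item_ms * batch_size


def throughput_per_s(batch_size, fixed_overhead_ms=5.0, per_item_ms=1.0):
    """Items processed per second when batches run back to back."""
    latency_s = batch_latency_ms(batch_size, fixed_overhead_ms, per_item_ms) / 1000.0
    return batch_size / latency_s


for bs in (1, 8, 32):
    print(f"batch={bs}: latency {batch_latency_ms(bs):.0f} ms, "
          f"throughput {throughput_per_s(bs):.0f} items/s")
```

Under this model throughput saturates near `1000 / per_item_ms` items/s while latency keeps growing linearly, so the batch size is usually chosen from the latency budget.
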