CV Engineer Interview Prep — Concepts Cheatsheet

Quick reference for the topics that come up most often in CV engineer interviews. Use it for rapid review in the week before your interview.


1. Convolutional Neural Networks

Core Operations

| Operation | Formula | Purpose |
|---|---|---|
| Convolution | $y_{i,j} = \sum_{m,n} x_{i+m,j+n} \cdot k_{m,n}$ | Feature extraction |
| Max pooling | $y = \max_{w \times w \text{ region}} x$ | Spatial invariance, dimensionality reduction |
| Global avg pool | $y = \frac{1}{HW} \sum_{i,j} x_{i,j}$ | Replace FC layers, parameter reduction |
| Depthwise sep conv | DW conv + pointwise conv | MobileNet: 8-9× fewer FLOPs than regular conv |
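The 8-9× figure can be checked with a quick multiply-accumulate count. The sketch below assumes stride 1, "same" padding, and a hypothetical layer shape chosen only for illustration; the function names are not from any library.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates for a standard k x k convolution (stride 1, same padding)."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel) plus a 1 x 1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


# Hypothetical layer: 56 x 56 feature map, 128 -> 256 channels, 3 x 3 kernel.
standard = conv_macs(56, 56, 128, 256, 3)
separable = depthwise_separable_macs(56, 56, 128, 256, 3)
print(f"reduction: {standard / separable:.2f}x")  # reduction: 8.69x
```

The ratio simplifies to $k^2 C_{out} / (k^2 + C_{out})$, which approaches $k^2 = 9$ for a 3×3 kernel as the output channel count grows.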

Receptive Field

$$RF_k = RF_{k-1} + (f_k - 1) \cdot \prod_{i<k} s_i$$

where $f_k$ = kernel size at layer $k$, $s_i$ = stride at layer $i$.

Dilation: enlarges the receptive field without reducing spatial resolution. A dilated conv with rate $d$ leaves gaps of $d-1$ between kernel elements, so a single 3×3 kernel covers $(2d+1) \times (2d+1)$ input positions.
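The recurrence and the dilation rule can be combined in a few lines. This is a minimal sketch, assuming each layer is described by a hypothetical `(kernel_size, stride, dilation)` tuple and that a dilated kernel spans `dilation * (kernel_size - 1) + 1` input positions.

```python
def receptive_field(layers):
    """Receptive field (along one axis) of a stack of conv/pool layers.

    Each layer is a (kernel_size, stride, dilation) tuple.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # product of the strides of all previous layers
    for kernel, stride, dilation in layers:
        effective_kernel = dilation * (kernel - 1) + 1
        rf += (effective_kernel - 1) * jump
        jump *= stride
    return rf


print(receptive_field([(3, 1, 1)] * 3))          # three 3x3 convs -> 7
print(receptive_field([(3, 2, 1), (3, 1, 1)]))   # 3x3 stride 2, then 3x3 -> 7
print(receptive_field([(3, 1, 2)]))              # one 3x3 conv, dilation 2 -> 5
```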


2. Optimization

Gradient Descent Variants

| Method | Update | When to use |
|---|---|---|
| SGD | $\theta \leftarrow \theta - \alpha g$ | Large datasets, sparse updates |
| SGD+Momentum | $v \leftarrow \beta v + g$; $\theta \leftarrow \theta - \alpha v$ | Faster convergence, escapes local minima |
| RMSprop | $v \leftarrow \beta v + (1-\beta)g^2$; $\theta \leftarrow \theta - \alpha g/\sqrt{v+\epsilon}$ | Non-stationary, RNNs |
| Adam | 1st + 2nd moment with bias correction | Default choice for most tasks |
| AdamW | Adam + decoupled weight decay | Transformers, large models |
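The Adam and AdamW rows are the only ones without an explicit update, so a scalar sketch may help. It assumes the commonly used defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$); `adam_step` is a hypothetical helper, not a library function, and setting `weight_decay > 0` gives the decoupled (AdamW-style) variant.

```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    # Decoupled weight decay: shrink the weight directly, not through the gradient.
    theta = theta * (1 - lr * weight_decay)
    m = beta1 * m + (1 - beta1) * grad          # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction, since m and v start at 0
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# Minimize f(theta) = theta^2, whose gradient is 2 * theta.
theta, m, v = 1.0, 0.0, 0.0
for t in range(1, 501):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
print(theta)
```

On the first step the bias-corrected ratio is $g / |g|$, so the parameter moves by almost exactly `lr` regardless of gradient scale.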

Learning Rate Schedules

  • Warmup + cosine decay: standard for transformers. Prevents early instability.
  • OneCycleLR: fast training, often best for CNNs.
  • Linear scaling rule: multiply LR by $k$ when batch size is $k \times$ baseline.
  • Gradient clipping: clip norm to 1.0 — prevents exploding gradients in RNNs/transformers.
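The schedule, scaling rule, and clipping above can be sketched in plain Python. Function names and default values here are illustrative assumptions; frameworks ship their own schedulers and clipping utilities with different signatures.

```python
import math


def warmup_cosine_lr(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup up to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))


def scaled_lr(base_lr, base_batch, batch):
    """Linear scaling rule: LR grows in proportion to batch size."""
    return base_lr * batch / base_batch


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (norm + 1e-6))
    return [g * scale for g in grads]


for step in (0, 99, 100, 550, 1000):
    print(step, round(warmup_cosine_lr(step, 1000, 100, base_lr=3e-4), 6))
print(clip_by_global_norm([3.0, 4.0]))  # norm 5 rescaled to roughly 1
```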

3. Regularization

| Technique | How it works | When to use |
|---|---|---|
| L2 (weight decay) | Penalize $\lVert\theta\rVert_2^2$, which shrinks weights toward 0 | Always, standard |
| Dropout | Zero activations with prob $p$ during training, scale survivors by $1/(1-p)$ | FC layers, transformers |
| Batch Norm | Normalize activations within batch | CNNs, stabilizes training |
| Data augmentation | Artificially expand training set | All tasks |
| Label smoothing | Replace hard 0/1 targets with $\epsilon/(C-1)$ and $1-\epsilon$ | Classification, large datasets |
| Mixup | Blend two images: $\tilde{x} = \lambda x_i + (1-\lambda)x_j$ (labels are blended with the same $\lambda$) | Classification, detection |
| CutMix | Cut patch from one image, paste into another; mix labels by patch area | Classification; improves localization |
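Label smoothing and Mixup both reduce to a couple of lines of arithmetic. A minimal sketch on plain Python lists, assuming the $\epsilon/(C-1)$ smoothing variant from the table and that Mixup blends the label vectors with the same $\lambda$ as the inputs; both function names are hypothetical.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Soft target: 1 - eps on the true class, eps / (C - 1) on every other class."""
    off_value = eps / (num_classes - 1)
    return [1.0 - eps if c == target else off_value for c in range(num_classes)]


def mixup(x_i, x_j, y_i, y_j, lam):
    """Blend two flattened inputs and their label vectors with the same weight."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


print(smooth_labels(target=2, num_classes=5))
print(mixup([0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam=0.75))
```

In practice $\lambda$ is usually drawn per batch from a Beta distribution (for example with `random.betavariate`); it is fixed here so the output is reproducible.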

4. Architectures

ResNet — Skip Connections

$$\mathcal{F}(x) = H(x) - x \rightarrow \text{learn residual, not full mapping}$$

Key insight: the gradient can flow directly through the skip connection, which mitigates vanishing gradients and makes networks of 100+ layers trainable.
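Writing the block output as $y = x + \mathcal{F}(x)$ makes the direct gradient path explicit. A short derivation for the scalar case, where $\mathcal{L}$ denotes the training loss (a symbol introduced here for illustration):

$$\frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(1 + \frac{\partial \mathcal{F}}{\partial x}\right)$$

The constant 1 passes the upstream gradient through unchanged, so the gradient does not vanish just because $\partial \mathcal{F}/\partial x$ is small. For vector-valued $x$ the 1 becomes an identity matrix.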

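Bottleneck block: 1×1→3×3→1×1 convolutions. The first 1×1 reduces channels (typically 4×) and the last 1×1 expands them back, so the expensive 3×3 runs at reduced width. At 256 channels this needs about 17× fewer weights than a basic block of two 3×3 convs at the same width, and roughly matches a 64-channel basic block.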
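How much a bottleneck saves depends on which block widths are compared. The weight count below is a rough sketch for one common configuration (256 channels, 4× reduction), ignoring biases and batch-norm parameters; the helper names are hypothetical.

```python
def conv_weights(c_in, c_out, k):
    """Weight count of a k x k convolution, ignoring biases and BN."""
    return c_in * c_out * k * k


def basic_block_weights(channels):
    """Two 3x3 convs at full width."""
    return 2 * conv_weights(channels, channels, 3)


def bottleneck_weights(channels, reduction=4):
    """1x1 reduce -> 3x3 at reduced width -> 1x1 expand."""
    mid = channels // reduction
    return (conv_weights(channels, mid, 1)
            + conv_weights(mid, mid, 3)
            + conv_weights(mid, channels, 1))


basic_256 = basic_block_weights(256)    # 1,179,648
bottleneck_256 = bottleneck_weights(256)  # 69,632
basic_64 = basic_block_weights(64)      # 73,728
print(f"{basic_256 / bottleneck_256:.1f}x fewer weights than a 256-channel basic block")
print(f"{basic_64 / bottleneck_256:.2f}x the weights of the bottleneck for a 64-channel basic block")
```

FLOPs scale with the same ratios at a fixed feature-map size, since each weight is applied once per output position.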

EfficientNet — Compound Scaling

Scale depth $d = \alpha^\phi$, width $w = \beta^\phi$, resolution $r = \gamma^\phi$ such that $\alpha\beta^2\gamma^2 \approx 2$ (FLOPs double per step).
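A quick numeric check of the constraint, assuming the coefficients commonly cited from the EfficientNet grid search ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$); treat them as illustrative values rather than something this cheatsheet specifies.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # assumed depth, width, resolution bases


def compound_scale(phi):
    """Depth, width, and resolution multipliers plus the implied FLOPs multiplier."""
    depth = ALPHA ** phi
    width = BETA ** phi
    resolution = GAMMA ** phi
    # FLOPs grow linearly with depth and quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


for phi in range(4):
    d, w, r, f = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}, FLOPs x{f:.2f}")
```

With these values the per-step FLOPs factor is about 1.92, which is the "approximately 2" in the constraint.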

Vision Transformer (ViT)

  • Split image into $P \times P$ patches (typically 16×16)
  • Linear projection → sequence of tokens
  • Add [CLS] token + positional embeddings
  • Stack Transformer encoder layers
  • Classify using [CLS] output
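The patching step fixes the sequence length the encoder sees. A small sketch of the arithmetic, assuming a square RGB image whose side is divisible by the patch size (224 and 16 are used as illustrative defaults):

```python
def vit_sequence(image_size=224, patch_size=16, channels=3):
    """Return (num_patches, flattened_patch_dim, sequence_length_with_cls)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    patch_dim = patch_size * patch_size * channels  # input size of the linear projection
    seq_len = num_patches + 1                       # +1 for the [CLS] token
    return num_patches, patch_dim, seq_len


print(vit_sequence())  # (196, 768, 197)
```

Self-attention cost grows quadratically with the sequence length, so halving the patch size roughly multiplies attention cost by 16.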

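Limitation: requires more data than CNNs because it has weaker inductive biases (no built-in locality or translation equivariance). Pretrain on a large dataset such as JFT-300M, or use a DeiT-style training recipe with strong data augmentation.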


5. Object Detection

Single-Stage vs Two-Stage

| | Single-Stage (YOLO, SSD) | Two-Stage (Faster R-CNN) |
|---|---|---|
| Speed | Fast (30-100+ FPS) | Slow (5-15 FPS) |
| Accuracy | Good | Better (especially small objects) |
| Anchors | Yes (YOLO v3-v5) or no (v8) | Yes (RPN) |
| Use case | Real-time | High-accuracy offline |

Key Metrics

  • mAP@0.5: IoU threshold = 0.5 for TP/FP determination
  • mAP@0.5:0.95: COCO metric, average over [0.5, 0.55, ..., 0.95]
  • AP50 > 0.7: a rough rule of thumb for production readiness; the real bar depends on the application's cost of misses and false alarms
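Both mAP variants rest on the same IoU test. A minimal sketch, assuming axis-aligned boxes in `(x1, y1, x2, y2)` format; the helper names are hypothetical, and full mAP additionally requires ranking predictions by confidence and integrating the precision-recall curve.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred, gt, threshold=0.5):
    """A prediction counts as TP when its IoU with the ground truth meets the threshold."""
    return iou(pred, gt) >= threshold


# The ten IoU thresholds averaged by the COCO-style metric: 0.50, 0.55, ..., 0.95.
COCO_THRESHOLDS = [0.5 + 0.05 * i for i in range(10)]

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))               # 1/7, about 0.143
print(is_true_positive((0, 0, 2, 2), (0, 0, 2, 3)))  # IoU = 2/3 -> True
```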

6. Loss Functions Summary

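| Loss | Formula | Use case |
|---|---|---|
| MSE | $\frac{1}{N}\sum(y-\hat{y})^2$ | Regression (sensitive to outliers) |
| Smooth L1 | Quadratic for $\lvert e \rvert < \beta$, linear otherwise | Box regression (robust to outliers) |
| BCE | $-[y\log p + (1-y)\log(1-p)]$ | Binary classification |
| Cross-entropy | $-\sum y_c \log p_c$ | Multi-class classification |
| Focal | $-(1-p_t)^\gamma \log(p_t)$ | Class-imbalanced detection |
| Dice | $1 - \frac{2\lvert A\cap B\rvert}{\lvert A\rvert + \lvert B\rvert}$ | Segmentation |
| CIoU | $1 - \text{IoU} + \text{distance} + \text{aspect ratio}$ | Box regression |
| Triplet | $\max(d(a,p) - d(a,n) + \text{margin}, 0)$ | Metric learning, face recognition |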
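Three of the less familiar losses are easy to sketch on scalars and flat lists. The code assumes the common Smooth L1 form with threshold `beta`, the focal loss without the optional class-balancing weight, and a soft Dice loss with a small `eps` for numerical stability; none of these choices is prescribed by the table.

```python
import math


def smooth_l1(error, beta=1.0):
    """Quadratic for |error| < beta, linear beyond it."""
    a = abs(error)
    return 0.5 * a * a / beta if a < beta else a - 0.5 * beta


def focal_loss(p_t, gamma=2.0):
    """Cross-entropy down-weighted by (1 - p_t)^gamma for well-classified examples."""
    return -((1 - p_t) ** gamma) * math.log(p_t)


def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss on flat lists of probabilities and binary labels."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    return 1 - (2 * intersection + eps) / (total + eps)


print(smooth_l1(0.5), smooth_l1(2.0))     # 0.125 1.5
print(focal_loss(0.9), -math.log(0.9))    # focal is 100x smaller than CE at p_t = 0.9
print(dice_loss([1.0, 0.0, 1.0], [1, 0, 1]))
```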

7. Normalization Layers

| Layer | Normalized over | Use case |
|---|---|---|
| Batch Norm | Per-channel, over batch+spatial | CNNs (batch ≥ 4) |
| Layer Norm | Per-sample, over all features | Transformers, NLP |
| Instance Norm | Per-channel, per-sample | Style transfer |
| Group Norm | Per-channel group, per-sample | Detection (small batch) |
| Sync BN | Like BN but sync across DDP ranks | Distributed training |

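Why BN fails with batch=1: on a fully connected layer the batch variance is 0, so every normalized activation collapses to 0 and the output is just the learned bias. On a conv layer the statistics come from one image's spatial positions, so BN degenerates into a noisy instance norm and the running statistics become unreliable. Use GN or IN instead.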
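The batch=1 failure is easy to reproduce for a single fully connected feature. A minimal sketch with a hypothetical `normalize` helper that omits the learned scale and shift:

```python
import math


def normalize(values, eps=1e-5):
    """Subtract the mean and divide by the standard deviation of `values`."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


# One feature across a batch of 4: the statistics are meaningful.
print(normalize([0.5, 1.5, 2.5, 3.5]))
# Same feature with a batch of 1: mean equals the value, variance is 0.
print(normalize([2.5]))    # [0.0]
print(normalize([123.0]))  # [0.0], the input value is lost entirely
```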


8. GPU/Hardware

Memory Breakdown for Training (ResNet-50, batch=64)

  • Parameters (FP32): ~25M params × 4 bytes ≈ 100 MB
  • Gradients: same size as params ≈ 100 MB
  • Optimizer state (Adam): two moment buffers, 2× params ≈ 200 MB
  • Activations (for backprop): ~1-5 GB (dominant cost)
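The fixed part of the budget is simple arithmetic. A sketch assuming FP32 storage, 1 MB = 10^6 bytes, and two optimizer buffers per parameter as in Adam; activation memory is left out because it depends on batch size, resolution, and framework behavior.

```python
def training_state_mb(num_params, bytes_per_param=4, optimizer_slots=2):
    """Memory in MB for weights, gradients, and optimizer state (no activations)."""
    params = num_params * bytes_per_param / 1e6
    grads = params                          # one gradient per parameter
    optimizer = optimizer_slots * params    # Adam keeps two moment buffers
    return {
        "params": params,
        "grads": grads,
        "optimizer": optimizer,
        "total": params + grads + optimizer,
    }


print(training_state_mb(25_000_000))
# {'params': 100.0, 'grads': 100.0, 'optimizer': 200.0, 'total': 400.0}
```

The roughly 400 MB of fixed state is small next to the activation estimate above, which is why the memory-saving techniques that follow mostly target activations.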

Reducing Memory

  1. Mixed precision (FP16/BF16): roughly halves activation memory; parameter and optimizer memory shrinks only if no FP32 master copy is kept
  2. Gradient checkpointing: recompute activations on backward, save only checkpoints
  3. FSDP: shard model+optimizer across GPUs
  4. Reduce batch size: decrease activation memory

Throughput Bottlenecks

  1. Kernel launch overhead: use larger batches
  2. Compute and memory bandwidth: use mixed precision so matmuls run on tensor cores (keep dimensions multiples of 8) and tensors move half as many bytes
  3. Data loading: use pin_memory=True, num_workers=4-8
  4. PCIe bandwidth: use CUDA streams, async transfers

9. Common Interview Pitfalls

"What's the difference between overfitting and high variance?"

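They describe the same failure from two angles. High variance is the bias-variance term for a model that is too sensitive to the particular training sample; overfitting is the symptom you observe: the model memorizes training noise and fails to generalize.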

"When does batch norm hurt?"

  1. Very small batches (< 4): the variance estimate is unreliable
  2. Gradient checkpointing: the recomputed forward pass can update BN running statistics a second time per step
  3. Online fine-tuning on a different distribution: BN running statistics no longer match the data

"How do you debug a model that won't train?"

  1. Overfit a single batch first: with enough capacity the loss should drop to near zero
  2. Verify data loading (visualize a batch)
  3. Check for NaN/Inf in outputs (exploding gradients or bad initialization)
  4. Monitor gradient norms per layer
  5. Reduce to simplest possible model, add complexity incrementally
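Steps 1, 3, and 4 can be rehearsed on the simplest possible model. The sketch below fits a one-variable linear regression to a single fixed batch with plain gradient descent, a toy stand-in rather than a real training loop; the data and hyperparameters are made up for illustration.

```python
import math


def train_single_batch(xs, ys, lr=0.05, steps=500):
    """Fit y = w * x + b on one fixed batch, logging loss and gradient norm."""
    w, b = 0.0, 0.0
    n = len(xs)
    history = []
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        grad_norm = math.sqrt(grad_w ** 2 + grad_b ** 2)
        # Step 3: fail fast on NaN/Inf instead of training on garbage.
        if not (math.isfinite(loss) and math.isfinite(grad_norm)):
            raise FloatingPointError("loss or gradient is NaN/Inf")
        history.append((loss, grad_norm))  # step 4: keep gradient norms for inspection
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, history


xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]  # generated by y = 2x + 1
w, b, history = train_single_batch(xs, ys)
print(f"w={w:.3f}, b={b:.3f}")
print(f"first loss {history[0][0]:.3f}, last loss {history[-1][0]:.2e}")
```

If the loss on a single batch does not approach zero in a setup like this, the problem is in the model, loss, or optimizer wiring rather than in generalization.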

10. System Design Quick Reference

5-Step Framework

  1. Clarify: Latency? Accuracy? Scale? Online or batch?
  2. Estimate: Data size, compute, bandwidth needed
  3. Design: Pipeline stages, data flow
  4. Scale: Bottlenecks, horizontal scaling, caching
  5. Monitor: Metrics, alerts, drift detection

Common Tradeoffs

  • Latency vs Throughput: larger batches raise throughput but also raise per-request latency
  • Accuracy vs Speed: trade accuracy for speed with a smaller model, quantization, or pruning
  • Real-time vs Batch: streaming (Kafka + GPU workers) vs MapReduce
  • Consistency vs Availability (CAP): cached detection results may be stale
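The latency/throughput tradeoff can be made concrete with a toy cost model. The sketch assumes a hypothetical linear model, a fixed per-batch overhead plus a per-item cost, with made-up millisecond values; real GPUs are not this linear, but the shape of the tradeoff is the same.

```python
def batch_latency_ms(batch_size, fixed_ms=5.0, per_item_ms=1.0):
    """Assumed cost model: fixed overhead (launch, transfer) plus per-item compute."""
    return fixed_ms + per_item_ms * batch_size


def throughput_per_s(batch_size, fixed_ms=5.0, per_item_ms=1.0):
    """Items processed per second when batches run back to back."""
    return batch_size / batch_latency_ms(batch_size, fixed_ms, per_item_ms) * 1000.0


for batch_size in (1, 8, 32, 128):
    print(f"batch={batch_size:>3}: latency {batch_latency_ms(batch_size):6.1f} ms, "
          f"throughput {throughput_per_s(batch_size):7.1f} items/s")
```

Throughput saturates at `1000 / per_item_ms` items per second while latency keeps growing, so past some batch size extra batching only costs latency.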