CV Engineer Interview Prep — Concepts Cheatsheet

Quick reference for every topic that comes up in CV engineer interviews. Use this the week before your interview for rapid review.

1. Convolutional Neural Networks

Core Operations

Operation	Formula	Purpose
Convolution	$y_{i,j} = \sum_{m,n} x_{i+m,j+n} \cdot k_{m,n}$	Feature extraction
Max pooling	$y = \max_{w \times w \text{ region}} x$	Spatial invariance, dimensionality reduction
Global avg pool	$y = \frac{1}{HW} \sum_{i,j} x_{i,j}$	Replace FC layers, parameter reduction
Depthwise sep conv	DW conv + pointwise conv	MobileNet: 8-9× fewer FLOPs than regular conv
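
The 8-9× figure in the last row can be checked with a quick multiply-accumulate (MAC) count. This is a rough sketch: the layer shape (56×56 map, 128 → 256 channels, 3×3 kernel) and the helper names are illustrative assumptions, not taken from MobileNet itself.

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs for a standard k x k conv (stride 1, 'same' padding assumed)."""
    return h * w * c_in * c_out * k * k


def depthwise_separable_macs(h, w, c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel) + 1x1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise


# Hypothetical layer shape, chosen only for illustration.
standard = conv_macs(56, 56, 128, 256, 3)
separable = depthwise_separable_macs(56, 56, 128, 256, 3)
reduction = standard / separable
print(f"{reduction:.2f}x fewer MACs")
```

The ratio simplifies to $1 / (1/C_{out} + 1/k^2)$, so for 3×3 kernels it approaches 9× as the output channel count grows.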

Receptive Field

$$RF_k = RF_{k-1} + (f_k - 1) \cdot \prod_{i<k} s_i$$

where $f_k$ = kernel size at layer $k$, $s_i$ = stride at layer $i$.
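
A minimal sketch of the recurrence above, assuming layers are given as (kernel size, stride) pairs ordered from input to output; the function name is hypothetical.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) tuples, ordered input -> output."""
    rf = 1    # RF_0: a single input pixel
    jump = 1  # product of the strides of all earlier layers
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


three_convs = receptive_field([(3, 1), (3, 1), (3, 1)])  # three stacked 3x3 convs
stem = receptive_field([(7, 2), (3, 2), (3, 1)])          # 7x7/2 conv, 3x3/2 pool, 3x3 conv
print(three_convs, stem)  # 7 19
```

The first result is the usual argument for stacking 3×3 convs: three of them see the same 7×7 window as a single 7×7 conv.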

Dilation: enlarges the receptive field without reducing spatial resolution or adding parameters. A dilated conv with rate $d$ leaves gaps of $d-1$ between kernel elements, so a 3×3 kernel covers an effective $(2d+1) \times (2d+1)$ window.
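
The $(2d+1)$ figure is a special case of the general effective kernel size $k + (k-1)(d-1)$. A short check, with a hypothetical helper name:

```python
def effective_kernel_size(kernel_size, dilation):
    """Side length of the window a dilated kernel covers."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


sizes = [effective_kernel_size(3, d) for d in (1, 2, 4)]
print(sizes)  # [3, 5, 9]
```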

2. Optimization

Gradient Descent Variants

Method	Update	When to use
SGD	$\theta \leftarrow \theta - \alpha g$	Large datasets, sparse updates
SGD+Momentum	$v \leftarrow \beta v + g$; $\theta \leftarrow \theta - \alpha v$	Faster convergence, escapes local minima
RMSprop	$v \leftarrow \beta v + (1-\beta)g^2$; $\theta \leftarrow \theta - \alpha g/\sqrt{v+\epsilon}$	Non-stationary, RNNs
Adam	1st + 2nd moment with bias correction	Default choice for most tasks
AdamW	Adam + decoupled weight decay	Transformers, large models
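
The Adam row is the one interviewers most often ask to be spelled out. Below is a scalar sketch of the update with bias correction, assuming the commonly used defaults ($\beta_1 = 0.9$, $\beta_2 = 0.999$); the toy objective $(x-3)^2$ and the function name are illustrative assumptions.

```python
import math


def adam_minimize(grad_fn, theta, lr=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, steps=1000):
    """Minimize a 1-D function given its gradient, using the Adam update."""
    m = 0.0  # first moment: running mean of gradients
    v = 0.0  # second moment: running mean of squared gradients
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)  # bias correction: m and v start at 0
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta


# Toy objective f(x) = (x - 3)^2, whose gradient is 2 (x - 3).
theta = adam_minimize(lambda x: 2 * (x - 3.0), theta=0.0)
```

AdamW differs only in applying weight decay directly to `theta` instead of adding it to the gradient `g`.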

Learning Rate Schedules

Warmup + cosine decay: standard for transformers. Prevents early instability.
OneCycleLR: fast training, often best for CNNs.
Linear scaling rule: multiply LR by $k$ when batch size is $k \times$ baseline.
Gradient clipping: clip norm to 1.0 — prevents exploding gradients in RNNs/transformers.
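
A sketch of two items from this list: linear warmup followed by cosine decay, and clipping by global norm. The step counts, base LR, and function names are illustrative assumptions; frameworks ship their own versions of both.

```python
import math


def warmup_cosine_lr(step, base_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to base_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    span = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / span)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


schedule = [warmup_cosine_lr(s, base_lr=1e-3, warmup_steps=100, total_steps=1000)
            for s in range(1001)]
clipped = clip_by_global_norm([3.0, 4.0])  # norm 5 -> rescaled to norm 1
```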

3. Regularization

Technique	How it works	When to use
L2 (weight decay)	Penalize $\|\theta\|^2$ — shrinks weights toward 0	Always, standard
Dropout	Zero activations with prob $p$, scale by $1/(1-p)$	FC layers, transformers
Batch Norm	Normalize activations within batch	CNNs, stabilizes training
Data augmentation	Artificially expand training set	All tasks
Label smoothing	Replace hard 0/1 with $\epsilon/(C-1)$ / $1-\epsilon$	Classification, large datasets
Mixup	Blend two images: $\tilde{x} = \lambda x_i + (1-\lambda)x_j$ (labels are blended with the same $\lambda$)	Classification, detection
CutMix	Cut patch from one image, paste into another; mix labels by patch area	Classification, better localization
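
Label smoothing and Mixup are easy to get subtly wrong on a whiteboard. A small sketch on plain lists, assuming one-hot style targets; in practice $\lambda$ is usually drawn from a Beta distribution, but here it is passed in directly to keep the example deterministic. Function names are hypothetical.

```python
def smooth_labels(true_class, num_classes, eps=0.1):
    """Soft target: 1 - eps on the true class, eps / (C - 1) elsewhere."""
    off_value = eps / (num_classes - 1)
    target = [off_value] * num_classes
    target[true_class] = 1.0 - eps
    return target


def mixup(x_i, x_j, y_i, y_j, lam):
    """Blend two inputs and their label vectors with the same weight lam."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y


target = smooth_labels(true_class=2, num_classes=5)
mixed_x, mixed_y = mixup([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0], lam=0.7)
```

Both produce targets that still sum to 1, so they plug into the usual cross-entropy loss unchanged.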

4. Architectures

ResNet — Skip Connections

$$\mathcal{F}(x) = H(x) - x \rightarrow \text{learn residual, not full mapping}$$

Key insight: the gradient can flow directly through the skip connection. This mitigates vanishing gradients and the accuracy degradation seen when plain networks are made 100+ layers deep.

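Bottleneck block: 1×1→3×3→1×1 convolutions. Reduces channels (typically 4×) before the 3×3 and expands after, so the expensive 3×3 runs at a quarter of the width. This costs far fewer FLOPs than two 3×3 convs at full width (roughly 17× fewer for 256 channels).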
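
How much a bottleneck saves depends on what it is compared against. A rough MAC count per output position, assuming 256 channels with a 4× reduction to 64 (the configuration used in ResNet-50's first stage) and ignoring biases, BN, and the skip addition:

```python
def conv_macs_per_pixel(c_in, c_out, k):
    """Multiply-accumulates for one output position of a k x k conv."""
    return c_in * c_out * k * k


channels, reduced = 256, 64

# Bottleneck: 1x1 reduce -> 3x3 at reduced width -> 1x1 expand.
bottleneck = (conv_macs_per_pixel(channels, reduced, 1)
              + conv_macs_per_pixel(reduced, reduced, 3)
              + conv_macs_per_pixel(reduced, channels, 1))

# Two 3x3 convs at the full 256-channel width.
full_width_basic = 2 * conv_macs_per_pixel(channels, channels, 3)

# Basic block at the reduced 64-channel width, for comparison.
narrow_basic = 2 * conv_macs_per_pixel(reduced, reduced, 3)

ratio = full_width_basic / bottleneck
```

Under these assumptions the bottleneck costs about as much as a 64-channel basic block while exposing a 256-channel interface.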

EfficientNet — Compound Scaling

Scale depth $d = \alpha^\phi$, width $w = \beta^\phi$, resolution $r = \gamma^\phi$ such that $\alpha\beta^2\gamma^2 \approx 2$ (FLOPs double per step).
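
A quick check of the constraint, assuming the coefficients reported for the EfficientNet-B0 grid search ($\alpha = 1.2$, $\beta = 1.1$, $\gamma = 1.15$); these values are not given above and are stated here as an assumption.

```python
def compound_scale(phi, alpha=1.2, beta=1.1, gamma=1.15):
    """Return (depth, width, resolution, FLOPs) multipliers for coefficient phi."""
    depth = alpha ** phi
    width = beta ** phi
    resolution = gamma ** phi
    # FLOPs scale linearly with depth, quadratically with width and resolution.
    flops = depth * width ** 2 * resolution ** 2
    return depth, width, resolution, flops


_, _, _, flops_phi1 = compound_scale(1)
_, _, _, flops_phi2 = compound_scale(2)
print(f"phi=1: {flops_phi1:.2f}x FLOPs, phi=2: {flops_phi2:.2f}x FLOPs")
```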

Vision Transformer (ViT)

Split image into $P \times P$ patches (typically 16×16)
Linear projection → sequence of tokens
Add [CLS] token + positional embeddings
Stack Transformer encoder layers
Classify using [CLS] output
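
The shapes implied by the steps above are a common follow-up question. A small sketch of the arithmetic, assuming a square RGB image whose side is divisible by the patch size; the function name is hypothetical.

```python
def vit_token_shapes(image_size=224, patch_size=16, channels=3, use_cls=True):
    """Return (number of patches, flattened patch dimension, sequence length)."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2
    patch_dim = patch_size * patch_size * channels  # input size of the linear projection
    seq_len = num_patches + (1 if use_cls else 0)   # [CLS] adds one token
    return num_patches, patch_dim, seq_len


shapes = vit_token_shapes()
print(shapes)  # (196, 768, 197)
```

Self-attention cost grows with the square of the sequence length, which is why halving the patch size (4× the tokens) is so expensive.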

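Limitation: requires more data than CNNs because it has weaker inductive biases (no built-in locality or translation equivariance). Pretrain on a large dataset such as JFT-300M, or use DeiT-style training (strong data augmentation plus distillation).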

5. Object Detection

Single-Stage vs Two-Stage

	Single-Stage (YOLO, SSD)	Two-Stage (Faster R-CNN)
Speed	Fast (30-100+ FPS)	Slow (5-15 FPS)
Accuracy	Good	Better (especially small objects)
Anchors	Yes (YOLO v3-v5) or no (v8)	Yes (RPN)
Use case	Real-time	High-accuracy offline

Key Metrics

mAP@0.5: IoU threshold = 0.5 for TP/FP determination
mAP@0.5:0.95: COCO metric, average over [0.5, 0.55, ..., 0.95]
AP50 > 0.7: a rough rule of thumb for production readiness; the real bar depends on the application's cost of misses and false alarms
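
Every metric above rests on IoU, so it is worth being able to write it from memory. A minimal sketch, assuming boxes in (x1, y1, x2, y2) corner format; matching each prediction to a single ground-truth box is simplified away here.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_true_positive(pred_box, gt_box, threshold=0.5):
    """A prediction counts as TP when its IoU with the ground truth meets the threshold."""
    return iou(pred_box, gt_box) >= threshold


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
```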

6. Loss Functions Summary

Loss	Formula	Use case
MSE	$\frac{1}{N}\sum(y-\hat{y})^2$	Regression (sensitive to outliers)
Smooth L1	Quadratic for $|e| < 1$, linear otherwise	Box regression (less sensitive to outliers than MSE)
BCE	$-[y\log p + (1-y)\log(1-p)]$	Binary classification
Cross-entropy	$-\sum y_c \log p_c$	Multi-class classification
Focal	$-(1-p_t)^\gamma \log(p_t)$	Class-imbalanced detection
Dice	$1 - \frac{2|A \cap B|}{|A| + |B|}$	Segmentation
CIoU	$1 - \text{IoU} + \text{distance} + \text{aspect ratio}$	Box regression
Triplet	$\max(d(a,p) - d(a,n) + \text{margin}, 0)$	Metric learning, face recognition
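
Scalar sketches of three losses from the table, for a single example each. The default values ($\gamma = 2$, Smooth L1 threshold $\beta = 1$, margin 0.2) are common choices assumed here, not taken from the table.

```python
import math


def focal_loss(p_t, gamma=2.0):
    """Focal loss for the probability p_t assigned to the true class."""
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def smooth_l1(error, beta=1.0):
    """Quadratic below the threshold beta, linear above it."""
    abs_e = abs(error)
    if abs_e < beta:
        return 0.5 * abs_e ** 2 / beta
    return abs_e - 0.5 * beta


def triplet_loss(d_ap, d_an, margin=0.2):
    """Hinge on anchor-positive distance minus anchor-negative distance."""
    return max(d_ap - d_an + margin, 0.0)


easy = focal_loss(0.9)  # well-classified example, strongly down-weighted
hard = focal_loss(0.1)  # misclassified example, keeps most of its loss
```

With `gamma=0` the focal loss reduces to plain cross-entropy, which is a quick way to sanity-check an implementation.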

7. Normalization Layers

Layer	Normalized over	Use case
Batch Norm	Per-channel, over batch+spatial	CNNs (batch ≥ 4)
Layer Norm	Per-sample, over all features	Transformers, NLP
Instance Norm	Per-channel, per-sample	Style transfer
Group Norm	Per-channel group, per-sample	Detection (small batch)
Sync BN	Like BN but sync across DDP ranks	Distributed training

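Why BN fails with batch=1: the statistics come from a single sample. For features without spatial dimensions (FC layers) the batch variance is exactly 0, so every output collapses to the learned bias; for conv feature maps BN degenerates into Instance Norm, and the running statistics used at inference become very noisy. Use GN or IN instead.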
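
The zero-variance case is easy to demonstrate for a single feature normalized only over the batch dimension (as in a fully connected layer). A sketch in training mode without the learned scale and shift; the function name is hypothetical.

```python
import math


def batch_norm_1d(batch, eps=1e-5):
    """Normalize one feature across the batch; returns (outputs, batch variance)."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    outputs = [(x - mean) / math.sqrt(var + eps) for x in batch]
    return outputs, var


single_out, single_var = batch_norm_1d([5.0])             # batch of one
batch_out, batch_var = batch_norm_1d([1.0, 2.0, 3.0, 4.0])
print(single_out, single_var)  # [0.0] 0.0
```

Whatever the input value, the batch-of-one output is 0, so the layer passes no information forward.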

8. GPU/Hardware

Memory Breakdown for Training (ResNet-50, batch=64)

Parameters (FP32): 25M × 4 = 100 MB
Gradients: same as params = 100 MB
Optimizer state (Adam): 2× params = 200 MB
Activations (for backprop): ~1-5 GB (dominant cost)
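
The static part of this breakdown is simple arithmetic. A sketch assuming 25M FP32 parameters and Adam's two state tensors per parameter; activation memory depends on batch size and architecture and is left out.

```python
def static_training_memory_mb(num_params, bytes_per_param=4, optimizer_states=2):
    """Memory (MB) for parameters, gradients, and optimizer state."""
    params = num_params * bytes_per_param / 1e6
    grads = params                          # one gradient per parameter
    optimizer = optimizer_states * params   # Adam keeps first and second moments
    return {
        "params": params,
        "grads": grads,
        "optimizer": optimizer,
        "total_static": params + grads + optimizer,
    }


memory = static_training_memory_mb(25_000_000)
print(memory)
```

The 400 MB total is small next to the 1-5 GB of activations, which is why the techniques that target activations matter most for CNNs of this size.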

Reducing Memory

Mixed precision (FP16/BF16): roughly halves activation memory; parameter savings are smaller because an FP32 master copy of the weights is usually kept
Gradient checkpointing: recompute activations on backward, save only checkpoints
FSDP: shard model+optimizer across GPUs
Reduce batch size: decrease activation memory

Throughput Bottlenecks

Kernel launch overhead: use larger batches
Memory bandwidth: use lower precision and fused kernels; keep tensor dims multiples of 8 so matmuls run on tensor cores
Data loading: use pin_memory=True, num_workers=4-8
PCIe bandwidth: use CUDA streams, async transfers

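9. Common Interview Questions

"When does Batch Norm fail?"

Very small batches (< 4): the variance estimate is unreliable
Very deep networks with gradient checkpointing: the recomputed forward pass updates BN running stats a second time, so they can be skewed
Online fine-tuning on a different distribution: BN running stats no longer match the data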

"How do you debug a model that won't train?"

Check loss on a single batch first — should decrease with enough capacity
Verify data loading (visualize a batch)
Check for NaN/Inf in outputs (exploding gradients or bad initialization)
Monitor gradient norms per layer
Reduce to simplest possible model, add complexity incrementally
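
The first check, overfitting a single batch, can be sketched without any framework. This toy version fits a two-parameter linear model with plain gradient descent and fails loudly on NaN/Inf; the data, learning rate, and function name are illustrative assumptions.

```python
import math


def overfit_single_batch(xs, ys, lr=0.05, steps=500):
    """Fit y = w * x + b to one fixed batch; return (w, b, loss history)."""
    w, b = 0.0, 0.0
    n = len(xs)
    losses = []
    for _ in range(steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        loss = sum(e * e for e in errors) / n
        if not math.isfinite(loss):
            raise ValueError("loss became NaN/Inf; lower the LR or check the data")
        losses.append(loss)
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b, losses


# One fixed batch generated by y = 2x + 1.
w, b, losses = overfit_single_batch([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

If the loss on one fixed batch does not fall toward zero, the bug is in the model, loss, or optimizer wiring rather than in the dataset or regularization.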

10. System Design Quick Reference

5-Step Framework

Clarify: Latency? Accuracy? Scale? Online or batch?
Estimate: Data size, compute, bandwidth needed
Design: Pipeline stages, data flow
Scale: Bottlenecks, horizontal scaling, caching
Monitor: Metrics, alerts, drift detection

Common Tradeoffs

Latency vs Throughput: a larger batch size increases throughput but also increases per-request latency
Accuracy vs Speed: a smaller model, quantization, or pruning buys speed at some cost in accuracy
Real-time vs Batch: streaming (Kafka + GPU workers) vs offline batch jobs (MapReduce)
Consistency vs Availability (CAP): cached detection results may be stale
