Interview Prep — System Design Walkthroughs
Five complete system design answers with diagrams, tradeoffs, and estimates. Practice answering each in 45 minutes. The structure: Clarify → Estimate → Design → Scale → Monitor.
Walkthrough 1: Real-Time Object Detection at Scale
Prompt: Design a system to run object detection on 1000 camera feeds in real time.
Step 1 — Clarify Requirements
- Latency: < 200ms end-to-end (camera → alert)
- Throughput: 1000 cameras × 30 FPS = 30,000 frames/second
- Accuracy: mAP@0.5 > 0.65 on the target object classes
- Scale: horizontally scalable to 10,000 cameras
Step 2 — Back-of-Envelope Estimates
- YOLOv8m: ~25ms/frame on A100 (batch=1)
- Batch=32 → ~2ms/frame effective → 500 fps/GPU
- 30,000 FPS ÷ 500 = 60 A100 GPUs (add 30% buffer → 80 GPUs)
- Storage: 1000 cams × 30 FPS × 50 KB/frame = 1.5 GB/s compressed → ~5.4 TB/hour
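The arithmetic behind these estimates can be reproduced in a few lines. This is a minimal sketch: the function name and parameter names are hypothetical, the inputs are the figures quoted in the list, and decimal units (1 GB = 10^6 KB) are assumed.

```python
import math


def estimate_capacity(cameras=1000, fps=30, ms_per_frame_batched=2.0,
                      buffer_pct=30, kb_per_frame=50):
    """Back-of-envelope GPU count and storage rate for the detection fleet."""
    frames_per_s = cameras * fps
    fps_per_gpu = 1000.0 / ms_per_frame_batched          # 2 ms/frame -> 500 fps
    base_gpus = math.ceil(frames_per_s / fps_per_gpu)
    gpus_with_buffer = math.ceil(base_gpus * (100 + buffer_pct) / 100)
    gb_per_s = frames_per_s * kb_per_frame / 1e6         # KB/s -> GB/s (decimal)
    tb_per_hour = gb_per_s * 3600 / 1e3
    return {
        "frames_per_s": frames_per_s,
        "base_gpus": base_gpus,
        "gpus_with_buffer": gpus_with_buffer,
        "gb_per_s": gb_per_s,
        "tb_per_hour": tb_per_hour,
    }


estimate = estimate_capacity()
```

With the default inputs this gives 60 GPUs before the buffer and 78 after it, which the estimate rounds up to 80, and a storage rate of 1.5 GB/s or about 5.4 TB/hour.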
Step 3 — Architecture
Cameras (RTSP)
│ (PyAV / FFmpeg)
▼
Kafka (topic: raw_frames) ← partitioned by camera_id
│
├─ GPU Worker Pool (80× A100)
│ ├─ Dynamic batching (wait ≤ 5ms or 32 frames)
│ ├─ TensorRT FP16 engine
│ └─ NMS post-processing (torchvision.ops.batched_nms)
│
├─→ Kafka (topic: detections) ← JSON events
│
├─→ TimescaleDB (time-series metrics per camera)
│
└─→ Alert Service (thresholds → PagerDuty / Slack)
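The dynamic batching rule in the worker pool (flush after 5 ms or 32 frames, whichever comes first) can be sketched with the standard library. `collect_batch` and `frame_queue` are hypothetical names, and an in-process `queue.Queue` stands in for the Kafka consumer; a serving layer such as NVIDIA Triton provides this behavior built in, so this only illustrates the policy.

```python
import queue
import time


def collect_batch(frame_queue, max_batch=32, max_wait_s=0.005):
    """Block for the first frame, then fill the batch until it is full
    or max_wait_s has elapsed since the first frame arrived."""
    batch = [frame_queue.get()]                  # wait for at least one frame
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(frame_queue.get(timeout=remaining))
        except queue.Empty:
            break                                # deadline hit with a partial batch
    return batch


# Usage: 40 queued frames yield one full batch and one partial batch.
demo_queue = queue.Queue()
for frame_id in range(40):
    demo_queue.put(frame_id)
first_batch = collect_batch(demo_queue)
second_batch = collect_batch(demo_queue)
```

The deadline starts at the first frame rather than at the call, so an idle worker does not burn its 5 ms budget before any data arrives.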
Step 4 — Scale Decisions
| Decision | Choice | Why |
|---|---|---|
| Frame queue | Kafka | Back-pressure, replay, fan-out |
| Batching strategy | Dynamic (max 32, max 5ms) | Balance latency vs throughput |
| GPU scheduling | NVIDIA Triton | Built-in dynamic batching |
| Model format | TensorRT FP16 | 3-5× faster than PyTorch |
| Camera sharding | camera_id % n_partitions | Even load distribution |
Step 5 — Monitor
- GPU utilization (target > 80%)
- Queue lag (Kafka consumer lag < 1000 messages, about 33ms of backlog at 30,000 frames/second)
- p99 inference latency
- mAP drift (weekly evaluation against labeled validation set)
- False positive rate per camera (per-site calibration needed)
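A few of these checks can be expressed as a small health function. This is a sketch under assumptions: `percentile` and `check_health` are hypothetical helpers, the GPU and lag thresholds come from the list above, and the 200 ms p99 limit simply reuses the end-to-end latency target because no separate inference budget is stated.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def check_health(latencies_ms, consumer_lag, gpu_util,
                 p99_limit_ms=200, max_lag=1000, min_gpu_util=0.80):
    """Return the names of the checks that are currently failing."""
    alerts = []
    if percentile(latencies_ms, 99) >= p99_limit_ms:
        alerts.append("p99_latency")
    if consumer_lag >= max_lag:
        alerts.append("kafka_lag")
    if gpu_util < min_gpu_util:
        alerts.append("gpu_underutilized")
    return alerts


sample_latencies = list(range(1, 101))           # 1 ms .. 100 ms
alerts = check_health(sample_latencies, consumer_lag=1500, gpu_util=0.9)
```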
Walkthrough 2: Face Recognition System
Prompt: Design a face recognition system for a 10,000-employee company.
Step 1 — Clarify
- Use case: door access control (security) + attendance tracking
- Latency: < 500ms for live access decisions
- Scale: 10K employees, ~50 doors, ~1000 face lookups/minute peak
- Accuracy: FAR (False Accept Rate) < 0.01%, FRR (False Reject Rate) < 1%
Step 2 — Estimates
- MTCNN face detection: ~20ms/frame
- ArcFace embedding: ~10ms/frame (ResNet-50 backbone)
- FAISS flat search over 10K faces: ~1ms
- Total: ~31ms → well within 500ms budget
- Storage: 10K employees × 1 embedding × 512 floats × 4 bytes = 20 MB (trivial)
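These estimates are easy to sanity-check in code. The sketch below uses hypothetical function names, takes its defaults from the figures above, and adds one assumption that is not in the list: a single worker that handles lookups one at a time, which gives a rough utilization figure at the stated peak of 1000 lookups/minute.

```python
def face_pipeline_budget(detect_ms=20, embed_ms=10, search_ms=1, budget_ms=500):
    """Return (total latency, remaining headroom) in milliseconds."""
    total = detect_ms + embed_ms + search_ms
    return total, budget_ms - total


def embedding_storage_mb(n_employees=10_000, dims=512, bytes_per_float=4,
                         embeddings_each=1):
    """Size of the embedding gallery in decimal megabytes."""
    return n_employees * embeddings_each * dims * bytes_per_float / 1e6


def worker_utilization(lookups_per_min=1000, total_ms=31):
    """Fraction of one sequential worker's time spent on lookups (assumption)."""
    return (lookups_per_min / 60.0) * (total_ms / 1000.0)


total_ms, headroom_ms = face_pipeline_budget()
storage_mb = embedding_storage_mb()
utilization = worker_utilization()
```

Under that single-worker assumption the peak load uses roughly half of one worker, so the compute requirement is as modest as the storage requirement.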
Step 3 — Architecture
Camera Frame
│
▼
Face Detection (MTCNN) ← detect + align face to 112×112
│
▼
Quality Filter ← reject blurry, occluded, non-frontal
│ (Laplacian variance > 100, face area > 5% of frame)
▼
ArcFace Embedding ← 512-dim L2-normalized vector
│
▼
FAISS IndexFlatIP ← cosine similarity search
│ similarity threshold: 0.65
├── Match found → employee_id → access_log → grant/deny
└── No match → flag for security review
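The match step at the end of this pipeline reduces to an inner-product search followed by a threshold test. The sketch below is a pure-Python stand-in for the FAISS `IndexFlatIP` lookup, written for readability rather than speed; `l2_normalize`, `best_match`, and the toy 3-dimensional vectors are hypothetical, and only the 0.65 threshold comes from the diagram.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def best_match(query, gallery, threshold=0.65):
    """gallery maps employee_id -> L2-normalized embedding.
    Returns (employee_id, score) on a match, else (None, best score seen)."""
    q = l2_normalize(query)
    best_id, best_score = None, -1.0
    for employee_id, emb in gallery.items():
        score = sum(a * b for a, b in zip(q, emb))
        if score > best_score:
            best_id, best_score = employee_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score                      # no match: flag for review


gallery = {
    "emp_001": l2_normalize([1.0, 0.0, 0.0]),
    "emp_002": l2_normalize([0.0, 1.0, 0.0]),
}
match_id, match_score = best_match([0.9, 0.1, 0.0], gallery)
unknown_id, unknown_score = best_match([1.0, 1.0, 1.0], gallery)
```

The second query sits equally far from every enrolled face (cosine of about 0.58), so it falls below the threshold and would be routed to security review.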
Step 4 — Database & Enrollment
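import faiss
import numpy as np

# Enrollment: add a new employee (arcface_model, align_face, and img come from the pipeline)
embedding = arcface_model(align_face(img))  # (512,) L2-normalized
faiss_index.add(np.asarray(embedding, dtype="float32").reshape(1, -1))  # FAISS expects float32
employee_db[faiss_index.ntotal - 1] = employee_id  # FAISS row id → employee_id

# For production: use IndexIVFFlat for > 1M faces
# nlist = 100 centroids, nprobe = 10 → scans ~10% of vectors, roughly 10x faster than flat
quantizer = faiss.IndexFlatIP(512)  # coarse quantizer that assigns vectors to centroids
index = faiss.IndexIVFFlat(quantizer, 512, 100, faiss.METRIC_INNER_PRODUCT)
index.train(all_embeddings)  # all_embeddings: (n, 512) float32 array
index.add(all_embeddings)
index.nprobe = 10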
Step 5 — Anti-Spoofing
- Liveness detection: check for eye blink, head movement, or use IR depth camera
- Presentation attack: binary classifier on face texture (MobileNetV2, trained on fake face datasets)
- Audit log: store encrypted embedding + timestamp for compliance
Walkthrough 3: Video Content Moderation
Prompt: Design a system that moderates 1M videos/day for inappropriate content.
Step 1 — Clarify
- Latency: async (within 5 minutes of upload is fine)
- Scale: 1M videos/day = ~12 videos/second average, ~100 videos/second at peak
- Content: violence, adult content, hate symbols
- SLA: < 0.1% harmful content reaches users
Step 2 — Pipeline Design
Video Upload (S3)
│
▼
Frame Sampling Service ← 1 fps for most videos
│ skip identical frames (perceptual hash)
▼
Multi-Label Classifier ← EfficientNet-B4 → 5 categories
│ batch=64, A100, ~500 fps
▼
Risk Scorer ← max(category_scores) × duration_weight
│
├── score < 0.3 → Auto-Approve
├── score 0.3-0.7 → Human Review Queue (Mechanical Turk)
└── score > 0.7 → Auto-Reject + notify uploader
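The routing logic at the bottom of this pipeline fits in one function. This is a sketch with assumptions: `route_video` is a hypothetical name, and because the diagram does not define `duration_weight`, it is treated here as a caller-supplied value in [0, 1]. Scores exactly at 0.3 or 0.7 go to human review.

```python
def route_video(category_scores, duration_weight,
                approve_below=0.3, reject_above=0.7):
    """Return (risk_score, decision) for one video.

    category_scores: dict mapping category name -> classifier score in [0, 1].
    duration_weight: assumed to be a value in [0, 1].
    """
    if not 0.0 <= duration_weight <= 1.0:
        raise ValueError("duration_weight must be in [0, 1]")
    risk = max(category_scores.values()) * duration_weight
    if risk < approve_below:
        return risk, "auto_approve"
    if risk > reject_above:
        return risk, "auto_reject"
    return risk, "human_review"


risk, decision = route_video({"violence": 0.9, "adult": 0.1}, duration_weight=1.0)
```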
Step 3 — Frame Sampling Strategy
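import cv2

def dhash(gray_9x8):
    """Difference hash: one bit per horizontally adjacent pixel pair (64 bits)."""
    bits = (gray_9x8[:, 1:] > gray_9x8[:, :-1]).flatten()
    return sum(1 << i for i, bit in enumerate(bits) if bit)

def sample_frames(video_path, fps=1.0, max_frames=300):
    """
    1 FPS + dedup = covers 99% of content with minimal compute.
    """
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # assumed fallback when the container reports 0
    interval = max(1, int(video_fps / fps))
    frames, prev_hash = [], None
    i = 0
    while cap.isOpened() and len(frames) < max_frames:
        ret, frame = cap.read()
        if not ret:
            break
        if i % interval == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            h = dhash(cv2.resize(gray, (9, 8)))  # dsize is (width, height)
            # Keep the frame only if it differs from the last kept frame by > 5 bits
            if prev_hash is None or bin(h ^ prev_hash).count('1') > 5:
                frames.append(frame)
                prev_hash = h
        i += 1
    cap.release()
    return frames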
Step 4 — Human Review Optimization
- Prioritize: sort review queue by (risk_score × video_length)
- Context: show reviewer 3 highest-risk frames + metadata
- Feedback loop: reviewer decisions → retrain classifier weekly
- Active learning: add uncertain predictions (0.4-0.6 score) to training set
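The prioritization and active-learning rules above translate directly into a sort and a filter. In this sketch the record fields (`video_id`, `risk_score`, `length_s`) and both function names are hypothetical, and video length is assumed to be measured in seconds.

```python
def prioritize(videos):
    """Order the review queue by risk_score x video length, highest first."""
    return sorted(videos, key=lambda v: v["risk_score"] * v["length_s"],
                  reverse=True)


def select_for_labeling(videos, low=0.4, high=0.6):
    """Pick the uncertain predictions to add to the training set."""
    return [v for v in videos if low <= v["risk_score"] <= high]


pending = [
    {"video_id": "a", "risk_score": 0.35, "length_s": 600},   # priority 210
    {"video_id": "b", "risk_score": 0.65, "length_s": 30},    # priority 19.5
    {"video_id": "c", "risk_score": 0.50, "length_s": 120},   # priority 60
]
review_order = [v["video_id"] for v in prioritize(pending)]
uncertain = [v["video_id"] for v in select_for_labeling(pending)]
```

Note that weighting by length pushes a long, moderately risky video ahead of a short, riskier one, which is the tradeoff this rule makes.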
Walkthrough 4: Autonomous Vehicle Perception Pipeline
Prompt: Design the perception stack for a Level 2 ADAS.
Step 1 — Requirements
- Sensors: 8 cameras (surround), 1 LiDAR, 4 radar
- Latency: < 50ms end-to-end (sensors run at 30 Hz, so a new frame arrives every ~33ms)
- Safety: must detect pedestrians at 50m with > 99.9% recall
- Compute: embedded (NVIDIA Orin, 254 TOPS)
Step 2 — Architecture
Cameras (8×) → ISP → JPEG decode → GPU memory
LiDAR → Point cloud → voxelization
Radar → CFAR detection → velocity clusters
│
▼
┌─── BEV Feature Extractor ───┐
│ Camera: LSS / BEVFusion │
│ LiDAR: PointPillars │
└─────────────────────────────┘
│ Bird's Eye View (BEV) feature map
▼
┌─── 3D Object Detection ─────┐ CenterPoint / DETR3D
├─── Lane Detection ──────────┤ BezierLaneNet
└─── Occupancy Prediction ────┘ Tesla Occupancy Networks
│
▼
Sensor Fusion ← Kalman Filter per track
│
▼
HD Map + Ego Pose ← RT-SLAM / GPS+IMU
│
▼
Planning Interface ← object list + velocity vectors
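The "Kalman Filter per track" stage can be illustrated with a deliberately small filter. The sketch below assumes a 1-D constant-velocity model with position-only measurements, which is far simpler than a production multi-sensor tracker; the class name and the noise values `p0`, `q`, and `r` are hypothetical.

```python
class ConstantVelocityKalman1D:
    """Track position and velocity along one axis from position measurements."""

    def __init__(self, pos=0.0, vel=0.0, p0=100.0, q=0.01, r=0.25):
        self.pos, self.vel = pos, vel
        # 2x2 state covariance stored as four scalars; large p0 = weak prior
        self.p00, self.p01, self.p10, self.p11 = p0, 0.0, 0.0, p0
        self.q = q      # process noise variance added to each diagonal term
        self.r = r      # measurement noise variance

    def predict(self, dt):
        # State: x <- F x with F = [[1, dt], [0, 1]]
        self.pos += dt * self.vel
        # Covariance: P <- F P F^T + Q
        p00 = self.p00 + dt * (self.p01 + self.p10) + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p10 = self.p10 + dt * self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11

    def update(self, z):
        # Measurement model: z = pos + noise, so H = [1, 0]
        innovation = z - self.pos
        s = self.p00 + self.r
        k0, k1 = self.p00 / s, self.p10 / s      # Kalman gain
        self.pos += k0 * innovation
        self.vel += k1 * innovation
        # Covariance: P <- (I - K H) P
        p00 = (1 - k0) * self.p00
        p01 = (1 - k0) * self.p01
        p10 = self.p10 - k1 * self.p00
        p11 = self.p11 - k1 * self.p01
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11


# Usage: an object moving at a steady 2 m/s, measured every 0.1 s.
kf = ConstantVelocityKalman1D()
dt = 0.1
for k in range(1, 101):
    kf.predict(dt)
    kf.update(2.0 * k * dt)
```

Although only position is measured, the filter recovers the velocity, which is what lets the planning interface receive velocity vectors for each tracked object.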
Step 3 — Latency Budget (50ms)
| Stage | Budget |
|---|---|
| Sensor capture + DMA | 5ms |
| Preprocessing (debayer, resize) | 3ms |
| BEV feature extraction | 18ms |
| Detection heads | 8ms |
| Sensor fusion + tracking | 5ms |
| Output marshaling | 1ms |
| Total | 40ms (10ms margin) |
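Keeping the budget as data makes it easy to re-check the total whenever a stage changes. A minimal sketch, with hypothetical dictionary keys and function name and the stage values taken from the table:

```python
LATENCY_BUDGET_MS = {
    "sensor_capture_dma": 5,
    "preprocessing": 3,
    "bev_feature_extraction": 18,
    "detection_heads": 8,
    "fusion_tracking": 5,
    "output_marshaling": 1,
}


def budget_margin(stages, limit_ms=50):
    """Return (total, margin) in milliseconds; a negative margin is over budget."""
    total = sum(stages.values())
    return total, limit_ms - total


total_ms, margin_ms = budget_margin(LATENCY_BUDGET_MS)
```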
Step 4 — Safety Considerations
- Redundancy: radar provides independent velocity estimates
- OOD detection: uncertainty heads on detection model; trigger conservative behavior
- Temporal consistency: detections must be tracked ≥ 3 frames before acting on them
- Simulation testing: 1 billion virtual miles before road testing
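The temporal consistency rule can be sketched as a per-track hit counter. `TrackConfirmer` is a hypothetical name, and the sketch assumes the three frames must be consecutive, which the rule above does not state explicitly.

```python
class TrackConfirmer:
    """Confirm a track only after it is seen in min_hits consecutive frames."""

    def __init__(self, min_hits=3):
        self.min_hits = min_hits
        self.hits = {}                            # track_id -> consecutive hits

    def update(self, detected_track_ids):
        """Process one frame and return the set of confirmed track ids."""
        detected = set(detected_track_ids)
        # Tracks missing from this frame are dropped, which resets their count.
        self.hits = {tid: self.hits.get(tid, 0) + 1 for tid in detected}
        return {tid for tid, n in self.hits.items() if n >= self.min_hits}


confirmer = TrackConfirmer()
frame_results = [
    confirmer.update({1}),
    confirmer.update({1, 2}),
    confirmer.update({1, 2}),      # track 1 reaches 3 hits
    confirmer.update({2}),         # track 2 reaches 3 hits, track 1 is lost
    confirmer.update({1}),         # track 1 starts over
]
```

At 30 Hz, waiting three frames adds roughly 100 ms before the system acts on a new object, which is the cost of filtering out single-frame false positives.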
Walkthrough 5: Medical Image Diagnosis System
Prompt: Design an AI system to assist radiologists reading chest X-rays.
Step 1 — Clarify
- Task: multi-label classification (14 pathologies) + localization
- Scale: 10K X-rays/day, 200 hospitals
- Latency: < 5 seconds (radiologist sees AI result before reading)
- Regulatory: FDA 510(k) clearance needed → explainability required
Step 2 — Architecture
DICOM Upload (hospital PACS)
│
▼
DICOM Parser + Normalization ← pydicom, window-level normalization
│
▼
Quality Filter ← check for rotation, artifacts, exposure
│
▼
DenseNet-121 (CheXNet-style) ← pretrained on CheXpert/NIH-14
│
├── 14 pathology scores ← sigmoid output, 0-1 confidence
│
└── GradCAM heatmaps ← highlight regions driving prediction
│
▼
Radiologist Dashboard ← highlight boxes + confidence scores
│
▼
Human Decision ← radiologist confirms/overrides
│
▼
Feedback Loop ← confirmed cases → re-training dataset
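The window-level normalization step maps raw detector intensities into a fixed range before they reach the model. The sketch below uses a simplified linear window over a flat list of pixel values; `window_level` is a hypothetical helper, it does not reproduce the exact DICOM VOI LUT formula, and a real pipeline would apply the same idea to the array returned by pydicom.

```python
def window_level(pixels, center, width):
    """Linearly map [center - width/2, center + width/2] to [0, 1], clipping outside."""
    if width <= 0:
        raise ValueError("width must be positive")
    low = center - width / 2.0
    return [min(1.0, max(0.0, (p - low) / width)) for p in pixels]


normalized = window_level([0, 1000, 1500, 2000, 3000], center=1500, width=1000)
```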
Step 3 — MLflow Experiment Tracking
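import mlflow
import mlflow.pytorch

epochs = 50  # model, val_loader, and evaluate come from the training script
with mlflow.start_run(run_name="densenet121_chexpert_v3"):
    mlflow.log_params({"model": "DenseNet121", "pretrain": "CheXpert", "epochs": epochs})
    for epoch in range(epochs):
        # The training step for this epoch goes here
        metrics = evaluate(model, val_loader)  # dict of metric name → float
        mlflow.log_metrics(metrics, step=epoch)
    # Log the final model and register it in the MLflow Model Registry
    mlflow.pytorch.log_model(model, "model",
                             registered_model_name="chest_xray_classifier")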
Step 4 — Regulatory Compliance
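- Explainability: include Grad-CAM heatmaps in the FDA submission as evidence of what drives each prediction (the FDA does not prescribe a specific explainability method)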
- Bias auditing: validate AUC separately for age/gender/race subgroups
- Model versioning: every deployed model version tracked in registry
- Shadow deployment: new model runs in parallel with existing for 30 days before replacement
- Uncertainty quantification: MC Dropout → flag low-confidence cases for mandatory human review
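The MC Dropout rule can be sketched independently of the model. The function below assumes the caller has already run several stochastic forward passes with dropout left active and collected one probability per pass for a given pathology; the function name and the 0.1 standard-deviation threshold are hypothetical and would need tuning on validation data.

```python
import statistics


def mc_dropout_summary(samples, std_threshold=0.1):
    """Summarize T stochastic predictions for one pathology.

    samples: probabilities in [0, 1], one per forward pass with dropout active.
    """
    if len(samples) < 2:
        raise ValueError("need at least two stochastic passes")
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)
    return {"mean": mean, "std": std, "needs_review": std > std_threshold}


confident = mc_dropout_summary([0.90, 0.91, 0.89, 0.90])
uncertain = mc_dropout_summary([0.2, 0.8, 0.3, 0.9])
```

The second case has a mean near 0.55 but a wide spread across passes, so it would be flagged for mandatory human review even though the averaged score looks unremarkable.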