Interview Prep — System Design Walkthroughs

Five complete system design answers with diagrams, tradeoffs, and estimates. Practice answering each in 45 minutes. The structure: Clarify → Estimate → Design → Scale → Monitor.

Walkthrough 1: Real-Time Object Detection at Scale

Prompt: Design a system to run object detection on 1000 camera feeds in real time.

Step 1 — Clarify Requirements

Latency: < 200ms end-to-end (camera → alert)
Throughput: 1000 cameras × 30 FPS = 30,000 frames/second
Accuracy: mAP@0.5 > 0.65 on your object classes
Scale: horizontally scalable to 10,000 cameras

Step 2 — Back-of-Envelope Estimates

YOLOv8m: ~25ms/frame on A100 (batch=1)
Batch=32 → ~2ms/frame effective → 500 fps/GPU
30,000 FPS ÷ 500 = 60 A100 GPUs (add 30% buffer → 80 GPUs)
Storage: 1000 cams × 30 FPS × 50 KB/frame = 1.5 GB/s compressed → ~5.4 TB/hour
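The estimates above can be reproduced in a few lines of Python. This is a minimal sketch that uses only the numbers stated in this walkthrough; the function names and the decimal unit convention (1 GB = 1e6 KB) are illustrative assumptions.

```python
import math

def gpu_estimate(cameras, fps, per_gpu_fps, buffer_pct=30):
    """Return (total frames/s, GPUs without buffer, GPUs with buffer)."""
    total_fps = cameras * fps
    base = math.ceil(total_fps / per_gpu_fps)
    with_buffer = math.ceil(base * (100 + buffer_pct) / 100)
    return total_fps, base, with_buffer

def storage_estimate(cameras, fps, kb_per_frame):
    """Return (GB/s, TB/hour) using decimal units."""
    gb_per_s = cameras * fps * kb_per_frame / 1e6
    tb_per_hour = gb_per_s * 3600 / 1e3
    return gb_per_s, tb_per_hour

total_fps, base_gpus, buffered_gpus = gpu_estimate(1000, 30, 500)
gb_per_s, tb_per_hour = storage_estimate(1000, 30, 50)
print(total_fps, base_gpus, buffered_gpus)  # 30000 60 78
print(gb_per_s, round(tb_per_hour, 1))      # 1.5 5.4
```

The buffered figure comes out at 78, which the walkthrough rounds up to 80 GPUs.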

Step 3 — Architecture

Cameras (RTSP)
    │ (PyAV / FFmpeg)
    ▼
Kafka (topic: raw_frames)    ← partitioned by camera_id
    │
    ├─ GPU Worker Pool (80× A100)
    │      ├─ Dynamic batching (wait ≤ 5ms or 32 frames)
    │      ├─ TensorRT FP16 engine
    │      └─ NMS post-processing (torchvision.ops.batched_nms)
    │
    ├─→ Kafka (topic: detections)   ← JSON events
    │
    ├─→ TimescaleDB (time-series metrics per camera)
    │
    └─→ Alert Service (thresholds → PagerDuty / Slack)
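The dynamic batching rule in the diagram (flush after 5ms or 32 frames, whichever comes first) can be sketched with the standard library. This is an illustration of the policy only, not how Triton implements it; `collect_batch` and the in-process `queue.Queue` are assumptions standing in for a Kafka consumer.

```python
import queue
import time

def collect_batch(frame_queue, max_batch=32, max_wait_s=0.005):
    """Block for the first frame, then gather more until the batch is full
    or max_wait_s has elapsed since the first frame arrived."""
    batch = [frame_queue.get()]  # wait indefinitely for the first frame
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(frame_queue.get(timeout=remaining))
        except queue.Empty:
            break
    return batch

# Usage: 40 queued frames yield one full batch, then a partial batch after the timeout.
q = queue.Queue()
for i in range(40):
    q.put(f"frame-{i}")
first = collect_batch(q)
second = collect_batch(q)
print(len(first), len(second))  # 32 8
```

Starting the timer at the first frame (rather than at the call) bounds the extra latency any single frame can pick up from batching to roughly 5ms.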

Step 4 — Scale Decisions

Decision            Choice                       Why
Frame queue         Kafka                        Back-pressure, replay, fan-out
Batching strategy   Dynamic (max 32, max 5ms)    Balance latency vs throughput
GPU scheduling      NVIDIA Triton                Built-in dynamic batching
Model format        TensorRT FP16                3-5× faster than PyTorch
Camera sharding     camera_id % n_partitions     Even load distribution
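A quick check of the sharding rule, assuming integer camera IDs that are roughly sequential and a hypothetical partition count of 16 (the walkthrough does not fix a number):

```python
from collections import Counter

def partition_for(camera_id: int, n_partitions: int) -> int:
    """Map a camera to a Kafka partition with a plain modulo."""
    return camera_id % n_partitions

# 1000 cameras spread over 16 partitions
counts = Counter(partition_for(cam, 16) for cam in range(1000))
print(min(counts.values()), max(counts.values()))  # 62 63
```

Because each camera always maps to the same partition, and Kafka preserves order within a partition, frames from one camera stay in order. If camera IDs are strings or clustered, a hash of the ID would be needed before the modulo.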

Step 5 — Monitor

GPU utilization (target > 80%)
Queue lag (Kafka consumer lag < 1000 messages)
p99 inference latency
mAP drift (weekly evaluation against labeled validation set)
False positive rate per camera (per-site calibration needed)
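The first three checks can be expressed as a small health function. This is a sketch under assumptions: the metric values would come from your monitoring stack, the function names are hypothetical, and the 200ms p99 threshold simply reuses the end-to-end latency budget from Step 1 because the list above does not state one.

```python
def p99(latencies_ms):
    """Nearest-rank 99th percentile (integer arithmetic avoids float rounding)."""
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # ceil(0.99 * n)
    return ordered[rank - 1]

def check_health(gpu_util, consumer_lag, latencies_ms, latency_budget_ms=200):
    """Return a list of alert names for metrics outside their targets."""
    alerts = []
    if gpu_util < 0.80:
        alerts.append("gpu_underutilized")
    if consumer_lag >= 1000:
        alerts.append("kafka_lag_high")
    if p99(latencies_ms) > latency_budget_ms:
        alerts.append("p99_latency_over_budget")
    return alerts

latencies = [50] * 98 + [250, 300]
print(p99(latencies))                       # 250
print(check_health(0.85, 1200, latencies))  # ['kafka_lag_high', 'p99_latency_over_budget']
```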

Walkthrough 2: Face Recognition System

Prompt: Design a face recognition system for a 10,000-employee company.

Step 1 — Clarify

Use case: door access control (security) + attendance tracking
Latency: < 500ms for live access decisions
Scale: 10K employees, ~50 doors, ~1000 face lookups/minute peak
Accuracy: FAR (False Accept Rate) < 0.01%, FRR (False Reject Rate) < 1%
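FAR and FRR are both measured at a fixed similarity threshold. A minimal sketch, with made-up scores and a hypothetical helper name, shows how the two rates are computed:

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """FAR = share of impostor pairs accepted; FRR = share of genuine pairs rejected."""
    false_accepts = sum(s >= threshold for s in impostor_scores)
    false_rejects = sum(s < threshold for s in genuine_scores)
    return false_accepts / len(impostor_scores), false_rejects / len(genuine_scores)

genuine = [0.82, 0.91, 0.60, 0.77]          # same-person similarities (made up)
impostor = [0.10, 0.35, 0.70, 0.22, 0.05]   # different-person similarities (made up)
far, frr = far_frr(genuine, impostor, threshold=0.65)
print(far, frr)  # 0.2 0.25
```

Raising the threshold lowers FAR and raises FRR, so the two targets have to be met at the same operating point. Note that verifying FAR < 0.01% needs well over 10,000 impostor comparisons, since a single false accept in 10,000 trials already equals 0.01%.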

Step 2 — Estimates

MTCNN face detection: ~20ms/frame
ArcFace embedding: ~10ms/frame (ResNet-50 backbone)
FAISS flat search over 10K faces: ~1ms
Total: ~31ms → well within 500ms budget
Storage: 10K employees × 1 embedding × 512 floats × 4 bytes = 20 MB (trivial)
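As a sanity check, the same numbers also show how busy a single sequential worker would be at peak. This is a rough sketch; treating the three stages as strictly sequential on one worker is an assumption.

```python
stage_ms = {"mtcnn_detect": 20, "arcface_embed": 10, "faiss_search": 1}
total_ms = sum(stage_ms.values())

lookups_per_s = 1000 / 60                        # peak of 1000 lookups/minute
busy_fraction = lookups_per_s * total_ms / 1000  # share of one worker's time
embedding_bytes = 10_000 * 512 * 4               # employees × floats × bytes per float

print(total_ms, round(busy_fraction, 2), embedding_bytes / 1e6)  # 31 0.52 20.48
```

At roughly 52% utilization, one worker covers the stated peak on paper; a second is still worth having for redundancy and bursts.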

Step 3 — Architecture

Camera Frame
    │
    ▼
Face Detection (MTCNN)         ← detect + align face to 112×112
    │
    ▼
Quality Filter                  ← reject blurry, occluded, non-frontal
    │ (Laplacian variance > 100, face area > 5% of frame)
    ▼
ArcFace Embedding               ← 512-dim L2-normalized vector
    │
    ▼
FAISS IndexFlatIP               ← cosine similarity search
    │ similarity threshold: 0.65
    ├── Match found → employee_id → access_log → grant/deny
    └── No match → flag for security review
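The matching step can be illustrated without FAISS. The sketch below is a brute-force stand-in for what IndexFlatIP computes (inner product over L2-normalized vectors, which equals cosine similarity); the function names, the dict-based gallery, and the 3-dimensional toy vectors are assumptions for illustration.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def best_match(query, gallery, threshold=0.65):
    """gallery maps employee_id -> L2-normalized embedding.
    Returns (employee_id, similarity), or (None, similarity) below the threshold."""
    q = l2_normalize(query)
    best_id, best_sim = None, -1.0
    for employee_id, emb in gallery.items():
        sim = sum(a * b for a, b in zip(q, emb))  # inner product of unit vectors
        if sim > best_sim:
            best_id, best_sim = employee_id, sim
    if best_sim < threshold:
        return None, best_sim
    return best_id, best_sim

gallery = {
    "emp_001": l2_normalize([1.0, 0.0, 0.0]),
    "emp_002": l2_normalize([0.0, 1.0, 0.0]),
}
print(best_match([0.9, 0.1, 0.0], gallery)[0])  # emp_001
print(best_match([0.5, 0.5, 0.7], gallery)[0])  # None
```

The second query is about equally similar to both employees (around 0.50), so it falls below 0.65 and would be flagged for security review.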

Step 4 — Database & Enrollment

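import faiss
import numpy as np

# Enrollment: add new employee (assumes arcface_model returns a NumPy array)
embedding = arcface_model(align_face(img))  # (512,) L2-normalized
# FAISS expects a float32 array of shape (n, d)
faiss_index.add(np.asarray(embedding, dtype=np.float32).reshape(1, -1))
employee_db[faiss_index.ntotal - 1] = employee_id  # FAISS row id → employee_id

# At much larger scale (> 1M faces): switch to IndexIVFFlat
# nlist = 100 centroids, nprobe = 10 → each query scans roughly 10% of the index
quantizer = faiss.IndexFlatIP(512)  # coarse quantizer that assigns vectors to centroids
index = faiss.IndexIVFFlat(quantizer, 512, 100, faiss.METRIC_INNER_PRODUCT)
index.train(all_embeddings)  # float32 array of shape (n, 512)
index.add(all_embeddings)    # training only learns centroids; vectors are added separately
index.nprobe = 10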

Step 5 — Anti-Spoofing

Liveness detection: check for eye blink, head movement, or use IR depth camera
Presentation attack: binary classifier on face texture (MobileNetV2, trained on fake face datasets)
Audit log: store encrypted embedding + timestamp for compliance

Walkthrough 3: Video Content Moderation

Prompt: Design a system that moderates 1M videos/day for inappropriate content.

Step 1 — Clarify

Latency: async (within 5 minutes of upload is fine)
Scale: 1M videos/day = ~12 videos/second average, ~100 videos/second at peak
Content: violence, adult content, hate symbols
SLA: < 0.1% harmful content reaches users
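To turn the upload rate into a compute estimate, you need an average video length, which the prompt does not give. The sketch below assumes a hypothetical 120-second average, 1 fps sampling, and the ~500 fps per A100 classifier throughput quoted in Step 2; ask the interviewer for the real distribution.

```python
import math

def moderation_load(videos_per_day, avg_video_s, sample_fps=1.0, gpu_fps=500):
    """Return (videos/s, sampled frames/s, GPUs needed) at average load."""
    videos_per_s = videos_per_day / 86_400
    frames_per_s = videos_per_s * avg_video_s * sample_fps
    return videos_per_s, frames_per_s, math.ceil(frames_per_s / gpu_fps)

videos_per_s, frames_per_s, gpus = moderation_load(1_000_000, avg_video_s=120)
print(round(videos_per_s, 1), round(frames_per_s), gpus)  # 11.6 1389 3
```

Because moderation is asynchronous, peak traffic can be absorbed by the queue as long as the backlog drains within the 5-minute target, so the fleet does not have to be sized for the full peak rate.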

Step 2 — Pipeline Design

Video Upload (S3)
    │
    ▼
Frame Sampling Service          ← 1 fps for most videos
    │ skip identical frames (perceptual hash)          
    ▼
Multi-Label Classifier          ← EfficientNet-B4 → 5 categories
    │ batch=64, A100, ~500 fps
    ▼
Risk Scorer                     ← max(category_scores) × duration_weight
    │
    ├── score < 0.3 → Auto-Approve
    ├── score 0.3-0.7 → Human Review Queue (Mechanical Turk)
    └── score > 0.7 → Auto-Reject + notify uploader
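The scoring and routing rules in the diagram can be sketched as two small functions. The walkthrough does not define `duration_weight`, so it is treated here as a plain multiplier; clamping the score to 1.0 and sending exact boundary values (0.3 and 0.7) to human review are assumptions.

```python
def risk_score(category_scores, duration_weight=1.0):
    """max(category_scores) × duration_weight, clamped to at most 1.0."""
    return min(1.0, max(category_scores.values()) * duration_weight)

def route(score, approve_below=0.3, reject_above=0.7):
    """Map a risk score to one of the three pipeline outcomes."""
    if score < approve_below:
        return "auto_approve"
    if score > reject_above:
        return "auto_reject"
    return "human_review"

scores = {"violence": 0.55, "adult": 0.10, "hate_symbols": 0.02}
print(route(risk_score(scores)))  # human_review
```

Taking the maximum rather than the mean means a single strongly flagged category is enough to escalate a video.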

Step 3 — Frame Sampling Strategy

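import cv2

def dhash(gray_9x8):
    """Difference hash: 64-bit int from a grayscale image with 8 rows and 9 columns.
    One possible implementation; any perceptual hash would work here."""
    bits = (gray_9x8[:, 1:] > gray_9x8[:, :-1]).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def sample_frames(video_path, fps=1.0, max_frames=300):
    """
    Sample about `fps` frames per second and skip near-duplicate frames,
    which covers most content with minimal compute.
    """
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # some containers report 0
    interval = max(1, int(round(video_fps / fps)))

    frames, prev_hash = [], None
    i = 0
    while len(frames) < max_frames:  # stop decoding once enough frames are kept
        ret, frame = cap.read()
        if not ret:
            break
        if i % interval == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            h = dhash(cv2.resize(gray, (9, 8)))  # dsize is (width, height)
            # Keep the frame only if it differs from the last kept frame by > 5 bits
            if prev_hash is None or bin(h ^ prev_hash).count("1") > 5:
                frames.append(frame)
                prev_hash = h
        i += 1
    cap.release()
    return frames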

Step 4 — Human Review Optimization

Prioritize: sort review queue by (risk_score × video_length)
Context: show reviewer 3 highest-risk frames + metadata
Feedback loop: reviewer decisions → retrain classifier weekly
Active learning: add uncertain predictions (0.4-0.6 score) to training set
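The prioritization and active-learning rules above can be sketched as follows. The record fields (`risk_score`, `length_s`) and function names are hypothetical.

```python
def prioritize(queue_items):
    """Sort the review queue by risk_score × video_length, highest first."""
    return sorted(queue_items,
                  key=lambda v: v["risk_score"] * v["length_s"],
                  reverse=True)

def uncertain(items, low=0.4, high=0.6):
    """Select predictions in the uncertain band for the next training set."""
    return [v for v in items if low <= v["risk_score"] <= high]

items = [
    {"id": "a", "risk_score": 0.35, "length_s": 600},
    {"id": "b", "risk_score": 0.65, "length_s": 30},
    {"id": "c", "risk_score": 0.50, "length_s": 120},
]
print([v["id"] for v in prioritize(items)])  # ['a', 'c', 'b']
print([v["id"] for v in uncertain(items)])   # ['c']
```

Note that the product ranks a long, mildly risky video ahead of a short, riskier one. That is worth calling out in an interview as a tradeoff of this formula.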

Walkthrough 4: Autonomous Vehicle Perception Pipeline

Prompt: Design the perception stack for a Level 2 ADAS system.

Step 1 — Requirements

Sensors: 8 cameras (surround), 1 LiDAR, 4 radar
Latency: < 50ms end-to-end (sensors run at 30 Hz, so a new frame arrives every 33ms)
Safety: must detect pedestrians at 50m with > 99.9% recall
Compute: embedded (NVIDIA Orin, 254 TOPS)

Step 2 — Architecture

Cameras (8×) → ISP → JPEG decode → GPU memory
LiDAR        → Point cloud → voxelization
Radar        → CFAR detection → velocity clusters

                    │
                    ▼
        ┌─── BEV Feature Extractor ───┐
        │  Camera: LSS / BEVFusion    │
        │  LiDAR:  PointPillars       │
        └─────────────────────────────┘
                    │ Bird's Eye View (BEV) feature map
                    ▼
        ┌─── 3D Object Detection ─────┐   CenterPoint / DETR3D
        ├─── Lane Detection ──────────┤   BezierLaneNet
        └─── Occupancy Prediction ────┘   Tesla Occupancy Networks
                    │
                    ▼
              Sensor Fusion           ← Kalman Filter per track
                    │
                    ▼
          HD Map + Ego Pose           ← RT-SLAM / GPS+IMU
                    │
                    ▼
          Planning Interface          ← object list + velocity vectors

Step 3 — Latency Budget (50ms)

Stage                              Budget
Sensor capture + DMA               5ms
Preprocessing (debayer, resize)    3ms
BEV feature extraction             18ms
Detection heads                    8ms
Sensor fusion + tracking           5ms
Output marshaling                  1ms
Total                              40ms (10ms margin)
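A budget table like this is easy to keep honest with a small check that fails if the stages exceed the end-to-end limit. A minimal sketch using the numbers above; the dictionary keys are illustrative.

```python
END_TO_END_LIMIT_MS = 50

budget_ms = {
    "sensor_capture_dma": 5,
    "preprocessing": 3,
    "bev_feature_extraction": 18,
    "detection_heads": 8,
    "fusion_tracking": 5,
    "output_marshaling": 1,
}

total_ms = sum(budget_ms.values())
margin_ms = END_TO_END_LIMIT_MS - total_ms
assert margin_ms >= 0, "latency budget exceeds the end-to-end limit"
print(total_ms, margin_ms)  # 40 10
```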

Step 4 — Safety Considerations

Redundancy: radar provides independent velocity estimates
OOD detection: uncertainty heads on detection model; trigger conservative behavior
Temporal consistency: detections must be tracked ≥ 3 frames before acting on them
Simulation testing: 1 billion virtual miles before road testing
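The temporal consistency rule can be sketched as a small gate in front of the planner. This version requires 3 consecutive frames and resets on any miss, which is a stricter reading than the text requires; `TrackConfirmer` is a hypothetical name, and production trackers often tolerate brief dropouts.

```python
class TrackConfirmer:
    """Release a track to the planner only after min_hits consecutive frames."""

    def __init__(self, min_hits=3):
        self.min_hits = min_hits
        self.hits = {}  # track_id -> consecutive frames seen

    def update(self, detected_track_ids):
        """Call once per frame; returns the set of track ids safe to act on."""
        detected = set(detected_track_ids)
        # Tracks missing from this frame are dropped, which resets their streak
        self.hits = {tid: self.hits.get(tid, 0) + 1 for tid in detected}
        return {tid for tid, n in self.hits.items() if n >= self.min_hits}

confirmer = TrackConfirmer()
outputs = [confirmer.update(ids) for ids in ([1], [1, 2], [1, 2], [2])]
print(outputs)  # [set(), set(), {1}, {2}]
```

At 30 Hz, waiting for 3 frames adds about 100ms before a new object is acted on, which is the price paid for suppressing single-frame false positives.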

Walkthrough 5: Medical Image Diagnosis System

Prompt: Design an AI system to assist radiologists reading chest X-rays.

Step 1 — Clarify

Task: multi-label classification (14 pathologies) + localization
Scale: 10K X-rays/day, 200 hospitals
Latency: < 5 seconds (radiologist sees AI result before reading)
Regulatory: FDA 510(k) clearance needed → explainability required

Step 2 — Architecture

DICOM Upload (hospital PACS)
    │
    ▼
DICOM Parser + Normalization     ← pydicom, window-level normalization
    │
    ▼
Quality Filter                   ← check for rotation, artifacts, exposure
    │
    ▼
DenseNet-121 (CheXNet-style)     ← pretrained on CheXpert/NIH-14
    │
    ├── 14 pathology scores       ← sigmoid output, 0-1 confidence
    │
    └── GradCAM heatmaps          ← highlight regions driving prediction
                    │
                    ▼
          Radiologist Dashboard   ← highlight boxes + confidence scores
                    │
                    ▼
          Human Decision          ← radiologist confirms/overrides
                    │
                    ▼
          Feedback Loop           ← confirmed cases → re-training dataset
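Window-level normalization, mentioned in the parser step, maps raw pixel values to [0, 1] using a window center and width. A minimal sketch on a plain list; in a real pipeline the values would typically come from the DICOM WindowCenter and WindowWidth attributes, and the function name here is hypothetical.

```python
def window_level(pixels, center, width):
    """Clip pixel values to [center - width/2, center + width/2] and scale to [0, 1]."""
    lo = center - width / 2
    hi = center + width / 2
    return [min(1.0, max(0.0, (p - lo) / (hi - lo))) for p in pixels]

print(window_level([0, 1024, 2048, 3072, 4095], center=2048, width=2048))
# [0.0, 0.0, 0.5, 1.0, 1.0]
```

Applying the same window to every image keeps intensities comparable across scanners, which matters when a model is deployed across many hospitals.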

Step 3 — MLflow Experiment Tracking

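import mlflow

# Assumes model, train_loader, val_loader, evaluate, and a hypothetical
# train_one_epoch helper are defined elsewhere
epochs = 50

with mlflow.start_run(run_name="densenet121_chexpert_v3"):
    mlflow.log_params({"model": "DenseNet121", "pretrain": "CheXpert", "epochs": epochs})
    for epoch in range(epochs):
        train_one_epoch(model, train_loader)
        metrics = evaluate(model, val_loader)  # dict of metric name → float
        mlflow.log_metrics(metrics, step=epoch)
    mlflow.pytorch.log_model(model, "model",
        registered_model_name="chest_xray_classifier")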

Step 4 — Regulatory Compliance

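Explainability: include GradCAM heatmaps in the FDA submission to show which regions drive each prediction (the 510(k) process does not prescribe a specific method)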
Bias auditing: validate AUC separately for age/gender/race subgroups
Model versioning: every deployed model version tracked in registry
Shadow deployment: new model runs in parallel with existing for 30 days before replacement
Uncertainty quantification: MC Dropout → flag low-confidence cases for mandatory human review
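Bias auditing comes down to computing the same metric per subgroup and comparing. The sketch below computes ROC AUC per group in pure Python; the records are made up, the function names are hypothetical, and the pairwise AUC is quadratic in the number of samples, so a library implementation would be preferable on real data.

```python
from collections import defaultdict

def auc(labels, scores):
    """ROC AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined without both classes
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_group(records):
    """records: iterable of (group, label, score) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for group, label, score in records:
        grouped[group][0].append(label)
        grouped[group][1].append(score)
    return {g: auc(ys, ss) for g, (ys, ss) in grouped.items()}

records = [
    ("age<40", 1, 0.9), ("age<40", 0, 0.2), ("age<40", 1, 0.7), ("age<40", 0, 0.4),
    ("age>=40", 1, 0.6), ("age>=40", 0, 0.7), ("age>=40", 1, 0.8), ("age>=40", 0, 0.3),
]
print(auc_by_group(records))  # {'age<40': 1.0, 'age>=40': 0.75}
```

A gap like the one in this toy data is the signal an audit looks for; with real data each subgroup needs enough positive and negative cases for the comparison to be meaningful.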
