Interview Prep — System Design Walkthroughs

Five complete system design answers with diagrams, tradeoffs, and estimates. Practice answering each in 45 minutes. The structure: Clarify → Estimate → Design → Scale → Monitor.


Walkthrough 1: Real-Time Object Detection at Scale

Prompt: Design a system to run object detection on 1000 camera feeds in real time.

Step 1 — Clarify Requirements

  • Latency: < 200ms end-to-end (camera → alert)
  • Throughput: 1000 cameras × 30 FPS = 30,000 frames/second
  • Accuracy: mAP@0.5 > 0.65 on your object classes
  • Scale: horizontally scalable to 10,000 cameras

Step 2 — Back-of-Envelope Estimates

  • YOLOv8m: ~25ms/frame on A100 (batch=1)
  • Batch=32 → ~2ms/frame effective → 500 fps/GPU
  • 30,000 FPS ÷ 500 = 60 A100 GPUs (add 30% buffer → 78, round up to 80 GPUs)
  • Storage: 1000 cams × 30 FPS × 50 KB/frame = 1.5 GB/s compressed → ≈ 5.4 TB/hour
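
The arithmetic above can be checked with a short script. This is a minimal sketch: the names `ceil_div` and `estimate_capacity` are illustrative, and the default inputs are simply the figures from the bullets (500 fps/GPU, 30% buffer, 50 KB/frame).

```python
def ceil_div(a, b):
    """Integer ceiling division, avoiding floating-point rounding."""
    return -(-a // b)


def estimate_capacity(cameras=1000, fps=30, gpu_fps=500,
                      buffer_pct=30, frame_kb=50):
    """Back-of-envelope GPU count and storage rate for the camera fleet."""
    total_fps = cameras * fps                        # frames/second across all feeds
    base_gpus = ceil_div(total_fps, gpu_fps)         # GPUs needed at full utilization
    gpus_with_buffer = ceil_div(base_gpus * (100 + buffer_pct), 100)
    gb_per_s = total_fps * frame_kb / 1e6            # KB/s -> GB/s (decimal units)
    tb_per_hour = gb_per_s * 3600 / 1e3              # GB/s -> TB/hour
    return {
        "total_fps": total_fps,
        "base_gpus": base_gpus,
        "gpus_with_buffer": gpus_with_buffer,
        "gb_per_s": gb_per_s,
        "tb_per_hour": tb_per_hour,
    }


estimate = estimate_capacity()
print(estimate)
```

With the default inputs the buffered GPU count comes out at 78, which rounds up to the 80 GPUs used in the architecture. Re-running with `cameras=10_000` shows what the 10× scale target implies.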

Step 3 — Architecture

Cameras (RTSP)
    │ (PyAV / FFmpeg)
    ▼
Kafka (topic: raw_frames)    ← partitioned by camera_id
    │
    ├─ GPU Worker Pool (80× A100)
    │      ├─ Dynamic batching (wait ≤ 5ms or 32 frames)
    │      ├─ TensorRT FP16 engine
    │      └─ NMS post-processing (torchvision.ops.batched_nms)
    │
    ├─→ Kafka (topic: detections)   ← JSON events
    │
    ├─→ TimescaleDB (time-series metrics per camera)
    │
    └─→ Alert Service (thresholds → PagerDuty / Slack)

Step 4 — Scale Decisions

Decision            Choice                       Why
Frame queue         Kafka                        Back-pressure, replay, fan-out
Batching strategy   Dynamic (max 32, max 5ms)    Balance latency vs throughput
GPU scheduling      NVIDIA Triton                Built-in dynamic batching
Model format        TensorRT FP16                3-5× faster than PyTorch
Camera sharding     camera_id % n_partitions     Even load distribution
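
The batching rule (flush at 32 frames or after 5 ms, whichever comes first) can be illustrated with a hand-written loop. This is only a sketch of the idea using the standard library, not Triton's implementation or API; `collect_batch` and its parameter names are hypothetical.

```python
import queue
import time


def collect_batch(frame_queue, max_batch=32, max_wait_s=0.005):
    """Return up to max_batch items, waiting at most max_wait_s after the first one."""
    batch = [frame_queue.get()]                  # block until at least one frame arrives
    deadline = time.monotonic() + max_wait_s     # the timer starts with the first frame
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                                # time budget spent: flush a partial batch
        try:
            batch.append(frame_queue.get(timeout=remaining))
        except queue.Empty:
            break                                # no more frames arrived in time
    return batch


# Usage: 40 queued frames yield one full batch and one partial batch.
frames = queue.Queue()
for frame_id in range(40):
    frames.put(frame_id)
first = collect_batch(frames, max_wait_s=1.0)
second = collect_batch(frames, max_wait_s=0.01)
print(len(first), len(second))
```

The timeout bounds the latency a frame can spend waiting for batch-mates, which is why the 5 ms wait has to be counted against the 200 ms end-to-end budget.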

Step 5 — Monitor

  • GPU utilization (target > 80%)
  • Queue lag (Kafka consumer lag < 1000 messages)
  • p99 inference latency
  • mAP drift (weekly evaluation against labeled validation set)
  • False positive rate per camera (per-site calibration needed)
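
A few of these checks can be expressed as simple threshold rules. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the 200 ms p99 budget is borrowed from the end-to-end latency requirement because the list above does not give a p99 threshold.

```python
def percentile(samples, pct):
    """Nearest-rank percentile; pct is an integer between 1 and 100."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = max(1, -(-len(ordered) * pct // 100))   # ceil(n * pct / 100)
    return ordered[rank - 1]


def check_health(gpu_util_pct, consumer_lag, latencies_ms, p99_budget_ms=200):
    """Return the names of the alerts that should fire for this snapshot."""
    alerts = []
    if gpu_util_pct < 80:                          # target: utilization above 80%
        alerts.append("gpu_utilization_low")
    if consumer_lag >= 1000:                       # target: lag below 1000 messages
        alerts.append("kafka_lag_high")
    if percentile(latencies_ms, 99) > p99_budget_ms:
        alerts.append("p99_latency_high")
    return alerts


print(check_health(gpu_util_pct=85, consumer_lag=120, latencies_ms=[40, 55, 60, 180]))
print(check_health(gpu_util_pct=50, consumer_lag=5000, latencies_ms=[300] * 10))
```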

Walkthrough 2: Face Recognition System

Prompt: Design a face recognition system for a 10,000-employee company.

Step 1 — Clarify

  • Use case: door access control (security) + attendance tracking
  • Latency: < 500ms for live access decisions
  • Scale: 10K employees, ~50 doors, ~1000 face lookups/minute peak
  • Accuracy: FAR (False Accept Rate) < 0.01%, FRR (False Reject Rate) < 1%
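
To make the accuracy targets concrete, FAR and FRR can be computed from similarity scores at a given threshold. This is a minimal sketch; `far_frr` is a hypothetical helper and the scores are made-up illustration values, not measurements.

```python
def far_frr(genuine_scores, impostor_scores, threshold):
    """
    FAR = fraction of impostor comparisons accepted (score >= threshold).
    FRR = fraction of genuine comparisons rejected (score < threshold).
    """
    if not genuine_scores or not impostor_scores:
        raise ValueError("need at least one genuine and one impostor score")
    false_accepts = sum(s >= threshold for s in impostor_scores)
    false_rejects = sum(s < threshold for s in genuine_scores)
    return (false_accepts / len(impostor_scores),
            false_rejects / len(genuine_scores))


# Illustrative scores only
genuine = [0.90, 0.80, 0.70, 0.60]    # same-person comparisons
impostor = [0.10, 0.20, 0.66, 0.30]   # different-person comparisons
far, frr = far_frr(genuine, impostor, threshold=0.65)
print(far, frr)
```

Raising the threshold lowers FAR and raises FRR. Resolving a FAR of 0.01% needs at least 10,000 impostor comparisons, since one false accept in fewer trials already exceeds that rate.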

Step 2 — Estimates

  • MTCNN face detection: ~20ms/frame
  • ArcFace embedding: ~10ms/frame (ResNet-50 backbone)
  • FAISS flat search over 10K faces: ~1ms
  • Total: ~31ms → well within 500ms budget
  • Storage: 10K employees × 1 embedding × 512 floats × 4 bytes = 20 MB (trivial)

Step 3 — Architecture

Camera Frame
    │
    ▼
Face Detection (MTCNN)         ← detect + align face to 112×112
    │
    ▼
Quality Filter                  ← reject blurry, occluded, non-frontal
    │ (Laplacian variance > 100, face area > 5% of frame)
    ▼
ArcFace Embedding               ← 512-dim L2-normalized vector
    │
    ▼
FAISS IndexFlatIP               ← cosine similarity search
    │ similarity threshold: 0.65
    ├── Match found → employee_id → access_log → grant/deny
    └── No match → flag for security review
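
The matching step can be sketched without FAISS to show what the index computes: on L2-normalized vectors, the inner product equals cosine similarity, and IndexFlatIP performs the same exhaustive comparison. The helper names and the tiny 3-dimensional gallery below are illustrative assumptions (real embeddings are 512-dimensional).

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def best_match(query, gallery, threshold=0.65):
    """
    gallery maps employee_id -> L2-normalized embedding.
    Returns (employee_id, score) on a match, or (None, best_score) otherwise.
    """
    q = l2_normalize(query)
    best_id, best_score = None, -1.0
    for employee_id, embedding in gallery.items():
        score = sum(a * b for a, b in zip(q, embedding))   # cosine similarity
        if score > best_score:
            best_id, best_score = employee_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score                                # flag for security review


gallery = {
    "alice": l2_normalize([1.0, 0.0, 0.0]),
    "bob": l2_normalize([0.0, 1.0, 0.0]),
}
print(best_match([0.9, 0.1, 0.0], gallery))
print(best_match([0.0, 0.0, 1.0], gallery))
```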

Step 4 — Database & Enrollment

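import faiss
import numpy as np

# Enrollment: add a new employee (arcface_model, align_face, and img are placeholders)
embedding = arcface_model(align_face(img))  # (512,) L2-normalized
faiss_index.add(np.asarray(embedding, dtype=np.float32).reshape(1, -1))  # FAISS expects float32, shape (n, d)
employee_db[faiss_index.ntotal - 1] = employee_id  # FAISS row id → employee_id

# At larger scale (> 1M faces), switch to IndexIVFFlat
# nlist = 100 centroids, nprobe = 10 → scans ~10% of the lists, roughly 10x faster than flat
quantizer = faiss.IndexFlatIP(512)  # coarse quantizer that assigns vectors to centroids
index = faiss.IndexIVFFlat(quantizer, 512, 100, faiss.METRIC_INNER_PRODUCT)
index.train(all_embeddings)  # float32 array of shape (n, 512); train before adding
index.add(all_embeddings)
index.nprobe = 10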

Step 5 — Anti-Spoofing

  • Liveness detection: check for eye blink, head movement, or use IR depth camera
  • Presentation attack: binary classifier on face texture (MobileNetV2, trained on fake face datasets)
  • Audit log: store encrypted embedding + timestamp for compliance

Walkthrough 3: Video Content Moderation

Prompt: Design a system that moderates 1M videos/day for inappropriate content.

Step 1 — Clarify

  • Latency: async (within 5 minutes of upload is fine)
  • Scale: 1M videos/day = ~12 videos/second average, 100 peak
  • Content: violence, adult content, hate symbols
  • SLA: < 0.1% harmful content reaches users

Step 2 — Pipeline Design

Video Upload (S3)
    │
    ▼
Frame Sampling Service          ← 1 fps for most videos
    │ skip identical frames (perceptual hash)          
    ▼
Multi-Label Classifier          ← EfficientNet-B4 → 5 categories
    │ batch=64, A100, ~500 fps
    ▼
Risk Scorer                     ← max(category_scores) × duration_weight
    │
    ├── score < 0.3 → Auto-Approve
    ├── score 0.3-0.7 → Human Review Queue (Mechanical Turk)
    └── score > 0.7 → Auto-Reject + notify uploader
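
The scoring and routing rule in the diagram can be written out directly. In this sketch `duration_weight` is taken as an input, because the diagram does not say how it is derived; the function names and route labels are hypothetical.

```python
def risk_score(category_scores, duration_weight):
    """Risk = highest category probability, scaled by a duration weight in [0, 1]."""
    if not 0.0 <= duration_weight <= 1.0:
        raise ValueError("duration_weight is assumed to lie in [0, 1]")
    return max(category_scores.values()) * duration_weight


def route(score, approve_below=0.3, reject_above=0.7):
    """Map a risk score to one of the three outcomes from the pipeline diagram."""
    if score < approve_below:
        return "auto_approve"
    if score > reject_above:
        return "auto_reject"
    return "human_review"


scores = {"violence": 0.9, "adult": 0.1, "hate_symbols": 0.05}
print(route(risk_score(scores, duration_weight=0.5)))   # 0.45 falls in the review band
print(route(risk_score(scores, duration_weight=1.0)))   # 0.9 is above the reject line
```

Taking the maximum rather than the mean means a single strongly flagged category is enough to send a video to review.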

Step 3 — Frame Sampling Strategy

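import cv2


def dhash(frame, hash_size=8):
    """Difference hash: a 64-bit int built from horizontal gradients of a 9×8 grayscale thumbnail."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (hash_size + 1, hash_size))
    bits = (small[:, 1:] > small[:, :-1]).flatten()
    return sum(1 << i for i, bit in enumerate(bits) if bit)


def sample_frames(video_path, fps=1.0, max_frames=300):
    """
    Sample roughly `fps` frames per second and drop near-duplicate frames.
    The aim is to cover most of the content with minimal compute.
    """
    cap = cv2.VideoCapture(video_path)
    video_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # assumed fallback when the container reports 0
    interval = max(1, int(video_fps / fps))

    frames, prev_hash = [], None
    i = 0
    while len(frames) < max_frames:  # stop decoding once enough frames are kept
        ret, frame = cap.read()
        if not ret:
            break
        if i % interval == 0:
            h = dhash(frame)
            # Keep the frame only if it differs from the last kept frame by more than 5 of 64 bits
            if prev_hash is None or bin(h ^ prev_hash).count('1') > 5:
                frames.append(frame)
                prev_hash = h
        i += 1
    cap.release()
    return frames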

Step 4 — Human Review Optimization

  • Prioritize: sort review queue by (risk_score × video_length)
  • Context: show reviewer 3 highest-risk frames + metadata
  • Feedback loop: reviewer decisions → retrain classifier weekly
  • Active learning: add uncertain predictions (0.4-0.6 score) to training set
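
The prioritization and active-learning rules above can be sketched with a heap. `ReviewQueue` and `select_for_training` are hypothetical names, and the video IDs and scores are illustrative.

```python
import heapq


class ReviewQueue:
    """Pops the video with the highest risk_score × video_length first."""

    def __init__(self):
        self._heap = []
        self._counter = 0          # tie-breaker that keeps insertion order stable

    def push(self, video_id, risk_score, video_length_s):
        priority = risk_score * video_length_s
        # heapq is a min-heap, so store the negated priority
        heapq.heappush(self._heap, (-priority, self._counter, video_id))
        self._counter += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def select_for_training(predictions, low=0.4, high=0.6):
    """Pick videos whose score falls in the uncertain band [low, high]."""
    return [video_id for video_id, score in predictions if low <= score <= high]


review_queue = ReviewQueue()
review_queue.push("a", risk_score=0.5, video_length_s=60)     # priority 30
review_queue.push("b", risk_score=0.6, video_length_s=600)    # priority 360
review_queue.push("c", risk_score=0.4, video_length_s=100)    # priority 40
order = [review_queue.pop() for _ in range(len(review_queue))]
print(order)
print(select_for_training([("x", 0.45), ("y", 0.9), ("z", 0.6)]))
```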

Walkthrough 4: Autonomous Vehicle Perception Pipeline

Prompt: Design the perception stack for a Level 2 ADAS system.

Step 1 — Requirements

  • Sensors: 8 cameras (surround), 1 LiDAR, 4 radar
  • Latency: < 50ms end-to-end (sensors run at 30 Hz, so a new frame arrives every ~33ms)
  • Safety: must detect pedestrians at 50m with > 99.9% recall
  • Compute: embedded (NVIDIA Orin, 254 TOPS)

Step 2 — Architecture

Cameras (8×) → ISP → JPEG decode → GPU memory
LiDAR        → Point cloud → voxelization
Radar        → CFAR detection → velocity clusters

                    │
                    ▼
        ┌─── BEV Feature Extractor ───┐
        │  Camera: LSS / BEVFusion    │
        │  LiDAR:  PointPillars       │
        └─────────────────────────────┘
                    │ Bird's Eye View (BEV) feature map
                    ▼
        ┌─── 3D Object Detection ─────┐   CenterPoint / DETR3D
        ├─── Lane Detection ──────────┤   BezierLaneNet
        └─── Occupancy Prediction ────┘   Tesla Occupancy Networks
                    │
                    ▼
              Sensor Fusion           ← Kalman Filter per track
                    │
                    ▼
          HD Map + Ego Pose           ← RT-SLAM / GPS+IMU
                    │
                    ▼
          Planning Interface          ← object list + velocity vectors

Step 3 — Latency Budget (50ms)

Stage                              Budget
Sensor capture + DMA               5ms
Preprocessing (debayer, resize)    3ms
BEV feature extraction             18ms
Detection heads                    8ms
Sensor fusion + tracking           5ms
Output marshaling                  1ms
Total                              40ms (10ms margin)
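
The budget can be kept honest with a small check that sums the stages and flags any stage that overruns its allocation. This is a sketch; the dictionary keys and function names are illustrative, and the measured timings in the usage example are invented.

```python
LATENCY_BUDGET_MS = 50

STAGE_BUDGETS_MS = {
    "sensor_capture_dma": 5,
    "preprocessing": 3,
    "bev_feature_extraction": 18,
    "detection_heads": 8,
    "fusion_tracking": 5,
    "output_marshaling": 1,
}


def remaining_margin(stage_ms, budget_ms=LATENCY_BUDGET_MS):
    """Milliseconds left after summing all stages (negative means over budget)."""
    return budget_ms - sum(stage_ms.values())


def over_budget_stages(measured_ms, budgets_ms=STAGE_BUDGETS_MS):
    """Names of stages whose measured time exceeds their allocation."""
    return [stage for stage, ms in measured_ms.items() if ms > budgets_ms[stage]]


print(remaining_margin(STAGE_BUDGETS_MS))

# Invented measurement: BEV extraction runs 3 ms over its allocation
measured = dict(STAGE_BUDGETS_MS, bev_feature_extraction=21)
print(over_budget_stages(measured), remaining_margin(measured))
```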

Step 4 — Safety Considerations

  • Redundancy: radar provides independent velocity estimates
  • OOD detection: uncertainty heads on detection model; trigger conservative behavior
  • Temporal consistency: detections must be tracked ≥ 3 frames before acting on them
  • Simulation testing: 1 billion virtual miles before road testing
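
The temporal-consistency rule can be illustrated with a per-track hit counter. This sketch assumes the 3 frames must be consecutive, which the bullet above does not state, and `TrackConfirmer` is a hypothetical name.

```python
class TrackConfirmer:
    """Confirms a track only after it has been seen in min_hits consecutive frames."""

    def __init__(self, min_hits=3):
        self.min_hits = min_hits
        self._hits = {}            # track_id -> consecutive frames seen

    def update(self, detected_track_ids):
        """Feed one frame's track IDs; returns the set of confirmed track IDs."""
        detected = set(detected_track_ids)
        # Tracks missing from this frame are dropped, which resets their count
        self._hits = {tid: self._hits.get(tid, 0) + 1 for tid in detected}
        return {tid for tid, hits in self._hits.items() if hits >= self.min_hits}


confirmer = TrackConfirmer(min_hits=3)
history = [
    confirmer.update([1]),       # track 1 seen once
    confirmer.update([1, 2]),    # track 1 twice, track 2 once
    confirmer.update([1, 2]),    # track 1 reaches 3 frames
    confirmer.update([2]),       # track 1 missed and reset; track 2 reaches 3 frames
    confirmer.update([1]),       # track 1 starts again from one hit
]
print(history)
```

The cost of this filter is delay: at 30 Hz, three frames add roughly 100 ms before a new object can influence planning.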

Walkthrough 5: Medical Image Diagnosis System

Prompt: Design an AI system to assist radiologists reading chest X-rays.

Step 1 — Clarify

  • Task: multi-label classification (14 pathologies) + localization
  • Scale: 10K X-rays/day, 200 hospitals
  • Latency: < 5 seconds (radiologist sees AI result before reading)
  • Regulatory: FDA 510(k) clearance needed → explainability required

Step 2 — Architecture

DICOM Upload (hospital PACS)
    │
    ▼
DICOM Parser + Normalization     ← pydicom, window-level normalization
    │
    ▼
Quality Filter                   ← check for rotation, artifacts, exposure
    │
    ▼
DenseNet-121 (CheXNet-style)     ← pretrained on CheXpert/NIH-14
    │
    ├── 14 pathology scores       ← sigmoid output, 0-1 confidence
    │
    └── GradCAM heatmaps          ← highlight regions driving prediction
                    │
                    ▼
          Radiologist Dashboard   ← highlight boxes + confidence scores
                    │
                    ▼
          Human Decision          ← radiologist confirms/overrides
                    │
                    ▼
          Feedback Loop           ← confirmed cases → re-training dataset
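
The window-level normalization step in the diagram can be sketched as a clip-and-rescale. This is a simplified linear version and not the exact VOI LUT formula from the DICOM standard; `window_normalize` is a hypothetical helper, and the center and width would typically come from the WindowCenter and WindowWidth attributes of the file.

```python
def window_normalize(pixels, center, width):
    """
    Clip pixel values to [center - width/2, center + width/2]
    and rescale them linearly to [0, 1].
    """
    if width <= 0:
        raise ValueError("window width must be positive")
    low = center - width / 2
    high = center + width / 2
    return [(min(max(p, low), high) - low) / width for p in pixels]


# Values outside the window saturate at 0.0 or 1.0
print(window_normalize([-20, 0, 50, 100, 150], center=50, width=100))
```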

Step 3 — MLflow Experiment Tracking

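import mlflow

epochs = 50

# model, val_loader, and evaluate() are assumed to be defined by the training script
with mlflow.start_run(run_name="densenet121_chexpert_v3"):
    mlflow.log_params({"model": "DenseNet121", "pretrain": "CheXpert", "epochs": epochs})
    for epoch in range(epochs):
        # training step for this epoch goes here
        metrics = evaluate(model, val_loader)  # dict of metric name → float
        mlflow.log_metrics(metrics, step=epoch)
    # Log the final weights and register them as a new model version
    mlflow.pytorch.log_model(model, "model",
        registered_model_name="chest_xray_classifier")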

Step 4 — Regulatory Compliance

  • Explainability: include GradCAM heatmaps in the FDA submission as supporting evidence (the FDA does not mandate a specific explainability method)
  • Bias auditing: validate AUC separately for age/gender/race subgroups
  • Model versioning: every deployed model version tracked in registry
  • Shadow deployment: new model runs in parallel with existing for 30 days before replacement
  • Uncertainty quantification: MC Dropout → flag low-confidence cases for mandatory human review
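
The bias-auditing step can be sketched as a per-subgroup AUC computation. This is a plain-Python illustration using the pairwise definition of AUC (the probability that a positive case scores above a negative one, with ties counted as half); the function names and the records are illustrative, not real data.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC; returns None when a class is missing. O(P × N), fine for audits."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def auc_by_subgroup(records):
    """records: iterable of (subgroup, label, score) tuples."""
    groups = defaultdict(lambda: ([], []))
    for subgroup, label, score in records:
        groups[subgroup][0].append(label)
        groups[subgroup][1].append(score)
    return {name: auc(labels, scores) for name, (labels, scores) in groups.items()}


# Made-up records for two subgroups
records = [
    ("A", 1, 0.9), ("A", 0, 0.1), ("A", 1, 0.8), ("A", 0, 0.3),
    ("B", 1, 0.4), ("B", 0, 0.6), ("B", 1, 0.7), ("B", 0, 0.2),
]
print(auc_by_subgroup(records))
```

A large gap between subgroups, as in this toy example, is the signal the audit looks for before deployment.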