Real-Time Video Analytics System Design

Interview Question: "Design a system that processes 1,000 live camera streams to detect safety violations in a factory, with results displayed on a dashboard within 3 seconds."

Step 1: Clarify Requirements

Functional:

Ingest 1,000 RTSP camera streams at 30 FPS
Run object detection + classification per frame
Alert on violations within 3 seconds of occurrence
Store events with video clips for review
Dashboard showing live status per camera

Non-functional:

Latency: < 3 second end-to-end (ingestion → alert)
Availability: 99.9% (factory safety system)
Throughput: 1,000 streams × 30 FPS = 30,000 frames/second
Scale: eventually 10,000 cameras

Step 2: Back-of-Envelope Math

Streams: 1,000 cameras × 30 FPS = 30,000 frames/sec
Frame size: 1920×1080 × 3 bytes (BGR) = 6.2 MB raw
After H.264 decode: ~0.2 MB per frame
Total ingestion bandwidth: 30,000 × 0.2 MB = 6 GB/s raw data

YOLOv8m inference:
  - GPU: ~8ms per frame at batch_size=1
  - At batch_size=32: ~15ms → ~2,100 frames/sec per A100
  - Needed: 30,000 / 2,100 ≈ 15 A100 GPUs for real-time

Storage:
  - Events only (1% of frames): 300 frames/sec × 0.2 MB = 60 MB/s = 5 TB/day
  - Retain 30 days: 150 TB → Object storage (S3/GCS)

High-Level Architecture

                  ┌─────────────────────────────────────────────┐
                  │           Camera Network (RTSP)              │
                  │   Cam 1 ... Cam 1000 (H.264/H.265 streams)  │
                  └──────────────────┬──────────────────────────┘
                                     │ RTSP pull
                  ┌──────────────────▼──────────────────────────┐
                  │           Ingest Layer                        │
                  │  ┌───────────┐  ┌───────────┐  ┌─────────┐ │
                  │  │ Ingest-01 │  │ Ingest-02 │  │   ...   │ │
                  │  │ (FFmpeg + │  │  (FFmpeg) │  │         │ │
                  │  │ 50 cams)  │  │  50 cams  │  │  20 pods│ │
                  │  └─────┬─────┘  └─────┬─────┘  └────┬────┘ │
                  └────────┼──────────────┼──────────────┼──────┘
                           │              │              │
                  ┌────────▼──────────────▼──────────────▼──────┐
                  │           Apache Kafka                        │
                  │  Topic: frames  (1000 partitions)            │
                  │  Partition key: camera_id                    │
                  │  Retention: 2 hours                         │
                  └─────────────────────┬────────────────────────┘
                                        │
                  ┌─────────────────────▼────────────────────────┐
                  │         Inference Workers                      │
                  │  ┌─────────────┐  ┌─────────────┐           │
                  │  │ GPU Worker  │  │ GPU Worker  │  × 15 pods │
                  │  │ A100 80GB   │  │ A100 80GB   │           │
                  │  │ batch=32    │  │ batch=32    │           │
                  │  │ TensorRT    │  │ TensorRT    │           │
                  │  └──────┬──────┘  └──────┬──────┘           │
                  └─────────┼────────────────┼───────────────────┘
                            │                │
               ┌────────────▼────────────────▼────────────────┐
               │              Results Kafka Topic              │
               └──────────────────────┬─────────────────────┘
                    ┌─────────────────┼─────────────────────┐
                    ▼                 ▼                      ▼
            ┌──────────────┐  ┌────────────┐  ┌──────────────────┐
            │ Alert Service │  │  Event DB  │  │ Dashboard Service │
            │ (PagerDuty/  │  │(PostgreSQL)│  │(WebSocket → UI)  │
            │  SMS/Email)  │  │+ S3 clips  │  │                  │
            └──────────────┘  └────────────┘  └──────────────────┘

Component Deep Dives

Ingest Layer

Each ingest pod handles 50 RTSP streams using FFmpeg. Key design decisions:

# ingest_worker.py — one process per camera
import av  # PyAV — Python bindings for FFmpeg

def ingest_camera(camera_url: str, camera_id: str, kafka_producer):
    container = av.open(camera_url)
    stream = container.streams.video[0]
    stream.codec_context.skip_frame = 'NONREF'  # Only keyframes + P-frames

    frame_count = 0
    for packet in container.demux(stream):
        for frame in packet.decode():
            frame_count += 1
            # Process only every 3rd frame (10 FPS effective) to save GPU compute
            if frame_count % 3 != 0:
                continue

            # Convert to numpy and encode as JPEG (10-30× smaller than raw)
            img = frame.to_ndarray(format='bgr24')
            _, encoded = cv2.imencode('.jpg', img, [cv2.IMWRITE_JPEG_QUALITY, 85])

            kafka_producer.send('frames', key=camera_id.encode(), value={
                'camera_id': camera_id,
                'timestamp': frame.pts * stream.time_base,
                'frame': encoded.tobytes()
            })

Backpressure: Kafka consumer groups allow inference workers to pull at their own rate. If GPU workers are slow, frames back up in Kafka (retained for 2 hours). The ingest layer never blocks.

GPU Inference Worker

# inference_worker.py
from kafka import KafkaConsumer
import torch
import tensorrt as trt
from collections import defaultdict

class BatchInferenceWorker:
    def __init__(self, model_path: str, batch_size: int = 32):
        self.engine = load_trt_engine(model_path)
        self.batch_size = batch_size
        self.pending = []

    def run(self, consumer: KafkaConsumer):
        for message in consumer:
            self.pending.append(message)

            # Batch up frames from multiple cameras for GPU efficiency
            if len(self.pending) >= self.batch_size:
                self._process_batch()

    def _process_batch(self):
        frames = [decode_jpeg(m.value['frame']) for m in self.pending]
        # Preprocess: resize, normalize
        batch = preprocess_batch(frames)  # (B, 3, 640, 640) on GPU
        detections = self.engine.infer(batch)  # TensorRT inference
        # Post-process: NMS per image
        results = [apply_nms(det, conf_thresh=0.5, iou_thresh=0.45)
                   for det in detections]
        # Publish results
        for msg, result in zip(self.pending, results):
            publish_result(msg, result)
        self.pending.clear()

Dynamic batching: Don't wait for a full batch — set a max_wait_ms=20. If 32 frames arrive within 20ms, great. If only 10 arrive, process them. This bounds added latency.

Frame Sampling Strategy

Full 30 FPS is usually wasteful. Use adaptive sampling:

Static cameras (factory floor): 5-10 FPS is sufficient for violation detection
Moving cameras (pan-tilt-zoom): Use motion detection to trigger higher FPS
Event-based: Run background subtraction cheaply (CPU), only send GPU frames with detected motion

This can reduce GPU load by 5-10×.

Scalability

Horizontal Scaling

All components are stateless (Kafka decouples producers from consumers):

10 cameras → 1 ingest pod, 1 GPU worker
1,000 cameras → 20 ingest pods, 15 GPU workers  
10,000 cameras → 200 ingest pods, 150 GPU workers (linear!)

Auto-scaling

# Kubernetes HPA for inference workers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
  metrics:
  - type: External
    external:
      metric:
        name: kafka_consumer_group_lag
      target:
        type: AverageValue
        averageValue: "1000"  # Scale up if >1000 frames lag per worker

Failure Handling

Failure	Recovery
Ingest pod crashes	Kubernetes restarts in <30s; camera reconnects automatically
GPU worker crashes	Kafka offset not committed → messages re-delivered to other workers
Kafka broker fails	Replication factor=3: other brokers have the data
Model returns wrong results	Rollback via model versioning (MLflow); shadow mode deployment

Latency Budget

Camera capture:              0 ms (starting point)
H.264 encode at camera:     33 ms (1 frame at 30fps)
Network transmission:       10 ms (LAN)
Kafka ingest:               5 ms
Queue wait (max):           50 ms (at peak load)
GPU decode + preprocess:    5 ms
TensorRT inference:         8 ms
Post-process (NMS):         2 ms
Kafka result publish:       3 ms
Alert service:              5 ms
─────────────────────────────
Total:                     ~121 ms  ← Well within 3 second SLA

The 3-second SLA is easy to meet. The real engineering challenge is maintaining <200ms end-to-end at P99 under load spikes.

Monitoring & Observability

# Key metrics to track
METRICS = {
    'frame_processing_latency_p99': 'SLA alert if > 500ms',
    'kafka_consumer_lag':           'Indicates worker capacity',
    'gpu_utilization':              'Should be 70-90% at steady state',
    'gpu_memory_used':              'Alert if > 90% (OOM risk)',
    'inference_accuracy':           'Shadow-test with human labels',
    'false_positive_rate':          'Alert fatigue metric',
    'frames_per_second_processed':  'Throughput tracking',
}

Model drift detection: Deploy a periodic job that takes a stratified sample of processed frames, runs human review on 0.1%, and computes accuracy drift. Alert if accuracy drops >3% from baseline.

AI Engineer — Role-Based Learning Hub