Real-Time Video Analytics System Design
Interview Question: "Design a system that processes 1,000 live camera streams to detect safety violations in a factory, with results displayed on a dashboard within 3 seconds."
Step 1: Clarify Requirements
Functional:
- Ingest 1,000 RTSP camera streams at 30 FPS
- Run object detection + classification per frame
- Alert on violations within 3 seconds of occurrence
- Store events with video clips for review
- Dashboard showing live status per camera
Non-functional:
- Latency: < 3 second end-to-end (ingestion → alert)
- Availability: 99.9% (factory safety system)
- Throughput: 1,000 streams × 30 FPS = 30,000 frames/second
- Scale: eventually 10,000 cameras
Step 2: Back-of-Envelope Math
Streams: 1,000 cameras × 30 FPS = 30,000 frames/sec
Frame size: 1920×1080 × 3 bytes (BGR) = 6.2 MB raw
After H.264 decode: ~0.2 MB per frame
Total ingestion bandwidth: 30,000 × 0.2 MB = 6 GB/s raw data
YOLOv8m inference:
- GPU: ~8ms per frame at batch_size=1
- At batch_size=32: ~15ms → ~2,100 frames/sec per A100
- Needed: 30,000 / 2,100 ≈ 15 A100 GPUs for real-time
Storage:
- Events only (1% of frames): 300 frames/sec × 0.2 MB = 60 MB/s = 5 TB/day
- Retain 30 days: 150 TB → Object storage (S3/GCS)
High-Level Architecture
┌─────────────────────────────────────────────┐
│ Camera Network (RTSP) │
│ Cam 1 ... Cam 1000 (H.264/H.265 streams) │
└──────────────────┬──────────────────────────┘
│ RTSP pull
┌──────────────────▼──────────────────────────┐
│ Ingest Layer │
│ ┌───────────┐ ┌───────────┐ ┌─────────┐ │
│ │ Ingest-01 │ │ Ingest-02 │ │ ... │ │
│ │ (FFmpeg + │ │ (FFmpeg) │ │ │ │
│ │ 50 cams) │ │ 50 cams │ │ 20 pods│ │
│ └─────┬─────┘ └─────┬─────┘ └────┬────┘ │
└────────┼──────────────┼──────────────┼──────┘
│ │ │
┌────────▼──────────────▼──────────────▼──────┐
│ Apache Kafka │
│ Topic: frames (1000 partitions) │
│ Partition key: camera_id │
│ Retention: 2 hours │
└─────────────────────┬────────────────────────┘
│
┌─────────────────────▼────────────────────────┐
│ Inference Workers │
│ ┌─────────────┐ ┌─────────────┐ │
│ │ GPU Worker │ │ GPU Worker │ × 15 pods │
│ │ A100 80GB │ │ A100 80GB │ │
│ │ batch=32 │ │ batch=32 │ │
│ │ TensorRT │ │ TensorRT │ │
│ └──────┬──────┘ └──────┬──────┘ │
└─────────┼────────────────┼───────────────────┘
│ │
┌────────────▼────────────────▼────────────────┐
│ Results Kafka Topic │
└──────────────────────┬─────────────────────┘
┌─────────────────┼─────────────────────┐
▼ ▼ ▼
┌──────────────┐ ┌────────────┐ ┌──────────────────┐
│ Alert Service │ │ Event DB │ │ Dashboard Service │
│ (PagerDuty/ │ │(PostgreSQL)│ │(WebSocket → UI) │
│ SMS/Email) │ │+ S3 clips │ │ │
└──────────────┘ └────────────┘ └──────────────────┘
Component Deep Dives
Ingest Layer
Each ingest pod handles 50 RTSP streams using FFmpeg. Key design decisions:
# ingest_worker.py — one process per camera
import av # PyAV — Python bindings for FFmpeg
def ingest_camera(camera_url: str, camera_id: str, kafka_producer):
container = av.open(camera_url)
stream = container.streams.video[0]
stream.codec_context.skip_frame = 'NONREF' # Only keyframes + P-frames
frame_count = 0
for packet in container.demux(stream):
for frame in packet.decode():
frame_count += 1
# Process only every 3rd frame (10 FPS effective) to save GPU compute
if frame_count % 3 != 0:
continue
# Convert to numpy and encode as JPEG (10-30× smaller than raw)
img = frame.to_ndarray(format='bgr24')
_, encoded = cv2.imencode('.jpg', img, [cv2.IMWRITE_JPEG_QUALITY, 85])
kafka_producer.send('frames', key=camera_id.encode(), value={
'camera_id': camera_id,
'timestamp': frame.pts * stream.time_base,
'frame': encoded.tobytes()
})
Backpressure: Kafka consumer groups allow inference workers to pull at their own rate. If GPU workers are slow, frames back up in Kafka (retained for 2 hours). The ingest layer never blocks.
GPU Inference Worker
# inference_worker.py
from kafka import KafkaConsumer
import torch
import tensorrt as trt
from collections import defaultdict
class BatchInferenceWorker:
def __init__(self, model_path: str, batch_size: int = 32):
self.engine = load_trt_engine(model_path)
self.batch_size = batch_size
self.pending = []
def run(self, consumer: KafkaConsumer):
for message in consumer:
self.pending.append(message)
# Batch up frames from multiple cameras for GPU efficiency
if len(self.pending) >= self.batch_size:
self._process_batch()
def _process_batch(self):
frames = [decode_jpeg(m.value['frame']) for m in self.pending]
# Preprocess: resize, normalize
batch = preprocess_batch(frames) # (B, 3, 640, 640) on GPU
detections = self.engine.infer(batch) # TensorRT inference
# Post-process: NMS per image
results = [apply_nms(det, conf_thresh=0.5, iou_thresh=0.45)
for det in detections]
# Publish results
for msg, result in zip(self.pending, results):
publish_result(msg, result)
self.pending.clear()
Dynamic batching: Don't wait for a full batch — set a max_wait_ms=20. If 32 frames arrive within 20ms, great. If only 10 arrive, process them. This bounds added latency.
Frame Sampling Strategy
Full 30 FPS is usually wasteful. Use adaptive sampling:
- Static cameras (factory floor): 5-10 FPS is sufficient for violation detection
- Moving cameras (pan-tilt-zoom): Use motion detection to trigger higher FPS
- Event-based: Run background subtraction cheaply (CPU), only send GPU frames with detected motion
This can reduce GPU load by 5-10×.
Scalability
Horizontal Scaling
All components are stateless (Kafka decouples producers from consumers):
10 cameras → 1 ingest pod, 1 GPU worker
1,000 cameras → 20 ingest pods, 15 GPU workers
10,000 cameras → 200 ingest pods, 150 GPU workers (linear!)
Auto-scaling
# Kubernetes HPA for inference workers
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
spec:
metrics:
- type: External
external:
metric:
name: kafka_consumer_group_lag
target:
type: AverageValue
averageValue: "1000" # Scale up if >1000 frames lag per worker
Failure Handling
| Failure | Recovery |
|---|---|
| Ingest pod crashes | Kubernetes restarts in <30s; camera reconnects automatically |
| GPU worker crashes | Kafka offset not committed → messages re-delivered to other workers |
| Kafka broker fails | Replication factor=3: other brokers have the data |
| Model returns wrong results | Rollback via model versioning (MLflow); shadow mode deployment |
Latency Budget
Camera capture: 0 ms (starting point)
H.264 encode at camera: 33 ms (1 frame at 30fps)
Network transmission: 10 ms (LAN)
Kafka ingest: 5 ms
Queue wait (max): 50 ms (at peak load)
GPU decode + preprocess: 5 ms
TensorRT inference: 8 ms
Post-process (NMS): 2 ms
Kafka result publish: 3 ms
Alert service: 5 ms
─────────────────────────────
Total: ~121 ms ← Well within 3 second SLA
The 3-second SLA is easy to meet. The real engineering challenge is maintaining <200ms end-to-end at P99 under load spikes.
Monitoring & Observability
# Key metrics to track
METRICS = {
'frame_processing_latency_p99': 'SLA alert if > 500ms',
'kafka_consumer_lag': 'Indicates worker capacity',
'gpu_utilization': 'Should be 70-90% at steady state',
'gpu_memory_used': 'Alert if > 90% (OOM risk)',
'inference_accuracy': 'Shadow-test with human labels',
'false_positive_rate': 'Alert fatigue metric',
'frames_per_second_processed': 'Throughput tracking',
}
Model drift detection: Deploy a periodic job that takes a stratified sample of processed frames, runs human review on 0.1%, and computes accuracy drift. Alert if accuracy drops >3% from baseline.