Model Serving at Scale
"It works in training" is not enough. This doc covers everything needed to serve CV models reliably at production load.
Latency vs Throughput Tradeoff
Latency: time for a single request (p50/p95/p99)
Throughput: requests processed per second (RPS)
They're fundamentally in tension:
- Batching increases throughput (GPU utilization goes up) but adds latency (waiting to fill the batch)
- Single-request serving minimizes latency but wastes GPU (utilization may be 5%)
Throughput vs latency for YOLOv8m on A100 (latency is inference time per batch, excluding queue wait; RPS counts images per second, i.e. batch_size / latency):
batch_size   latency   RPS    GPU util
1            8 ms      125    12%
8            12 ms     667    45%
32           22 ms     1455   78%
64           35 ms     1828   90%
128          60 ms     2133   95%
Design rule: choose the largest batch size whose end-to-end latency (queue wait plus batch inference time) stays within the SLA.
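The design rule can be sketched as a small lookup over the measurements above. This is a minimal sketch: the `MEASURED` pairs are copied from the YOLOv8m/A100 table, and the function names and the `queue_delay_ms` parameter are illustrative assumptions, not part of any serving library.

```python
# (batch_size, batch inference latency in ms), copied from the table above
MEASURED = [(1, 8), (8, 12), (32, 22), (64, 35), (128, 60)]

def throughput_rps(batch_size: int, latency_ms: float) -> float:
    """Images per second when the GPU runs batches back to back."""
    return batch_size * 1000.0 / latency_ms

def pick_batch_size(sla_ms: float, queue_delay_ms: float = 0.0, measured=MEASURED):
    """Largest batch size whose worst case (queue wait + inference) fits the SLA.

    Returns None when even the smallest batch misses the SLA.
    """
    fitting = [b for b, lat in measured if queue_delay_ms + lat <= sla_ms]
    return max(fitting) if fitting else None

if __name__ == "__main__":
    print(pick_batch_size(sla_ms=30))                     # no queue wait assumed
    print(pick_batch_size(sla_ms=30, queue_delay_ms=10))  # 10 ms batching window
```

Adding the queue delay to the budget matters: the same 30 ms SLA allows a much smaller batch once a 10 ms batching window is counted.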
NVIDIA Triton Inference Server
Triton is a widely used production server for CV/ML models at scale.
Why Triton?
- Multi-framework: PyTorch (TorchScript), ONNX, TensorRT, TensorFlow, Python backends
- Dynamic batching: collects requests arriving within a configurable window and batches them automatically — no client-side batching needed
- Concurrent model instances: run N copies of the model simultaneously on one GPU
- Model pipelines: chain models (preprocessor → detector → classifier) in a single server
- gRPC + HTTP: standardized API with metrics, health checks
Configuration
model_repository/
└── yolov8_detector/
    ├── config.pbtxt
    └── 1/
        └── model.plan    (TensorRT engine)
# config.pbtxt
name: "yolov8_detector"
backend: "tensorrt"
max_batch_size: 64
input [{ name: "images" data_type: TYPE_FP32 dims: [3, 640, 640] }]
output [{ name: "output0" data_type: TYPE_FP32 dims: [-1, 8400] }]
dynamic_batching {
  preferred_batch_size: [8, 16, 32, 64]
  max_queue_delay_microseconds: 5000  # wait up to 5ms to fill batch
}
instance_group [{ kind: KIND_GPU count: 2 }] # 2 model instances per GPU
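To build intuition for how max_batch_size and max_queue_delay_microseconds interact, here is a simplified simulation. It is not Triton's scheduler: it assumes a model instance is always free, that a batch opens when its first request arrives, and that it closes when it is full or when a later request arrives after the delay has elapsed. The function name and parameters are hypothetical.

```python
def simulate_batches(arrivals_us, max_batch_size=64, max_queue_delay_us=5000):
    """Group request arrival times (microseconds) into batches.

    Simplified model: a batch is flushed when it reaches max_batch_size, or
    when the next request arrives more than max_queue_delay_us after the
    first request in the batch.
    """
    batches, current = [], []
    for t in sorted(arrivals_us):
        if current and t - current[0] > max_queue_delay_us:
            batches.append(current)  # window expired before this arrival
            current = []
        current.append(t)
        if len(current) == max_batch_size:
            batches.append(current)  # batch is full
            current = []
    if current:
        batches.append(current)      # flush the last partial batch
    return batches

if __name__ == "__main__":
    # Three requests inside one 5 ms window, then two more after a gap
    for batch in simulate_batches([0, 1000, 2000, 7000, 7500]):
        print(len(batch), batch)
```

Under light traffic most batches stay small and the delay only adds latency; under heavy traffic batches fill before the window expires, so the delay is rarely paid in full.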
Python Client
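import numpy as np
# asyncio variant: the sync client's async_infer takes a callback and cannot be awaited
import tritonclient.grpc.aio as grpcclient

# Async client for maximum throughput
async def infer(client: grpcclient.InferenceServerClient, image_batch: np.ndarray) -> np.ndarray:
    # image_batch must be float32 with shape [N, 3, 640, 640] to match config.pbtxt
    inputs = [grpcclient.InferInput("images", list(image_batch.shape), "FP32")]
    inputs[0].set_data_from_numpy(image_batch)
    outputs = [grpcclient.InferRequestedOutput("output0")]
    result = await client.infer("yolov8_detector", inputs, outputs=outputs)
    return result.as_numpy("output0")

async def main(image_batch: np.ndarray) -> np.ndarray:
    # Create the client inside a running event loop and close it when done
    client = grpcclient.InferenceServerClient("triton-server:8001")
    try:
        return await infer(client, image_batch)
    finally:
        await client.close()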
FastAPI Inference Microservice
For teams not using Triton — a clean FastAPI pattern:
# app.py
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator

import cv2
import numpy as np
import torch
from fastapi import FastAPI, File, HTTPException, UploadFile

# load_model and run_inference_batch are application-specific helpers (not
# shown): one loads the weights, the other runs one forward pass over a list
# of images and returns one result per image.

# ── Model Loading ──────────────────────────────────────────────────
model = None

@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    global model
    model = load_model("yolov8m.pt")  # loaded once at startup
    model.eval()
    worker_task = asyncio.create_task(batcher.worker())  # start the batching loop
    yield
    # Cleanup on shutdown
    worker_task.cancel()

app = FastAPI(lifespan=lifespan)

# ── Request Batching Queue ─────────────────────────────────────────
class BatchProcessor:
    def __init__(self, max_batch: int = 32, max_wait_ms: float = 10):
        self.queue: asyncio.Queue = asyncio.Queue()
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms

    async def add_request(self, img: np.ndarray) -> list:
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((img, future))
        return await future  # suspends this request until its result is ready

    async def worker(self):
        """Background task collecting and batching requests."""
        loop = asyncio.get_running_loop()
        while True:
            # Wait for the first request, then open the batching window
            img, fut = await self.queue.get()
            batch_imgs, batch_futures = [img], [fut]
            deadline = loop.time() + self.max_wait_ms / 1000

            # Collect up to max_batch requests
            while len(batch_imgs) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    img, fut = await asyncio.wait_for(self.queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                batch_imgs.append(img)
                batch_futures.append(fut)

            # Run batch inference in a thread so the event loop keeps accepting requests
            try:
                results = await asyncio.to_thread(run_inference_batch, batch_imgs)
            except Exception as exc:
                for fut in batch_futures:
                    if not fut.done():
                        fut.set_exception(exc)  # surface the failure to each caller
                continue
            for fut, res in zip(batch_futures, results):
                if not fut.done():
                    fut.set_result(res)

batcher = BatchProcessor()

@app.post("/detect")
async def detect(file: UploadFile = File(...)):
    contents = await file.read()
    img = cv2.imdecode(np.frombuffer(contents, np.uint8), cv2.IMREAD_COLOR)
    if img is None:  # imdecode returns None for undecodable input
        raise HTTPException(status_code=400, detail="could not decode image")
    detections = await batcher.add_request(img)
    return {"detections": detections}

@app.get("/health")
async def health():
    return {"status": "ok", "gpu": torch.cuda.is_available()}
Horizontal Scaling with Kubernetes
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cv-inference
spec:
  replicas: 4
  selector:
    matchLabels: { app: cv-inference }
  template:
    metadata:
      labels: { app: cv-inference }  # must match spec.selector.matchLabels
    spec:
      containers:
        - name: cv-inference
          image: myregistry/cv-inference:v2.1
          resources:
            limits:
              nvidia.com/gpu: "1"
              memory: "16Gi"
            requests:
              nvidia.com/gpu: "1"
              memory: "8Gi"
              cpu: "2"  # illustrative value; the HPA CPU Utilization target needs a CPU request
          env:
            - name: MODEL_PATH
              value: "s3://models/yolov8m-trt-fp16.engine"
          readinessProbe:
            httpGet: { path: /health, port: 8080 }
            initialDelaySeconds: 30  # Model load time
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cv-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: cv-inference
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
    - type: Pods  # custom metric; needs a metrics adapter such as Prometheus Adapter
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
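For a feel of how these two metrics drive the replica count, the sketch below applies the scaling formula from the Kubernetes HPA documentation: desired = ceil(current_replicas × current_value / target_value), taking the largest proposal across metrics and clamping to the min/max bounds. It deliberately ignores the controller's tolerance band and stabilization windows, and the function name and example metric values are made up for illustration.

```python
import math

def desired_replicas(current_replicas, metrics, min_replicas=2, max_replicas=20):
    """Replica count proposed by the basic HPA formula.

    metrics: list of (current_value, target_value) pairs, one per metric,
    e.g. (CPU utilization %, 70) and (requests per second per pod, 100).
    """
    proposals = [math.ceil(current_replicas * cur / tgt) for cur, tgt in metrics]
    # With several metrics the largest proposal wins, then bounds apply
    return max(min_replicas, min(max_replicas, max(proposals)))

if __name__ == "__main__":
    # 4 pods, CPU at 35% (target 70), 180 RPS per pod (target 100)
    print(desired_replicas(4, [(35, 70), (180, 100)]))
```

In this example CPU alone would scale the deployment down, but the per-pod request rate wins and pushes it up, which is why a GPU-bound service usually wants a request- or queue-based metric alongside CPU.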
Caching & Pre-computation
Not all frames need real-time inference. Strategic caching:
# Frame deduplication using perceptual hash
import json

import cv2
import imagehash
import numpy as np
from PIL import Image

def phash_key(frame: np.ndarray) -> str:
    """64-bit perceptual hash of a BGR frame, as a hex string."""
    pil = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    return str(imagehash.phash(pil))

class InferenceCache:
    def __init__(self, redis_client, ttl_sec: int = 2):
        self.redis = redis_client
        self.ttl = ttl_sec

    def get(self, frame: np.ndarray):
        # Exact-match lookup: only frames with an identical hash hit the cache
        cached = self.redis.get(phash_key(frame))
        if cached is not None:
            return json.loads(cached), True  # cache hit
        return None, False

    def set(self, frame: np.ndarray, result: dict):
        # result must be JSON-serializable (convert NumPy arrays to lists first)
        self.redis.setex(phash_key(frame), self.ttl, json.dumps(result))
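For static cameras, consecutive frames of an empty scene often produce the same perceptual hash, so cache hit rates of 80-90% are plausible and GPU load drops substantially. The trade-off: a 64-bit hash is coarse, so a small object entering the frame may not change it. The short TTL (2 seconds here) bounds how long a stale result can be served.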
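An exact-key lookup only matches frames whose hashes are identical. If near-duplicates should also count, one option is to compare hashes by Hamming distance against a threshold. The sketch below works on the hex strings that `str(imagehash.phash(...))` produces; the helper names are hypothetical, the threshold of 10 bits is an assumption that needs tuning per camera, and such a comparison has to run against a remembered hash (for example the last one per camera), since a key-value lookup cannot do fuzzy matching.

```python
def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

def is_near_duplicate(hash_a: str, hash_b: str, threshold: int = 10) -> bool:
    """True when the two hashes differ in at most `threshold` bits."""
    return hamming_distance(hash_a, hash_b) <= threshold

if __name__ == "__main__":
    previous = "ffff0000ffff0000"  # hash of the last frame that ran inference
    current = "ffff0000ffff0001"   # hash of the new frame
    print(hamming_distance(previous, current), is_near_duplicate(previous, current))
```

A lower threshold trades hit rate for safety: fewer frames are skipped, but fewer real scene changes are missed.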
Model Versioning & Blue-Green Deployment
# Model registry pattern (MLflow-compatible)
import random

# load_model and log_comparison are application-specific helpers (not shown)
class ModelRegistry:
    def __init__(self):
        self.versions = {
            "production": load_model("v2.1"),
            "staging": load_model("v2.2"),  # new version being validated
        }
        self.shadow_fraction = 0.05  # 5% of traffic to shadow

    def infer(self, x, *, shadow: bool = False):
        prod_result = self.versions["production"](x)
        if shadow and random.random() < self.shadow_fraction:
            # Run staging model on same input, log comparison
            staging_result = self.versions["staging"](x)
            # In production, run the shadow path off the request path so it adds no user latency
            log_comparison(prod_result, staging_result)
        return prod_result  # always return production result to user
Interview Questions
Q: Design the batching strategy for a real-time video analytics API with a 100ms SLA.
A: First, I'd establish the math: with a 100ms SLA, I'd budget at most 80ms for queuing plus inference, leaving a 20ms buffer. 100 cameras at 10 FPS each is 1,000 RPS. TensorRT YOLOv8 takes ~8ms at batch_size=1 and ~22ms at batch_size=32 (about 1,455 RPS per A100), so the queue delay can be up to 80 - 22 = 58ms. I'd set max_queue_delay to 50ms with preferred batch sizes of 8, 16, 32; at 1,000 RPS a batch of 32 fills in about 32ms, so the cap is rarely reached. Capacity: 1000/1455 ≈ 0.69, so one A100 covers steady state at roughly 69% load. I'd scale to 2 for redundancy. For burst capacity, set the HPA to trigger at 70% queue saturation.
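The arithmetic in this answer can be checked with a few lines. This is a sketch under the answer's own numbers (100 cameras, 10 FPS, about 1,455 RPS per A100 at batch_size=32, 22ms per batch, 50ms queue delay, 80ms budget); the function name and the `spare_gpus` parameter are illustrative.

```python
import math

def plan_capacity(cameras, fps, per_gpu_rps, batch_latency_ms,
                  queue_delay_ms, budget_ms, spare_gpus=1):
    """Steady-state GPU count and worst-case latency for a batched service."""
    demand_rps = cameras * fps
    gpus_for_load = math.ceil(demand_rps / per_gpu_rps)
    worst_case_ms = queue_delay_ms + batch_latency_ms
    return {
        "demand_rps": demand_rps,
        "gpu_load": demand_rps / per_gpu_rps,  # fraction of one GPU's throughput
        "gpus": gpus_for_load + spare_gpus,    # load plus redundancy
        "worst_case_ms": worst_case_ms,
        "fits_budget": worst_case_ms <= budget_ms,
    }

if __name__ == "__main__":
    plan = plan_capacity(cameras=100, fps=10, per_gpu_rps=1455,
                         batch_latency_ms=22, queue_delay_ms=50, budget_ms=80)
    for name, value in plan.items():
        print(name, value)
```

The worst case of 72ms fits the 80ms budget, and one GPU carries the load with one spare for redundancy.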
Q: How would you do A/B testing for a new model version in production?
A: I'd start with a shadow deployment: mirror 5% of production traffic to the new model and compare its outputs with the production model's, but always return production results to users. Monitor mAP on a labeled evaluation set, p99 latency, GPU memory, and false positive rate. If metrics look good after 24 hours, promote to a canary (10% of traffic actually served by the new model) and watch user-visible metrics such as false alert rate. If those hold for 48 hours, do the full rollout via blue-green: spin up the new deployment, switch the load balancer, and keep the old deployment on standby for 24 hours in case of rollback.
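For the canary stage, one common way to split traffic is to hash a stable request key so that the same source always lands on the same model version. The sketch below is one possible approach, not something the answer prescribes: the key format (a camera ID), the bucket count, and the function names are all assumptions.

```python
import hashlib

def bucket(request_key: str, buckets: int = 100) -> int:
    """Map a key to a stable bucket in [0, buckets) using SHA-256."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def route(request_key: str, canary_percent: int = 10) -> str:
    """Send roughly canary_percent% of keys to the canary, the rest to production."""
    return "canary" if bucket(request_key) < canary_percent else "production"

if __name__ == "__main__":
    keys = [f"camera-{i}" for i in range(1000)]  # hypothetical camera IDs
    share = sum(route(k) == "canary" for k in keys) / len(keys)
    print(f"canary share: {share:.1%}")
```

Hashing instead of calling `random.random()` keeps each camera on one version for the whole canary window, which makes per-camera metrics such as false alert rate comparable.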
Q: How do you handle model warm-up in production?
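A: A freshly loaded model is slow on its first few inferences, often 2-5× slower than steady state, because the first calls pay one-time costs: CUDA context creation, lazy kernel loading, cuDNN algorithm selection, and memory allocator growth. Best practices: (1) gate the Kubernetes readiness probe on warm-up having finished, rather than relying only on a fixed 30s+ delay after container start, (2) run N warm-up inferences during lifespan startup, at the input shapes and batch sizes production will see, before marking the service as ready, (3) prefer a prebuilt TensorRT engine, which selects its kernels at build time and leaves only engine loading and a short first-run cost at startup. torch.compile behaves differently: it compiles on the first call, so it makes the first inference slower and increases the need for warm-up.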
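Practice (2) can be sketched as a small helper that runs warm-up calls and records their latencies. This is a framework-agnostic sketch: `infer` and `make_input` are hypothetical callables supplied by the service, and the stand-in model in the usage example only simulates work with `time.sleep`.

```python
import time

def warm_up(infer, make_input, runs: int = 10):
    """Call infer(make_input()) `runs` times and return each call's latency in ms.

    For GPU models, infer should block until the result is ready (for example
    by calling torch.cuda.synchronize() inside it); otherwise the timings only
    measure how long it takes to enqueue the work.
    """
    latencies_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer(make_input())
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms

if __name__ == "__main__":
    def fake_infer(batch):
        time.sleep(0.001)  # stand-in for a real forward pass
        return batch

    latencies = warm_up(fake_infer, make_input=lambda: [0.0] * 8, runs=5)
    print([f"{ms:.2f} ms" for ms in latencies])
```

A service would typically call this during startup and only then set a ready flag that the health endpoint reports, so the readiness probe fails until warm-up is done.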