Model Serving at Scale

"It works in training" is not enough. This doc covers everything needed to serve CV models reliably at production load.


Latency vs Throughput Tradeoff

Latency: time for a single request (p50/p95/p99)
Throughput: requests processed per second (RPS)

They're fundamentally in tension:

  • Batching increases throughput (GPU utilization goes up) but adds latency (requests wait while the batch fills)
  • Single-request serving minimizes latency but leaves the GPU mostly idle (about 12% utilization at batch_size=1 in the measurements below)

Throughput vs Latency for YOLOv8m on A100:

batch_size=1:   8ms latency,  125 RPS,  GPU util=12%
batch_size=8:   12ms latency, 667 RPS,  GPU util=45%
batch_size=32:  22ms latency, 1455 RPS, GPU util=78%
batch_size=64:  35ms latency, 1829 RPS, GPU util=90%
batch_size=128: 60ms latency, 2133 RPS, GPU util=95%

Design rule: Choose the largest batch size whose end-to-end latency (queueing delay plus inference time) stays within the SLA.
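The design rule can be written as a small selection function. This is a minimal sketch: the latencies are copied from the table above, while `pick_batch_size` and the `queue_delay_ms` parameter are illustrative names, not part of any serving framework.

```python
# Inference latency per batch size (ms), copied from the YOLOv8m / A100 table
LATENCY_MS = {1: 8, 8: 12, 32: 22, 64: 35, 128: 60}

def pick_batch_size(latency_ms_by_batch, sla_ms, queue_delay_ms=0.0):
    """Return the largest batch size whose latency plus queueing delay fits the SLA, or None."""
    fitting = [
        batch for batch, latency_ms in latency_ms_by_batch.items()
        if latency_ms + queue_delay_ms <= sla_ms
    ]
    return max(fitting) if fitting else None

print(pick_batch_size(LATENCY_MS, sla_ms=25))                    # 32
print(pick_batch_size(LATENCY_MS, sla_ms=25, queue_delay_ms=5))  # 8
```

Note how a 5ms queueing delay pushes the choice from 32 down to 8: the time spent waiting for a batch to fill counts against the same budget as inference.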


NVIDIA Triton Inference Server

Triton is a widely used production server for CV/ML models at scale.

Why Triton?

  • Multi-framework: PyTorch (TorchScript), ONNX, TensorRT, TensorFlow, Python backends
  • Dynamic batching: collects requests arriving within a configurable window and batches them automatically — no client-side batching needed
  • Concurrent model instances: run N copies of the model simultaneously on one GPU
  • Model pipelines: chain models (preprocessor → detector → classifier) in a single server
  • gRPC + HTTP: standardized API with metrics, health checks
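The health checks in the last bullet are plain HTTP endpoints, so they can be polled without the Triton client library. The sketch below assumes Triton's HTTP port is reachable at the `base_url` you pass in (8000 is the default HTTP port); `http_status` and `triton_health` are illustrative helper names.

```python
import urllib.error
import urllib.request

def http_status(url, timeout=2.0):
    """Return the HTTP status code for a GET, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None

def triton_health(base_url, model_name, get_status=http_status):
    """Check server liveness, server readiness, and readiness of one model."""
    endpoints = {
        "live": f"{base_url}/v2/health/live",
        "ready": f"{base_url}/v2/health/ready",
        "model_ready": f"{base_url}/v2/models/{model_name}/ready",
    }
    return {name: get_status(url) == 200 for name, url in endpoints.items()}
```

For example, `triton_health("http://triton-server:8000", "yolov8_detector")` returns a dict of three booleans; `get_status` is injectable so the function can be exercised without a running server.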

Configuration

model_repository/
└── yolov8_detector/
    ├── config.pbtxt
    └── 1/
        └── model.plan  (TensorRT engine)

# config.pbtxt
name: "yolov8_detector"
backend: "tensorrt"
max_batch_size: 64

input [{ name: "images" data_type: TYPE_FP32 dims: [3, 640, 640] }]
output [{ name: "output0" data_type: TYPE_FP32 dims: [-1, 8400] }]

dynamic_batching {
  preferred_batch_size: [8, 16, 32, 64]
  max_queue_delay_microseconds: 5000  # wait up to 5ms to fill batch
}

instance_group [{ kind: KIND_GPU count: 2 }]  # 2 model instances per GPU
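It helps to sanity-check the queue delay against the expected traffic. With a delay configured, Triton holds a batch until a preferred size can be formed or the delay expires, so the number of requests arriving inside one delay window bounds the batch you actually get. The sketch below is a back-of-the-envelope estimate that assumes evenly spaced arrivals and a single model instance; both function names are illustrative.

```python
def expected_batch_size(arrival_rps, max_queue_delay_us, max_batch_size):
    """Rough number of requests that accumulate during one queue-delay window."""
    arrivals = arrival_rps * max_queue_delay_us / 1_000_000
    return max(1.0, min(float(max_batch_size), arrivals))

def worst_case_latency_ms(max_queue_delay_us, inference_ms):
    """Upper bound on latency: full queue delay plus one batch inference."""
    return max_queue_delay_us / 1000 + inference_ms

# 1,000 RPS with the 5ms window from config.pbtxt
print(expected_batch_size(1000, 5000, 64))  # 5.0
print(worst_case_latency_ms(5000, 22))      # 27.0
```

At 1,000 RPS the 5ms window collects about 5 requests, which is below the smallest preferred size of 8. Under these assumptions, either a longer delay or a higher arrival rate is needed before the larger preferred sizes are reached.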

Python Client

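import asyncio
import numpy as np
# asyncio variant of the client: async_infer() on the plain gRPC client is
# callback-based and cannot be awaited
import tritonclient.grpc.aio as grpcclient

# Asyncio client for maximum throughput
async def infer(client: grpcclient.InferenceServerClient, image_batch: np.ndarray) -> np.ndarray:
    # image_batch: float32 array of shape (N, 3, 640, 640), N <= max_batch_size
    inputs = [grpcclient.InferInput("images", list(image_batch.shape), "FP32")]
    inputs[0].set_data_from_numpy(image_batch)
    outputs = [grpcclient.InferRequestedOutput("output0")]
    result = await client.infer("yolov8_detector", inputs, outputs=outputs)
    return result.as_numpy("output0")

async def main():
    # Create the client inside the running event loop
    client = grpcclient.InferenceServerClient("triton-server:8001")
    try:
        batch = np.zeros((1, 3, 640, 640), dtype=np.float32)
        print((await infer(client, batch)).shape)
    finally:
        await client.close()

asyncio.run(main())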
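To get the throughput benefit of an async client, many requests have to be in flight at once, but an unbounded fan-out can flood the server queue. A minimal sketch of bounded concurrency follows; `infer_many` and `max_in_flight` are illustrative names, and `infer_fn` stands for any coroutine that takes one batch (for instance the client call above wrapped with `functools.partial`).

```python
import asyncio

async def infer_many(infer_fn, batches, max_in_flight=8):
    """Run infer_fn over all batches, keeping at most max_in_flight requests outstanding."""
    semaphore = asyncio.Semaphore(max_in_flight)

    async def bounded(batch):
        async with semaphore:
            return await infer_fn(batch)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(b) for b in batches))
```

A reasonable starting point for `max_in_flight` is a small multiple of the server's batch size, so the dynamic batcher always has enough queued requests to fill a batch.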

FastAPI Inference Microservice

For teams not using Triton — a clean FastAPI pattern:

# app.py
import asyncio
from contextlib import asynccontextmanager
from typing import AsyncIterator

import cv2
import numpy as np
import torch
from fastapi import FastAPI, File, HTTPException, UploadFile

# load_model() and run_inference_batch() are placeholders for your own
# model-loading and batched-inference code; they are not defined here.

# ── Model Loading ──────────────────────────────────────────────────
model = None

@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncIterator[None]:
    global model
    model = load_model("yolov8m.pt")  # loaded once at startup
    model.eval()
    # Start the batching loop; without it, add_request() never returns
    worker_task = asyncio.create_task(batcher.worker())
    yield
    # Cleanup on shutdown
    worker_task.cancel()

app = FastAPI(lifespan=lifespan)

# ── Request Batching Queue ─────────────────────────────────────────
class BatchProcessor:
    def __init__(self, max_batch: int = 32, max_wait_ms: float = 10):
        self.queue: asyncio.Queue = asyncio.Queue()
        self.max_batch = max_batch
        self.max_wait_ms = max_wait_ms

    async def add_request(self, img: np.ndarray) -> list:
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((img, future))
        return await future  # suspends this request until its result is ready

    async def worker(self):
        """Background task collecting and batching requests."""
        loop = asyncio.get_running_loop()
        while True:
            # Wait for the first request, then open the batching window
            img, fut = await self.queue.get()
            batch_imgs, batch_futures = [img], [fut]
            deadline = loop.time() + self.max_wait_ms / 1000

            # Collect up to max_batch requests until the window closes
            while len(batch_imgs) < self.max_batch:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    img, fut = await asyncio.wait_for(self.queue.get(), timeout)
                except asyncio.TimeoutError:
                    break
                batch_imgs.append(img)
                batch_futures.append(fut)

            # Run batch inference in a thread so the event loop keeps
            # accepting requests while the GPU is busy
            try:
                results = await asyncio.to_thread(run_inference_batch, batch_imgs)
            except Exception as exc:
                # Fail every waiting request instead of leaving it hanging
                for fut in batch_futures:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for fut, res in zip(batch_futures, results):
                if not fut.done():  # the caller may have disconnected
                    fut.set_result(res)

batcher = BatchProcessor()

@app.post("/detect")
async def detect(file: UploadFile = File(...)):
    contents = await file.read()
    img = cv2.imdecode(np.frombuffer(contents, np.uint8), cv2.IMREAD_COLOR)
    if img is None:  # imdecode returns None for data it cannot decode
        raise HTTPException(status_code=400, detail="Could not decode image")
    detections = await batcher.add_request(img)
    return {"detections": detections}

@app.get("/health")
async def health():
    return {"status": "ok", "gpu": torch.cuda.is_available()}
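To see how the batching queue groups requests, it can help to run the same pattern with no web framework or GPU. The sketch below is a stripped-down, standard-library-only variant: `MicroBatcher`, `fake_model`, and `demo` are illustrative names, and the fake model simply doubles its inputs.

```python
import asyncio

class MicroBatcher:
    def __init__(self, infer_batch, max_batch=4, max_wait_ms=20.0):
        self.infer_batch = infer_batch
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue = asyncio.Queue()

    async def submit(self, item):
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((item, fut))
        return await fut

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            # Wait for the first item, then collect more until the window closes
            pending = [await self.queue.get()]
            deadline = loop.time() + self.max_wait
            while len(pending) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    pending.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            results = self.infer_batch([item for item, _ in pending])
            for (_, fut), res in zip(pending, results):
                fut.set_result(res)

async def demo():
    batch_sizes = []

    def fake_model(batch):  # stand-in for real batched inference
        batch_sizes.append(len(batch))
        return [x * 2 for x in batch]

    batcher = MicroBatcher(fake_model, max_batch=4, max_wait_ms=20)
    worker = asyncio.create_task(batcher.run())
    outputs = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return outputs, batch_sizes

out, sizes = asyncio.run(demo())
print(sizes)  # [4, 4, 2]
```

Ten concurrent requests with `max_batch=4` are served as two full batches plus a final batch of two, which is dispatched only after the 20ms window expires. That last partial batch is where the latency cost of batching shows up.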

Horizontal Scaling with Kubernetes

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cv-inference
spec:
  replicas: 4
  selector:
    matchLabels: { app: cv-inference }
  template:
    metadata:
      labels: { app: cv-inference }  # must match spec.selector.matchLabels
    spec:
      containers:
      - name: cv-inference
        image: myregistry/cv-inference:v2.1
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "16Gi"
          requests:
            nvidia.com/gpu: "1"
            memory: "8Gi"
            cpu: "2"  # example value; the HPA's CPU Utilization target is computed against this request
        env:
        - name: MODEL_PATH
          value: "s3://models/yolov8m-trt-fp16.engine"
        readinessProbe:
          httpGet: { path: /health, port: 8080 }
          initialDelaySeconds: 30  # Model load time
---
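apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: cv-inference
spec:
  scaleTargetRef:  # the Deployment this autoscaler controls
    apiVersion: apps/v1
    kind: Deployment
    name: cv-inference
  minReplicas: 2
  maxReplicas: 20
  metrics:
  # CPU is only a proxy for load on a GPU-bound service; Utilization is
  # measured against the container's CPU request, so one must be set
  - type: Resource
    resource:
      name: cpu
      target: { type: Utilization, averageUtilization: 70 }
  # Custom per-pod metric; requires a metrics adapter (e.g. Prometheus Adapter)
  - type: Pods
    pods:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: "100"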
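The replica count the HPA settles on follows a simple documented formula: desired = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1 (0.1 by default). With several metrics, the largest result wins. The sketch below reproduces that arithmetic for one metric; `desired_replicas` is an illustrative function, with the bounds taken from the manifest above.

```python
import math

def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=2, max_replicas=20, tolerance=0.1):
    """Replica count the HPA formula yields for a single metric."""
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# 4 pods each seeing 150 RPS against the 100 RPS target
print(desired_replicas(4, 150, 100))  # 6
```

This makes the per-pod target easy to reason about: set `averageValue` to the RPS one replica can sustain within the latency SLA, not to its peak throughput.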

Caching & Pre-computation

Not all frames need real-time inference. Strategic caching:

# Frame deduplication using perceptual hash
import json

import cv2
import imagehash
import numpy as np
from PIL import Image

def phash_key(frame: np.ndarray) -> str:
    # Perceptual hash as a hex string. Redis lookup is an exact match, so only
    # frames with identical hashes share a key. With several cameras, prefix
    # the key with a camera ID so scenes do not collide.
    pil = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    return str(imagehash.phash(pil))

class InferenceCache:
    def __init__(self, redis_client, ttl_sec: int = 2):
        self.redis = redis_client
        self.ttl = ttl_sec

    def get(self, frame: np.ndarray):
        key = phash_key(frame)
        cached = self.redis.get(key)
        if cached is not None:
            return json.loads(cached), True  # cache hit
        return None, False

    def set(self, frame: np.ndarray, result: dict):
        # result must be JSON-serializable (lists and floats, not numpy arrays)
        key = phash_key(frame)
        self.redis.setex(key, self.ttl, json.dumps(result))

For static cameras, consecutive frames of an empty scene are nearly identical and often produce the same perceptual hash. Cache hit rates can reach 80-90% in such scenes, which cuts GPU load substantially.
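Exact-match lookup misses frames whose hashes differ by only a few bits (sensor noise, compression). One way to tolerate that is to compare hashes by Hamming distance. The sketch below is an in-memory illustration operating on hex hash strings such as those produced by `str(imagehash.phash(...))`; `NearDuplicateCache` and `hamming_hex` are illustrative names, and the linear scan only suits a small number of live entries (for example, one camera with a short TTL).

```python
import time

def hamming_hex(a, b):
    """Number of differing bits between two equal-length hex hash strings."""
    return bin(int(a, 16) ^ int(b, 16)).count("1")

class NearDuplicateCache:
    """Treats hashes within max_distance bits of each other as the same frame."""

    def __init__(self, max_distance=10, ttl_sec=2.0, clock=time.monotonic):
        self.max_distance = max_distance
        self.ttl = ttl_sec
        self.clock = clock
        self.entries = []  # (hash_hex, expires_at, result)

    def get(self, hash_hex):
        now = self.clock()
        # Drop expired entries, then scan the few that remain
        self.entries = [e for e in self.entries if e[1] > now]
        for stored_hash, _expires_at, result in self.entries:
            if hamming_hex(stored_hash, hash_hex) <= self.max_distance:
                return result, True  # cache hit
        return None, False

    def set(self, hash_hex, result):
        self.entries.append((hash_hex, self.clock() + self.ttl, result))
```

The distance threshold trades hit rate against the risk of reusing a stale result for a frame that actually changed, so it is worth tuning on recorded footage before relying on it.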


Model Versioning & Blue-Green Deployment

# Model registry pattern (MLflow-compatible)
# load_model() and log_comparison() are placeholders for your own code.
import random

class ModelRegistry:
    def __init__(self):
        self.versions = {
            "production": load_model("v2.1"),
            "staging":    load_model("v2.2"),  # new version being validated
        }
        self.shadow_fraction = 0.05  # 5% of traffic to shadow

    def infer(self, x, *, shadow: bool = False):
        prod_result = self.versions["production"](x)

        if shadow and random.random() < self.shadow_fraction:
            # Run staging model on same input, log comparison.
            # As written this runs inline and adds latency; in production,
            # hand it to a background worker so shadow traffic never delays
            # the response.
            staging_result = self.versions["staging"](x)
            log_comparison(prod_result, staging_result)

        return prod_result  # always return production result to user
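What the comparison step logs is left open above. For a detector, one simple signal is the fraction of production boxes that the staging model also finds. The sketch below assumes each result is a list of `(x1, y1, x2, y2)` boxes; `iou` and `agreement` are illustrative helpers, and the greedy matching is a simplification of what a full evaluation would do.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def agreement(prod_boxes, staging_boxes, iou_threshold=0.5):
    """Fraction of production boxes matched by a distinct staging box."""
    if not prod_boxes:
        return 1.0 if not staging_boxes else 0.0
    unmatched = list(staging_boxes)
    matched = 0
    for p in prod_boxes:
        # Greedy: take the best remaining staging box for this production box
        best = max(unmatched, key=lambda s: iou(p, s), default=None)
        if best is not None and iou(p, best) >= iou_threshold:
            unmatched.remove(best)
            matched += 1
    return matched / len(prod_boxes)

prod = [(0, 0, 10, 10), (20, 20, 30, 30)]
staging = [(1, 1, 10, 10), (50, 50, 60, 60)]
print(agreement(prod, staging))  # 0.5
```

Low agreement is not automatically bad, since the staging model may be correcting production errors, but a sudden drop is a cheap trigger for closer inspection.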

Interview Questions

Q: Design the batching strategy for a real-time video analytics API with a 100ms SLA.

A: First, I'd establish the budget: with a 100ms SLA, I'd allow at most 80ms for queuing plus processing and keep the remaining 20ms as buffer. TensorRT YOLOv8m takes ~8ms at batch_size=1 and ~22ms at batch_size=32, so I'd set max_queue_delay_microseconds to 50000 (50ms) with preferred batch sizes of 8, 16, 32. Worst case is 50ms + 22ms = 72ms, inside the 80ms budget. For load, 100 cameras at 10 FPS each is 1,000 RPS. At batch_size=32 one A100 sustains ~1,455 RPS, so steady state needs 1000/1455 ≈ 0.69 of one GPU, which rounds up to 1 A100. I'd run 2 for redundancy. For burst capacity, set the HPA to scale out at 70% queue saturation.
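The arithmetic in this answer is easy to get wrong under interview pressure, so here it is as a small calculator. The inputs are the numbers quoted in the answer; `plan_capacity` is an illustrative function name.

```python
import math

def plan_capacity(cameras, fps, batch_size, batch_latency_ms,
                  queue_delay_ms, budget_ms):
    """Demand, per-GPU throughput, GPU count, and worst-case latency check."""
    demand_rps = cameras * fps
    per_gpu_rps = batch_size / (batch_latency_ms / 1000)
    worst_case_ms = queue_delay_ms + batch_latency_ms
    return {
        "demand_rps": demand_rps,
        "per_gpu_rps": round(per_gpu_rps),
        "gpus_needed": math.ceil(demand_rps / per_gpu_rps),
        "worst_case_ms": worst_case_ms,
        "within_budget": worst_case_ms <= budget_ms,
    }

plan = plan_capacity(cameras=100, fps=10, batch_size=32,
                     batch_latency_ms=22, queue_delay_ms=50, budget_ms=80)
print(plan)
```

The result gives the minimum GPU count for steady state; redundancy and burst headroom are added on top, as the answer does by running 2.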

Q: How would you do A/B testing for a new model version in production?

A: I'd start with a shadow deployment: mirror 5% of production traffic to the new model and compare its outputs with the production model's, but always return production results to users. Monitor mAP on a labeled evaluation set, p99 latency, GPU memory, and false positive rate. If metrics look good after 24 hours, promote to a canary (10% of traffic actually served by the new model) and watch user-visible metrics such as false alert rate. If those hold for 48 hours, do a full rollout via blue-green: spin up the new deployment, switch the load balancer, and keep the old deployment on standby for 24 hours in case a rollback is needed.
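For the canary stage, routing by a hash of a stable key (a camera or user ID) keeps each source pinned to one model version, which makes user-visible metrics easier to attribute than random per-request routing. A minimal sketch follows; `bucket` and `route` are illustrative names and the key format is an assumption.

```python
import hashlib

BUCKETS = 10_000

def bucket(request_key):
    """Map a key to a stable bucket in [0, BUCKETS)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS

def route(request_key, canary_fraction):
    """Return 'canary' for roughly canary_fraction of keys, else 'production'."""
    return "canary" if bucket(request_key) < canary_fraction * BUCKETS else "production"

share = sum(route(f"camera-{i}", 0.10) == "canary" for i in range(10_000)) / 10_000
print(f"canary share: {share:.3f}")
```

Because the bucket for a key never changes, raising the fraction from 10% to 50% only adds keys to the canary; nothing that was already on the new model is moved back.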

Q: How do you handle model warm-up in production?

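A: A freshly loaded model is cold: the first few inferences are 2-5× slower than steady state because lazy initialization (CUDA context creation, kernel loading or JIT compilation, memory allocation) happens on first run. Best practices: (1) set the Kubernetes readiness probe to a 30s+ delay after container start, (2) run N warmup inferences during lifespan startup before marking the service as ready, (3) serve a TensorRT engine, whose kernels are selected and compiled at build time, which removes most of the first-inference lag. Note that torch.compile compiles on the first call rather than at build time, so it makes the first inference slower and makes warmup more important, not less.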
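Practice (2) can be as simple as a loop run at startup before the readiness flag flips. The sketch below is framework-agnostic: `warm_up` is an illustrative name, and `infer_fn` stands for whatever synchronous call runs one inference. For GPU models the call should include device synchronization (for example `torch.cuda.synchronize()`), otherwise the timings only measure kernel launch.

```python
import time

def warm_up(infer_fn, sample_input, runs=10):
    """Call infer_fn `runs` times and return each call's latency in milliseconds."""
    timings_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        infer_fn(sample_input)
        timings_ms.append((time.perf_counter() - start) * 1000)
    return timings_ms
```

Logging the returned timings is useful in itself: the first entries show the cold-start penalty, and the last ones give a baseline for steady-state latency on that pod. Use a sample input with the same shape and batch size as production traffic, since some backends re-tune per input shape.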