Phase 7 — MLOps & Production Deployment

Weeks 17-18 of 20 | Bridge from research to production

What This Phase Covers

This phase teaches the entire model lifecycle from training to production: exporting models to portable formats, building inference APIs, containerizing with Docker, and tracking experiments with MLflow. These skills separate ML engineers who can train models from those who can deploy and maintain them.

Labs

#    Lab                            Core Skills
01   ONNX Export & Optimization     torch.onnx.export, ONNX graph, TensorRT, FP16/INT8
02   FastAPI Inference Server       async API, dynamic batching, Prometheus metrics
03   Docker Deployment              Dockerfile, nvidia-docker, multi-stage build
04   MLflow Experiment Tracking     runs, artifacts, model registry, autolog

Why MLOps Matters for Interviews

Most CV engineers have trained models; fewer have shipped them. Interviewers at companies like Tesla, NVIDIA, and Apple often probe deployment experience with questions such as:

  • "How would you deploy this model to serve 10,000 requests/second?"
  • "How do you roll back a bad model version?"
  • "How do you detect when your model's accuracy degrades in production?"
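
One common answer to the rollback question is to treat model versions as immutable and "production" as a movable pointer, so rolling back means moving the pointer rather than redeploying code. The sketch below illustrates that idea with a hypothetical in-memory ModelRegistry class. The class, its method names, and the artifact paths are assumptions made for illustration; they are not the MLflow model registry API.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelRegistry:
    """Hypothetical in-memory registry: versions never change, production is a pointer."""
    versions: dict = field(default_factory=dict)   # version number -> artifact URI
    history: list = field(default_factory=list)    # promoted versions, newest last

    def register(self, uri: str) -> int:
        """Store a new immutable version and return its number."""
        version = len(self.versions) + 1
        self.versions[version] = uri
        return version

    def promote(self, version: int) -> None:
        """Point production at an already registered version."""
        if version not in self.versions:
            raise KeyError(f"unknown version {version}")
        self.history.append(version)

    @property
    def production(self) -> Optional[int]:
        return self.history[-1] if self.history else None

    def rollback(self) -> int:
        """Drop the current production version and restore the previous one."""
        if len(self.history) < 2:
            raise RuntimeError("no earlier production version to roll back to")
        self.history.pop()
        return self.history[-1]


registry = ModelRegistry()
v1 = registry.register("models/detector-v1.onnx")   # illustrative paths
v2 = registry.register("models/detector-v2.onnx")
registry.promote(v1)
registry.promote(v2)          # v2 turns out to be bad in production
restored_version = registry.rollback()
```

Because the bad version stays registered, it remains available for debugging after the rollback; only the production pointer changes.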

Deployment Stack Overview

Model (PyTorch .pth)
    │
    ├─→ ONNX (.onnx)        ← portable, framework-agnostic
    │       └─→ TensorRT    ← NVIDIA GPU optimized engine
    │
    └─→ FastAPI server      ← REST API for inference
            └─→ Docker      ← containerized, reproducible
                    └─→ Kubernetes (k8s)  ← orchestration at scale
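
At the inference-server stage of the stack above, dynamic batching groups concurrent requests so the model runs one batched forward pass instead of many single ones. The following is a minimal sketch using only asyncio, not the lab's FastAPI code. DynamicBatcher, fake_model, and the parameter names max_batch_size and max_wait_s are hypothetical names chosen for this example.

```python
import asyncio


class DynamicBatcher:
    """Collects requests until max_batch_size is reached or max_wait_s elapses."""

    def __init__(self, predict_batch, max_batch_size=8, max_wait_s=0.01):
        self.predict_batch = predict_batch   # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()
        self.batch_sizes = []                # recorded so batching can be inspected

    async def submit(self, item):
        """Called once per request; resolves when the item's batch has been processed."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Background worker: build a batch, run the model once, fan results back out."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]             # wait for the first request
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            # Synchronous call for simplicity; it blocks the event loop while running.
            outputs = self.predict_batch([item for item, _ in batch])
            self.batch_sizes.append(len(batch))
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


def fake_model(inputs):
    # Stand-in for a real batched forward pass (assumption for this sketch).
    return [x * 2 for x in inputs]


async def main():
    batcher = DynamicBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results, batcher.batch_sizes


results, batch_sizes = asyncio.run(main())
```

The two knobs trade latency for throughput: a larger max_wait_s fills batches more fully but adds waiting time to every request. In a real server the model call would likely run in an executor or a separate process so it does not block the event loop.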

GPU/TPU Relevance

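  • TensorRT: compiles models into GPU-specific engines that can run in FP16 or INT8; ~3-5× speedup over FP32 on NVIDIA GPUs is a rough guide, and the actual gain depends on the model, GPU, and batch size
  • Triton Inference Server: production-grade model serving with dynamic batching on GPU clusters
  • Quantization-aware training (QAT): simulates INT8 quantization during training so the model tolerates it after export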
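
To make the INT8 idea in the bullets above concrete, here is a small pure-Python sketch of symmetric per-tensor quantization: each value is divided by a scale, rounded to an integer in [-127, 127], and later multiplied back. Picking the scale from the maximum absolute value is a simplifying assumption; production toolchains typically choose scales through calibration data or QAT. The function names are illustrative.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: q = round(x / scale), clipped to [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map INT8 codes back to approximate floating-point values."""
    return [q * scale for q in quantized]


weights = [0.02, -1.27, 0.64, 0.0, 1.0]   # made-up example weights
q_weights, scale = quantize_int8(weights)
approx_weights = dequantize(q_weights, scale)

# Rounding to the nearest integer code bounds the error by half a quantization step.
max_error = max(abs(w - a) for w, a in zip(weights, approx_weights))
```

The error bound of scale / 2 per value shows why outliers hurt: one large weight inflates the scale and coarsens the grid for every other value in the tensor.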

Monitoring Checklist

  • Inference latency (p50, p95, p99)
  • GPU utilization and memory
  • Request throughput (RPS)
  • Prediction distribution drift
  • Error rates (5xx, timeout)
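
Two items on the checklist above reduce to short calculations. The sketch below computes latency percentiles with the nearest-rank method and measures prediction distribution drift with the Population Stability Index (PSI). The sample data is made up, and the choice of PSI as the drift metric is an assumption; other metrics such as KL divergence are also used in practice.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two binned distributions given as proportions."""
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))   # eps guards against empty bins
        for e, a in zip(expected, actual)
    )


# Made-up latency samples in milliseconds: 1, 2, ..., 100.
latencies_ms = list(range(1, 101))
p50, p95, p99 = (percentile(latencies_ms, p) for p in (50, 95, 99))

# Made-up predicted-class proportions at training time vs. in production.
baseline_classes = [0.5, 0.3, 0.2]
live_classes = [0.2, 0.3, 0.5]
drift = psi(baseline_classes, live_classes)
```

A PSI near zero means the live prediction mix matches the baseline. A frequently quoted rule of thumb treats values above roughly 0.25 as a significant shift, though a sensible alert threshold depends on the application.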