Phase 7 — MLOps & Production Deployment

Weeks 17-18 of 20 | Bridge from research to production

What This Phase Covers

This phase teaches the entire model lifecycle from training to production: exporting models to portable formats, building inference APIs, containerizing with Docker, and tracking experiments with MLflow. These skills separate ML engineers who can train models from those who can deploy and maintain them.

Labs

#	Lab	Core Skills
01	ONNX Export & Optimization	`torch.onnx.export`, ONNX graph, TensorRT, FP16/INT8
02	FastAPI Inference Server	async API, dynamic batching, Prometheus metrics
03	Docker Deployment	Dockerfile, NVIDIA Container Toolkit (formerly nvidia-docker), multi-stage build
04	MLflow Experiment Tracking	runs, artifacts, model registry, autolog

Why MLOps Matters for Interviews

Most CV engineers have trained models; fewer have shipped them. Interviewers at companies like Tesla, NVIDIA, and Apple often probe deployment experience with questions such as:

"How would you deploy this model to serve 10,000 requests/second?"
"How do you roll back a bad model version?"
"How do you detect when your model's accuracy degrades in production?"

Deployment Stack Overview

Model (PyTorch .pth)
    │
    ├─→ ONNX (.onnx)        ← portable, framework-agnostic
    │       └─→ TensorRT    ← NVIDIA GPU optimized engine
    │
    └─→ FastAPI server      ← REST API for inference
            └─→ Docker      ← containerized, reproducible
                    └─→ Kubernetes (k8s)  ← orchestration at scale
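To make the serving stage of the stack more concrete, here is a minimal sketch of dynamic batching (the technique listed for Lab 02), written with only the standard library's `asyncio`. It is not the lab's implementation: `MicroBatcher`, `predict_batch`, `max_batch_size`, and `max_wait_s` are hypothetical names chosen for illustration, and the "model" is a stand-in function.

```python
import asyncio


class MicroBatcher:
    """Collects single requests and runs them through the model in batches."""

    def __init__(self, predict_batch, max_batch_size=8, max_wait_s=0.01):
        # predict_batch is a hypothetical callable: list of inputs -> list of outputs.
        self.predict_batch = predict_batch
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Called once per request; returns when the item's batch has run."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Background worker: form a batch, run it, hand results back."""
        loop = asyncio.get_running_loop()
        while True:
            # Block until at least one request arrives.
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait_s
            # Keep filling until the batch is full or the wait budget is spent.
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            inputs = [item for item, _ in batch]
            try:
                outputs = self.predict_batch(inputs)
            except Exception as exc:
                # Propagate the failure to every caller waiting on this batch.
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)
                continue
            for (_, future), output in zip(batch, outputs):
                if not future.done():
                    future.set_result(output)


async def demo():
    seen_batch_sizes = []

    def fake_model(inputs):
        # Stand-in for a real batched forward pass.
        seen_batch_sizes.append(len(inputs))
        return [x * 2 for x in inputs]

    batcher = MicroBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results, seen_batch_sizes


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch `predict_batch` runs synchronously and would block the event loop during a real forward pass; a production server would more likely hand the batch to a worker thread or process, or delegate batching to a dedicated inference server.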

GPU/TPU Relevance

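TensorRT: builds an optimized inference engine for NVIDIA GPUs and can run FP32 models in FP16 or INT8; speedups of ~3-5× are commonly cited, but the actual gain depends on the model, GPU, and batch size, and INT8 also requires calibration data or QAT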
Triton Inference Server: production-grade dynamic batching on GPU clusters
Quantization-aware training (QAT): prepare models for INT8 before export
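As a rough illustration of what moving from FP32 to INT8 means arithmetically, the sketch below applies symmetric per-tensor quantization to a short list of weights in plain Python. It is only the underlying arithmetic, not what TensorRT or a QAT toolchain actually runs: real toolchains typically choose scales from calibration data and often use per-channel scales. The function names `quantize_int8` and `dequantize` are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    # One scale for the whole tensor; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.5, -1.27, 0.003, 1.0]
    codes, scale = quantize_int8(weights)
    restored = dequantize(codes, scale)
    for w, c, r in zip(weights, codes, restored):
        print(f"{w:+.4f} -> {c:+4d} -> {r:+.4f}")
```

Each restored value differs from the original by at most half a quantization step (`scale / 2`), which is why small weights such as 0.003 collapse to zero here. That loss is the accuracy cost that calibration and QAT try to keep under control.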

Monitoring Checklist

Inference latency (p50, p95, p99)
GPU utilization and memory
Request throughput (RPS)
Prediction distribution drift
Error rates (5xx, timeout)
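Several checklist items reduce to simple computations over logged requests. The sketch below shows one way to compute latency percentiles (nearest-rank method), a 5xx error rate, and a Population Stability Index (PSI) as a drift signal over prediction histograms. PSI is one drift measure among several and is an assumption here, not something the checklist prescribes; all function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(latencies_ms):
    """p50 / p95 / p99 over a window of request latencies."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}


def error_rate(status_codes):
    """Fraction of responses that were server errors (5xx)."""
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def population_stability_index(expected_counts, actual_counts, eps=1e-6):
    """PSI between two histograms over the same bins (e.g. predicted class counts)."""
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    psi = 0.0
    for e, a in zip(expected_counts, actual_counts):
        # Clamp to eps so empty bins do not produce log(0) or division by zero.
        e_frac = max(e / e_total, eps)
        a_frac = max(a / a_total, eps)
        psi += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return psi


if __name__ == "__main__":
    print(latency_summary(list(range(1, 101))))
    print(error_rate([200, 200, 500, 503]))
    print(population_stability_index([50, 50], [90, 10]))
```

A PSI of 0 means the two histograms match. A commonly quoted rule of thumb treats values above roughly 0.2 as a shift worth investigating, though the right threshold depends on the application. In a running service these numbers would usually be exported to a metrics system such as Prometheus rather than computed ad hoc.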
