Phase 7 — MLOps & Production Deployment
Weeks 17-18 of 20 | Bridge from research to production
What This Phase Covers
This phase covers the path from a trained model to a production service: exporting models to portable formats, building inference APIs, containerizing with Docker, and tracking experiments with MLflow. These skills distinguish ML engineers who can deploy and maintain models from those who can only train them.
Labs
| # | Lab | Core Skills |
|---|---|---|
| 01 | ONNX Export & Optimization | torch.onnx.export, ONNX graph, TensorRT, FP16/INT8 |
| 02 | FastAPI Inference Server | async API, dynamic batching, Prometheus metrics |
| 03 | Docker Deployment | Dockerfile, NVIDIA Container Toolkit (formerly nvidia-docker), multi-stage build |
| 04 | MLflow Experiment Tracking | runs, artifacts, model registry, autolog |
Why MLOps Matters for Interviews
Most CV engineers have trained models. Fewer have shipped them. Interviewers at companies like Tesla, NVIDIA, and Apple probe deployment experience with questions such as:
- "How would you deploy this model to serve 10,000 requests/second?"
- "How do you roll back a bad model version?"
- "How do you detect when your model's accuracy degrades in production?"
Deployment Stack Overview
    Model (PyTorch .pth)
        │
        ├─→ ONNX (.onnx)                     ← portable, framework-agnostic
        │       └─→ TensorRT                 ← NVIDIA GPU optimized engine
        │
        └─→ FastAPI server                   ← REST API for inference
                └─→ Docker                   ← containerized, reproducible
                        └─→ Kubernetes (k8s) ← orchestration at scale
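
The serving layer in this stack usually groups individual requests into batches before calling the model, which is the "dynamic batching" skill listed for Lab 02. The sketch below shows one way the idea can work using only `asyncio`. `MicroBatcher`, `fake_model`, and the `max_batch_size` / `max_wait_s` parameters are hypothetical names chosen for illustration; this is not the lab's implementation or any library's API.

```python
import asyncio


class MicroBatcher:
    """Groups single requests into batches (hypothetical helper, not a library API)."""

    def __init__(self, predict_batch, max_batch_size=8, max_wait_s=0.01):
        self.predict_batch = predict_batch    # callable: list of inputs -> list of outputs
        self.max_batch_size = max_batch_size  # flush as soon as this many requests are queued
        self.max_wait_s = max_wait_s          # or once the first request has waited this long
        self.queue = asyncio.Queue()

    async def submit(self, item):
        """Enqueue one input and wait for its individual result."""
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((item, future))
        return await future

    async def run(self):
        """Worker loop: build batches from the queue until the task is cancelled."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]  # block until the first request arrives
            deadline = loop.time() + self.max_wait_s
            while len(batch) < self.max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            inputs = [item for item, _ in batch]
            try:
                # Synchronous call for simplicity; a real server would likely
                # run the model in an executor so the event loop is not blocked.
                outputs = self.predict_batch(inputs)
            except Exception as exc:
                for _, future in batch:
                    future.set_exception(exc)
                continue
            for (_, future), output in zip(batch, outputs):
                future.set_result(output)


batch_sizes = []  # records how many requests each model call received


def fake_model(inputs):
    """Stand-in for a batched forward pass: doubles every input."""
    batch_sizes.append(len(inputs))
    return [x * 2 for x in inputs]


async def demo():
    batcher = MicroBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
    worker = asyncio.create_task(batcher.run())
    results = await asyncio.gather(*(batcher.submit(i) for i in range(10)))
    worker.cancel()
    return results


results = asyncio.run(demo())
print(results, batch_sizes)
```

Ten concurrent requests are served with far fewer model calls than requests. The two knobs trade latency for throughput: a larger `max_wait_s` fills batches more often but adds waiting time to every request.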
GPU/TPU Relevance
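- TensorRT: compiles models into engines optimized for NVIDIA GPUs and can run them in FP16 or INT8 instead of FP32. The ~3-5× speedup is a rough figure; actual gains depend on the model architecture, batch size, and GPU generation, and INT8 typically needs calibration data or QAT to preserve accuracy
- Triton Inference Server: production-grade model serving with dynamic batching on GPU clusters
- Quantization-aware training (QAT): simulates INT8 quantization during training so accuracy holds up after the model is exported and quantized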
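
To make the FP32 to INT8 conversion concrete, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It illustrates the arithmetic only, under the assumption of a single scale per tensor; it is not TensorRT's calibration procedure, and `quantize_int8` and `dequantize` are hypothetical helper names.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: map floats to integers in [-127, 127] with one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step, then clamp to the INT8 range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the integers and the scale."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.003, 1.0]  # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)          # integer codes stored in place of the floats
print(max_error)  # rounding error, bounded by scale / 2
```

Small values such as `0.003` collapse to zero because they fall below half a quantization step. This kind of rounding error is what QAT exposes the model to during training.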
Monitoring Checklist
- Inference latency (p50, p95, p99)
- GPU utilization and memory
- Request throughput (RPS)
- Prediction distribution drift
- Error rates (5xx, timeout)
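
Two items on this checklist reduce to short calculations: latency percentiles and prediction distribution drift. The sketch below uses the nearest-rank percentile definition and the Population Stability Index (PSI) as one possible drift score. Both choices, the helper names, the sample data, and the 0.25 alert threshold (a commonly quoted rule of thumb) are assumptions for illustration, not requirements of this phase.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def psi(expected: Sequence[float], actual: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two binned distributions (proportions per bin)."""
    total = 0.0
    for e, a in zip(expected, actual):
        e = max(e, eps)  # guard against log(0) and division by zero in empty bins
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


# Hypothetical latency samples in milliseconds: 1, 2, ..., 100.
latencies_ms = [float(ms) for ms in range(1, 101)]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)

# Hypothetical share of predictions per class at training time vs. in production.
baseline = [0.7, 0.2, 0.1]
live = [0.4, 0.3, 0.3]
drift_score = psi(baseline, live)
drift_alert = drift_score > 0.25  # assumed threshold; tune per application

print(p50, p95, p99)
print(round(drift_score, 3), drift_alert)
```

In practice the latency samples would come from request logs or a metrics system such as Prometheus, and the baseline distribution from predictions on a held-out validation set.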