Lab 7-03: Docker Deployment

Learning Goals

Write a production Dockerfile for a PyTorch inference service
Use multi-stage builds to minimize image size
Configure the NVIDIA Container Toolkit (the successor to nvidia-docker2) for GPU access in containers
Set resource limits and health checks in docker-compose

Core Concepts

Multi-Stage Dockerfile

# ── Stage 1: Builder (install deps, compile wheels) ──────────────────────────
FROM python:3.11-slim AS builder
WORKDIR /build
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

# ── Stage 2: Runtime (copy only what's needed) ───────────────────────────────
FROM python:3.11-slim AS runtime
WORKDIR /app

# Copy installed packages from builder
COPY --from=builder /root/.local /root/.local

# Copy application code
COPY solution.py .

ENV PATH=/root/.local/bin:$PATH
ENV PYTHONUNBUFFERED=1

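# Health check (python:3.11-slim does not ship curl, so probe with the standard library)
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)" || exit 1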

EXPOSE 8000
CMD ["uvicorn", "solution:app", "--host", "0.0.0.0", "--port", "8000"]
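The HEALTHCHECK in the Dockerfile above probes http://localhost:8000/health from inside the container. As a minimal sketch, the same probe can live in a small standard-library script so the image needs no extra HTTP tooling. The module name healthcheck.py and the functions check_health and main are hypothetical; only the /health endpoint and port 8000 come from the Dockerfile.

```python
"""healthcheck.py: hypothetical stdlib-only probe for a Docker HEALTHCHECK."""
import http.client
import urllib.error
import urllib.request


def check_health(url: str = "http://localhost:8000/health", timeout: float = 3.0) -> bool:
    """Return True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False


def main() -> int:
    """Map the probe result to the exit code Docker expects: 0 healthy, 1 unhealthy."""
    return 0 if check_health() else 1
```

To use it, the Dockerfile would also need to COPY healthcheck.py into /app and call it from HEALTHCHECK, for example with CMD python -c "import sys, healthcheck; sys.exit(healthcheck.main())".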

GPU-Enabled Dockerfile

# Use NVIDIA's NGC PyTorch image, which bundles CUDA, cuDNN, and a GPU build of PyTorch
FROM nvcr.io/nvidia/pytorch:24.01-py3 AS runtime
WORKDIR /app

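# Install the service's extra dependencies. Keep torch out of requirements.txt
# so pip does not replace the CUDA-enabled build that ships with the base image.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt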

COPY solution.py .
ENV PYTHONUNBUFFERED=1
EXPOSE 8000

CMD ["uvicorn", "solution:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "1"]

docker-compose with GPU

version: "3.9"
services:
  inference:
    build: .
    ports:
      - "8000:8000"
    deploy:
      resources:
        limits:
          cpus: "4.0"
          memory: 8G
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    environment:
      - CUDA_VISIBLE_DEVICES=0
    volumes:
      - ./outputs:/app/outputs
    restart: unless-stopped

Key Docker Commands

# Build image
docker build -t cv-inference:latest .

# Run with GPU
docker run --gpus all -p 8000:8000 cv-inference:latest

# Check container health
docker inspect --format='{{.State.Health.Status}}' <container_id>

# View logs
docker logs -f <container_id>

# Resource stats
docker stats <container_id>
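In deploy scripts, the health status command above is usually polled until the container settles. A minimal sketch of that loop follows; docker_health_status and wait_for_healthy are hypothetical helper names, and the status and sleep functions are injectable so the logic can be exercised without a Docker daemon.

```python
import subprocess
import time
from typing import Callable


def docker_health_status(container: str) -> str:
    """Return the health status Docker reports: starting, healthy, or unhealthy."""
    result = subprocess.run(
        ["docker", "inspect", "--format", "{{.State.Health.Status}}", container],
        capture_output=True,
        text=True,
        check=True,  # raise if the container does not exist or has no health check
    )
    return result.stdout.strip()


def wait_for_healthy(
    container: str,
    timeout_s: float = 120.0,
    poll_s: float = 2.0,
    status_fn: Callable[[str], str] = docker_health_status,
    sleep_fn: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the container is healthy; give up on 'unhealthy' or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = status_fn(container)
        if status == "healthy":
            return True
        if status == "unhealthy" or time.monotonic() >= deadline:
            return False
        sleep_fn(poll_s)
```

A deploy script could call wait_for_healthy("<container_id>") after docker run and abort the rollout when it returns False.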

Q: Why use multi-stage builds for ML containers?
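A: PyTorch plus its dependencies can reach 3-5 GB. A multi-stage build separates the build environment (compilers, headers, pip caches) from the runtime environment: the final stage copies only the installed packages, so build tools never reach the shipped image. This can reduce image size by 30-60%, with the larger savings when dependencies must be compiled from source rather than installed from prebuilt wheels.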

Q: How do you handle model weights in a Docker container?
A: Three options: (1) Bake into the image with COPY: simple, but it makes the image large and ties every model update to an image rebuild; (2) Mount as a Docker volume: flexible, and the image stays small; (3) Download at startup from S3/GCS: best for production with versioned models. Option 3 is generally preferred: pull the specific model version on container startup, for example with boto3 for S3 or gsutil for GCS.
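A minimal sketch of option 3, assuming boto3 is listed in requirements.txt. The helper names ensure_weights and s3_download, the key layout <model>/<version>/model.pt, and the cache directory /app/models are all hypothetical choices for illustration, not part of the lab's solution.py.

```python
import os
from pathlib import Path
from typing import Callable


def s3_download(bucket: str, key: str, dest: str) -> None:
    """Download one object with boto3 (imported lazily so this module loads without it)."""
    import boto3  # assumption: boto3 is installed in the image

    boto3.client("s3").download_file(bucket, key, dest)


def ensure_weights(
    bucket: str,
    model_name: str,
    version: str,
    cache_dir: str = "/app/models",  # hypothetical cache location
    download: Callable[[str, str, str], None] = s3_download,
) -> Path:
    """Return the local path of the requested model version, downloading it once."""
    key = f"{model_name}/{version}/model.pt"  # hypothetical bucket layout
    dest = Path(cache_dir) / model_name / version / "model.pt"
    if dest.exists():
        return dest  # already cached, e.g. on a mounted volume
    dest.parent.mkdir(parents=True, exist_ok=True)
    tmp = dest.with_name(dest.name + ".part")
    download(bucket, key, str(tmp))
    os.replace(tmp, dest)  # atomic rename: a partial download is never visible as model.pt
    return dest
```

Calling ensure_weights once at startup, before the model is loaded, pins the container to an explicit version; pointing cache_dir at a mounted volume combines this with option 2 so restarts skip the download.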

Q: What's the difference between docker run --gpus all and a CPU container?
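A: --gpus all exposes the host's NVIDIA GPUs to the container and requires the NVIDIA Container Toolkit (nvidia-container-toolkit) on the host. Without the flag, no GPU devices are visible inside the container, so torch.cuda.is_available() returns False; the service then runs on CPU only if its code selects the device accordingly. In Kubernetes, GPU access is handled by the NVIDIA GPU device plugin, with nvidia.com/gpu: 1 set in the container's resource limits.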
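Because the same image may run with or without GPU access, the service code has to pick its device at startup. A minimal sketch, with select_device as a hypothetical helper name; torch is imported lazily so the function also works in an environment where PyTorch is not installed.

```python
def select_device(prefer_gpu: bool = True) -> str:
    """Return "cuda" when a GPU is usable inside this container, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch available: only CPU execution is possible.
        return "cpu"
    if prefer_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"
```

In the inference service this would typically be used as model.to(select_device()), so a container started without --gpus all still serves requests on CPU.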
