Capstone 03 — Production LLM Inference Gateway

Phase: 11 | Difficulty: ⭐⭐⭐⭐⭐

A multi-model, multi-tenant inference gateway, suitable both as a portfolio project and as interview material. This is the highest-leverage capstone for LLM Inference Engineer, LLM Infrastructure Engineer, and Foundation Model Engineer roles.

Goals

  • Serve 2+ models concurrently (e.g., a small + a large) with vLLM as the backend
  • Multi-tenant: per-API-key auth + token-bucket rate limiting + per-tenant usage metering
  • Smart routing: route by model field, with fallback for overloaded backends
  • OpenAI-compatible /v1/chat/completions (streaming + non-streaming)
  • Observability: Prometheus metrics + OpenTelemetry traces + structured JSON logs
  • Load test: sustain 100 concurrent users, with dashboards for p50/p95/p99 latency and throughput
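
The token-bucket goal above can be sketched in a few lines. This is a minimal in-memory illustration, not the Redis-backed limiter the capstone asks for: the class names `TokenBucket` and `RateLimiter`, the `cost` parameter, and the injectable `clock` are all assumptions made for the example. With several uvicorn workers, each process would hold its own buckets, which is why the real limiter needs shared state such as Redis.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """One tenant's bucket: at most `capacity` tokens, refilled continuously."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate  # tokens added per second
        self._clock = clock
        self._tokens = capacity         # start full so a new tenant can burst
        self._updated_at = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = self._clock()
        elapsed = now - self._updated_at
        self._tokens = min(self.capacity, self._tokens + elapsed * self.refill_rate)
        self._updated_at = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


class RateLimiter:
    """Keeps one bucket per API key (hypothetical helper for this sketch)."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._capacity = capacity
        self._refill_rate = refill_rate
        self._clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, api_key: str, cost: float = 1.0) -> bool:
        bucket = self._buckets.get(api_key)
        if bucket is None:
            bucket = TokenBucket(self._capacity, self._refill_rate, self._clock)
            self._buckets[api_key] = bucket
        return bucket.allow(cost)
```

A gateway handler would call `limiter.allow(api_key)` before proxying and return HTTP 429 when it is `False`. Passing a larger `cost` (for example, the requested `max_tokens`) is one way to rate-limit by tokens rather than by requests.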

Architecture

Client ──► [FastAPI Gateway] ──► [Router] ──► [vLLM backend pool]
                │                    │
                ├── auth + RL        ├── health checks
                ├── metering         ├── circuit breaker
                ├── trace ID         └── retries / fallback
                └── stream proxy
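
The router side of the diagram (health state, circuit breaker, fallback) might look roughly like the sketch below. It assumes each model name maps to an ordered list of backend URLs, primary first; the names `CircuitBreaker` and `Router`, the thresholds, and the URLs in the usage note are hypothetical. Whether falling back to a different model is acceptable for a tenant is a design decision for ARCHITECTURE.md.

```python
import time
from typing import Callable, Dict, List, Optional


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold: int = 3, reset_after_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def available(self) -> bool:
        if self._opened_at is None:
            return True  # closed: traffic flows normally
        # Open: only let a trial request through once the cooldown has passed.
        return self._clock() - self._opened_at >= self.reset_after_s

    def record_success(self) -> None:
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()  # (re)start the cooldown


class Router:
    """Maps a model name to the first backend whose breaker is available."""

    def __init__(self, routes: Dict[str, List[str]], failure_threshold: int = 3,
                 reset_after_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._routes = routes
        self._breakers = {
            url: CircuitBreaker(failure_threshold, reset_after_s, clock)
            for urls in routes.values() for url in urls
        }

    def pick(self, model: str) -> Optional[str]:
        """Return a backend URL, or None if the model is unknown or all are open."""
        for url in self._routes.get(model, []):
            if self._breakers[url].available():
                return url
        return None

    def report(self, url: str, ok: bool) -> None:
        """Feed the outcome of a proxied request back into the breaker."""
        breaker = self._breakers[url]
        if ok:
            breaker.record_success()
        else:
            breaker.record_failure()
```

For example, `Router({"small": ["http://vllm-small:8000", "http://vllm-large:8000"]})` would send `small` traffic to the first URL until three consecutive failures are reported, then to the second until the cooldown expires.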

Suggested Stack

  • API: FastAPI + uvicorn (workers ≥ 4)
  • Backends: 2× vLLM containers (e.g., Qwen/Qwen2-0.5B-Instruct + Qwen/Qwen2-7B-Instruct)
  • Cache / rate limiting: Redis
  • Observability: Prometheus + Grafana + OpenTelemetry Collector → Tempo/Jaeger
  • Load test: Locust or k6

Deliverables Checklist

  • gateway/ FastAPI app with /v1/chat/completions (streaming SSE)
  • docker-compose.yml running gateway + 2× vLLM + Redis + Prometheus + Grafana
  • loadtest/locustfile.py — 100 concurrent users, mixed prompts
  • dashboards/ Grafana JSON: TTFT, ITL, throughput, error rate, queue depth
  • BENCHMARK.md: p50/p95/p99 latency, tokens/sec, GPU util at sustained load
  • ARCHITECTURE.md: design decisions, alternatives considered, scaling plan
  • One-line make deploy (or docker compose up)
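
For the BENCHMARK.md numbers, a small helper like the one below can turn raw load-test samples into the p50/p95/p99 and tokens/sec figures. It is a sketch that uses the nearest-rank percentile definition (load-test tools and Prometheus may interpolate differently, so results can differ slightly); `percentile` and `summarize` are hypothetical names.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_s: Sequence[float], total_tokens: int,
              wall_time_s: float) -> Dict[str, float]:
    """Collect the headline numbers for one sustained-load run."""
    return {
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
        "p99_s": percentile(latencies_s, 99),
        "tokens_per_s": total_tokens / wall_time_s,
    }
```

Reporting the sample count and run duration next to these values makes the benchmark easier to reproduce and compare.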

Resume Bullet Pattern

"Designed and deployed an OpenAI-compatible LLM inference gateway serving 2 models with multi-tenant auth, token-bucket rate limiting, and per-tenant metering. Sustained 100 concurrent users at p99 TTFT under 2.5 s with vLLM continuous batching, full OpenTelemetry observability, and Grafana dashboards."

Interview Talking Points

  • Why FastAPI/uvicorn over Flask (async streaming proxy)
  • How vLLM combines continuous batching (iteration-level scheduling) with PagedAttention's paged KV cache, and why static batching wastes compute on padded or finished sequences
  • Token-bucket vs sliding-window rate limiting tradeoffs
  • TTFT vs ITL: why both matter and what knobs affect each
  • Circuit breaker patterns for unhealthy backends
  • How to scale: horizontal (more vLLM replicas) vs vertical (bigger GPU + tensor parallelism)
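
To make the TTFT vs ITL talking point concrete, the sketch below derives both from the arrival time of each streamed token. It assumes the client (or the gateway's stream proxy) records a monotonic timestamp when the request is sent and one per received token; `StreamTiming` and `stream_timing` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class StreamTiming:
    ttft_s: float      # time to first token: request sent -> first token arrives
    mean_itl_s: float  # mean inter-token latency: average gap between tokens
    total_s: float     # request sent -> last token arrives


def stream_timing(request_sent_at: float, token_times: Sequence[float]) -> StreamTiming:
    """Compute TTFT and mean ITL for one streamed response."""
    if not token_times:
        raise ValueError("stream produced no tokens")
    ttft = token_times[0] - request_sent_at
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0  # single-token stream has no gaps
    return StreamTiming(ttft, mean_itl, token_times[-1] - request_sent_at)
```

TTFT is dominated by queueing and prefill, while ITL reflects the decode loop, so the two respond to different knobs and are worth plotting on separate dashboard panels.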

Getting Started

This folder is intentionally a scaffold — building this is the assignment. Recommended order:

  1. Stand up a single vLLM backend with Docker, hit it with curl.
  2. Build the FastAPI gateway with one route, proxying SSE streams.
  3. Add a second backend and a simple model-name router.
  4. Add a Redis-backed token-bucket rate limiter.
  5. Add Prometheus middleware + OpenTelemetry.
  6. Write the Locust file, run the benchmark, and write up BENCHMARK.md.
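
When proxying SSE streams in step 2, it helps to have a small parser for the OpenAI-style stream format, where each event is a `data: {json}` line and the stream ends with `data: [DONE]`. The sketch below is one way to read such lines for logging or rough metering; `iter_sse_data` and `collect_stream` are hypothetical helpers, and the chunk count only approximates token usage, so authoritative counts should come from the backend's usage data where it is available.

```python
import json
from typing import Iterable, Iterator, List, Tuple

DONE_SENTINEL = "[DONE]"


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the payload of each `data:` line, stopping at the [DONE] sentinel."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, ":" comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == DONE_SENTINEL:
            return
        yield payload


def collect_stream(lines: Iterable[str]) -> Tuple[str, int]:
    """Return (full text, number of content chunks) for one chat completion stream."""
    parts: List[str] = []
    chunks = 0
    for payload in iter_sse_data(lines):
        event = json.loads(payload)
        for choice in event.get("choices", []):
            content = (choice.get("delta") or {}).get("content")
            if content:  # role-only or empty deltas carry no text
                parts.append(content)
                chunks += 1
    return "".join(parts), chunks
```

In the gateway itself, the same lines would be forwarded to the client unchanged as they arrive; parsing happens on the side so that metering never delays the stream.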