Capstone 03 — Production LLM Inference Gateway
Phase: 11 | Difficulty: ⭐⭐⭐⭐⭐
A multi-model, multi-tenant inference gateway suitable for portfolio + interviews. This is the highest-leverage capstone for LLM Inference Engineer, LLM Infrastructure Engineer, and Foundation Model Engineer roles.
Goals
- Serve 2+ models concurrently (e.g., a small + a large) with vLLM as the backend
- Multi-tenant: per-API-key auth + token-bucket rate limiting + per-tenant usage metering
- Smart routing: route by
modelfield, with fallback for overloaded backends - OpenAI-compatible
/v1/chat/completions(streaming + non-streaming) - Observability: Prometheus metrics + OpenTelemetry traces + structured JSON logs
- Load test: sustain 100 concurrent users, p50/p99/throughput dashboards
Architecture
Client ──► [FastAPI Gateway] ──► [Router] ──► [vLLM backend pool]
│ │
├── auth + RL ├── health checks
├── metering ├── circuit breaker
├── trace ID └── retries / fallback
└── stream proxy
Suggested Stack
- API: FastAPI + uvicorn (workers ≥ 4)
- Backends: 2× vLLM containers (e.g.,
Qwen/Qwen2-0.5B-Instruct+Qwen/Qwen2-7B-Instruct) - Cache / RL: Redis
- Observability: Prometheus + Grafana + OpenTelemetry Collector → Tempo/Jaeger
- Load test: Locust or
k6
Deliverables Checklist
-
gateway/FastAPI app with/v1/chat/completions(streaming SSE) -
docker-compose.ymlrunning gateway + 2× vLLM + Redis + Prometheus + Grafana -
loadtest/locustfile.py— 100 concurrent users, mixed prompts -
dashboards/Grafana JSON: TTFT, ITL, throughput, error rate, queue depth -
BENCHMARK.md: p50/p95/p99 latency, tokens/sec, GPU util at sustained load -
ARCHITECTURE.md: design decisions, alternatives considered, scaling plan -
One-line
make deploy(ordocker compose up)
Resume Bullet Pattern
"Designed and deployed an OpenAI-compatible LLM inference gateway serving 2 models with multi-tenant auth, token-bucket rate limiting, and per-tenant metering. Sustained 100 concurrent users at p99 < 2.5s TTFT with vLLM continuous batching, full OpenTelemetry observability, and Grafana dashboards."
Interview Talking Points
- Why FastAPI/uvicorn over Flask (async streaming proxy)
- How vLLM's PagedAttention enables continuous batching (vs static batching's wasted compute)
- Token-bucket vs sliding-window rate limiting tradeoffs
- TTFT vs ITL: why both matter and what knobs affect each
- Circuit breaker patterns for unhealthy backends
- How to scale: horizontal (more vLLM replicas) vs vertical (bigger GPU + tensor parallelism)
Getting Started
This folder is intentionally a scaffold — building this is the assignment. Recommended order:
- Stand up a single vLLM backend with Docker, hit it with curl.
- Build the FastAPI gateway with one route, proxying SSE streams.
- Add a second backend + simple model-name router.
- Add Redis-backed token-bucket rate limiter.
- Add Prometheus middleware + OpenTelemetry.
- Write Locust file, run benchmark, write up
BENCHMARK.md.