Capstone 03 — Production LLM Inference Gateway

Phase: 11 | Difficulty: ⭐⭐⭐⭐⭐

A multi-model, multi-tenant inference gateway suitable for a portfolio and for interview discussion. This is the highest-leverage capstone for LLM Inference Engineer, LLM Infrastructure Engineer, and Foundation Model Engineer roles.

Goals

Serve two or more models concurrently (e.g., one small and one large) with vLLM as the backend
Multi-tenant: per-API-key auth + token-bucket rate limiting + per-tenant usage metering
Smart routing: route on the request's model field, with fallback when a backend is overloaded or unhealthy
OpenAI-compatible /v1/chat/completions (streaming + non-streaming)
Observability: Prometheus metrics + OpenTelemetry traces + structured JSON logs
Load test: sustain 100 concurrent users, with dashboards for p50/p95/p99 latency and throughput
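
The per-tenant token-bucket goal can be sketched in plain Python. This is a minimal in-memory illustration, not the Redis-backed limiter the gateway would need once it runs more than one uvicorn worker. The class names, the `clock` parameter, and the capacity and refill values are all assumptions made for the example.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `refill_rate` tokens per second."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self.tokens = capacity          # start full so new tenants can burst
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        # Refill lazily, based on time elapsed since the previous call.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """One bucket per API key (hypothetical helper, in-memory only)."""

    def __init__(self, capacity: float, refill_rate: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.clock = clock
        self._buckets: Dict[str, TokenBucket] = {}

    def allow(self, api_key: str, cost: float = 1.0) -> bool:
        bucket = self._buckets.get(api_key)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_rate, self.clock)
            self._buckets[api_key] = bucket
        return bucket.allow(cost)
```

A request handler would call `limiter.allow(api_key)` before proxying and return HTTP 429 when it is False. Because the state lives in process memory, each worker would enforce its own limit; moving the refill-and-take step into Redis is what makes the limit global.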

Architecture

Client ──► [FastAPI Gateway] ──► [Router] ──► [vLLM backend pool]
                │                    │
                ├── auth + RL        ├── health checks
                ├── metering         ├── circuit breaker
                ├── trace ID         └── retries / fallback
                └── stream proxy
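
The router side of the diagram (circuit breaker plus fallback) might look roughly like the sketch below. It assumes each model name maps to an ordered list of backend URLs with the primary first. The URLs, thresholds, and class names are hypothetical, and the half-open state is simplified: after the cool-down, every caller is allowed through until a failure is recorded again.

```python
import time
from typing import Callable, Dict, List, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive failures; retries after `reset_after` seconds."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def available(self) -> bool:
        if self.opened_at is None:
            return True  # closed: traffic flows normally
        # Half-open: allow trial traffic once the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        self.failures += 1
        if self.failures >= self.max_failures:
            self.opened_at = self.clock()  # (re)open and restart the cool-down


class Router:
    """Picks the first available backend for a model, falling back down the list."""

    def __init__(self, routes: Dict[str, List[str]],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.routes = routes
        self.breakers = {
            url: CircuitBreaker(clock=clock)
            for urls in routes.values()
            for url in urls
        }

    def pick(self, model: str) -> str:
        if model not in self.routes:
            raise KeyError(f"unknown model: {model}")
        for url in self.routes[model]:
            if self.breakers[url].available():
                return url
        raise RuntimeError(f"no healthy backend for model: {model}")
```

The gateway would call `record_failure()` on timeouts or 5xx responses and `record_success()` otherwise, so the breaker state tracks what real traffic observes in addition to the periodic health checks.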

Suggested Stack

API: FastAPI + uvicorn (workers ≥ 4)
Backends: 2× vLLM containers (e.g., Qwen/Qwen2-0.5B-Instruct + Qwen/Qwen2-7B-Instruct)
Cache / rate limiting: Redis
Observability: Prometheus + Grafana + OpenTelemetry Collector → Tempo/Jaeger
Load test: Locust or k6

Deliverables Checklist

gateway/ FastAPI app with /v1/chat/completions (streaming SSE)
docker-compose.yml running gateway + 2× vLLM + Redis + Prometheus + Grafana
loadtest/locustfile.py — 100 concurrent users, mixed prompts
dashboards/ Grafana JSON: TTFT, ITL, throughput, error rate, queue depth
BENCHMARK.md: p50/p95/p99 latency, tokens/sec, GPU util at sustained load
ARCHITECTURE.md: design decisions, alternatives considered, scaling plan
One-command deployment: make deploy (or docker compose up)
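
For BENCHMARK.md, the latency percentiles and throughput can be derived from raw load-test samples with a few lines of Python. The sketch below uses the nearest-rank percentile definition, which is only one of several conventions, so Locust, k6, and Grafana may report slightly different values. The function names and the result keys are illustrative.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(latencies_s: Sequence[float], total_tokens: int,
              wall_time_s: float) -> Dict[str, float]:
    """Condense one load-test run into the headline numbers for the write-up."""
    return {
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
        "p99_s": percentile(latencies_s, 99),
        "tokens_per_s": total_tokens / wall_time_s,
    }
```

Reporting the sample count and test duration next to these numbers helps: a p99 computed from a few hundred requests rests on only a handful of tail samples.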

Resume Bullet Pattern

"Designed and deployed an OpenAI-compatible LLM inference gateway serving 2 models with multi-tenant auth, token-bucket rate limiting, and per-tenant metering. Sustained 100 concurrent users at p99 TTFT < 2.5 s with vLLM continuous batching, full OpenTelemetry observability, and Grafana dashboards."

Interview Talking Points

Why FastAPI/uvicorn over Flask (async streaming proxy)
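How vLLM's PagedAttention (block-based KV-cache allocation) makes continuous batching memory-efficient, and how continuous batching avoids the compute that static batching wastes while waiting on the longest sequence in a batch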
Token-bucket vs sliding-window rate limiting tradeoffs
TTFT vs ITL: why both matter and what knobs affect each
Circuit breaker patterns for unhealthy backends
How to scale: horizontal (more vLLM replicas) vs vertical (bigger GPU + tensor parallelism)
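
To make the TTFT vs ITL point concrete, both can be computed from the arrival times of streamed tokens. TTFT (time to first token) covers queueing plus prefill, while ITL (inter-token latency) reflects decode speed once generation has started. The helper below is a hypothetical sketch that assumes the gateway records one timestamp per streamed chunk.

```python
from typing import List, NamedTuple


class StreamTiming(NamedTuple):
    ttft_s: float      # request sent -> first token received
    mean_itl_s: float  # average gap between consecutive tokens


def stream_timing(request_sent_at: float, token_times: List[float]) -> StreamTiming:
    """Compute TTFT and mean ITL from timestamps taken on the same clock."""
    if not token_times:
        raise ValueError("stream produced no tokens")
    ttft = token_times[0] - request_sent_at
    # Gaps between each token and the one after it.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    mean_itl = sum(gaps) / len(gaps) if gaps else 0.0
    return StreamTiming(ttft, mean_itl)
```

For example, a request sent at t = 10.0 s whose tokens arrive at 10.5, 10.75, 11.0, and 11.25 s has a TTFT of 0.5 s and a mean ITL of 0.25 s. Long prompts and deep queues mostly push up the first number; large decode batches mostly push up the second.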

Getting Started

This folder is intentionally a scaffold — building this is the assignment. Recommended order:

Stand up a single vLLM backend with Docker, hit it with curl.
Build the FastAPI gateway with one route, proxying SSE streams.
Add a second backend + simple model-name router.
Add a Redis-backed token-bucket rate limiter.
Add Prometheus middleware and OpenTelemetry tracing.
Write the Locust file, run the benchmark, and write up the results in BENCHMARK.md.
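
For the SSE proxy step, it helps to see what a streamed chat completion looks like on the wire. OpenAI-style streams send one `data: {json}` line per chunk and end with `data: [DONE]`. The sketch below pulls the text deltas out of such a stream and could serve as a starting point for metering or TTFT timing. The function name is hypothetical, and a real proxy would most likely forward the raw bytes to the client unchanged and run parsing like this on the side.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(sse_lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments from OpenAI-style chat completion SSE lines."""
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and SSE comment lines
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The first chunk usually carries only the role, so content may be absent or null.
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text
```

Feeding it the lines of a captured curl response from the single-backend step is a quick way to check the parser before wiring it into the gateway.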