01 — LLM Inference Gateway @ 100k QPS
Roles: LLM Inference Engineer · LLM Infrastructure Engineer · Foundation Model Engineer
Asked at: Anthropic, OpenAI, Together, Fireworks, Anyscale, Databricks, Cohere
1. Clarifying Questions
Functional
- What models? (Mix: 1× large 70B-class, 2× medium 7-13B, 3× small 0.5-3B?)
- Modality? (Text only, or multimodal?)
- Streaming? (Almost always yes — TTFT matters for UX.)
- Tool/function calling? Structured outputs (JSON)?
- BYO model fine-tunes (LoRA hot-swap), or fixed catalog?
Non-functional
- 100k QPS — peak or steady? Globally distributed or one region?
- SLOs? (Typical: TTFT p99 < 1s, ITL p99 < 50ms, availability 99.9%.)
- Max context? (32k? 128k? 1M?) — drives KV-cache memory.
- Cost target? ($/Mtok input, $/Mtok output)
- Multi-tenant fairness? (Don't let one tenant starve others.)
2. Capacity Estimation
Assumptions: 100k QPS, avg input 800 tok, avg output 200 tok, 70/30 split between 7B and 70B traffic.
| Metric | Computation | Value |
|---|---|---|
| Tokens/sec (input + output) | 100k QPS × (800 + 200) tok | 100M tok/s |
| 7B traffic | 70k QPS × 1000 tok | 70M tok/s |
| 70B traffic | 30k QPS × 1000 tok | 30M tok/s |
| H100s for 7B (fp8, BS≈128) | 70M tok/s ÷ ~3000 tok/s decode per H100 | ~23k H100s |
| H100s for 70B (TP=4, fp8) | 30M tok/s ÷ ~600 tok/s effective per H100 | ~50k H100s |
| Total GPUs | 23k + 50k | ~73k, round to ~70k H100s |
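| KV-cache @ 128k ctx, 70B | 2 (K, V) × 80 layers × 8 KV heads × 128 head dim × 128k tok | ~20 GB / request at FP8, ~40 GB at FP16 (assumes a Llama-70B-style GQA layout); TP + paged required |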
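Sanity: at $4/H100/hr, ~70k GPUs cost ~$2.5B/yr in compute alone (70k × $4 × 8,760 hr). The estimate is deliberately pessimistic: it charges all 1,000 tokens per request at the decode rate, while prefill processes input tokens far faster. To make the economics work, either (a) average context is much lower, (b) the price per token is high, or (c) you push hard on quantization, batching, speculative decoding, and MoE.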
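The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch using only the assumptions stated above (70k/30k QPS split, 1,000 tokens per request, ~3000 and ~600 tok/s per H100, $4 per H100-hour); the function name `gpus_needed` is illustrative.

```python
import math

def gpus_needed(qps: int, tokens_per_request: int, tok_per_sec_per_gpu: float) -> int:
    """GPUs required to sustain `qps`, charging every token at the given rate."""
    return math.ceil(qps * tokens_per_request / tok_per_sec_per_gpu)

TOKENS_PER_REQUEST = 800 + 200          # avg input + avg output tokens

gpus_7b = gpus_needed(70_000, TOKENS_PER_REQUEST, 3000)   # 70% of 100k QPS
gpus_70b = gpus_needed(30_000, TOKENS_PER_REQUEST, 600)   # 30% of 100k QPS
total_gpus = gpus_7b + gpus_70b

HOURS_PER_YEAR = 24 * 365
annual_cost_usd = total_gpus * 4.0 * HOURS_PER_YEAR       # $4 per H100-hour

print(gpus_7b, gpus_70b, total_gpus)
print(f"${annual_cost_usd / 1e9:.2f}B per year")
```

Changing a single input (say, the per-GPU throughput after enabling speculative decoding) shows immediately how far the fleet size moves.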
3. API & Data Model
Public API (OpenAI-compatible):
POST /v1/chat/completions
Authorization: Bearer sk-...
Content-Type: application/json
{
  "model": "anthropic/claude-3-haiku",
  "messages": [...],
  "max_tokens": 512,
  "stream": true,
  "temperature": 0.7
}
Streaming response: text/event-stream, one SSE event per token (or token batch).
Internal protocol (gateway ↔ backend): gRPC with bidirectional streaming, or HTTP/2. Carry: request_id, tenant_id, prompt tokens, sampling params, deadline.
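On the client side, consuming the stream comes down to parsing `data:` lines. The sketch below assumes the gateway mirrors the OpenAI chunk schema (`choices[0].delta.content`) and ends the stream with a `data: [DONE]` sentinel; the function name `iter_sse_deltas` is hypothetical.

```python
import json
from typing import Iterable, Iterator

def iter_sse_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text deltas from an OpenAI-style chat-completions SSE stream."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue                          # blank separators, comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":               # assumed end-of-stream sentinel
            return
        chunk = json.loads(payload)
        choices = chunk.get("choices") or []
        if not choices:
            continue                          # e.g. a usage-only chunk
        content = choices[0].get("delta", {}).get("content")
        if content:
            yield content
```

Feeding it the lines of a response body and joining the yielded strings reconstructs the full completion.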
4. High-Level Architecture
                        ┌──────────────────┐
Client ──TLS──► [ALB] ─►│   Edge (Envoy)   │ ── auth, rate-limit, WAF
                        └────────┬─────────┘
                                 ▼
                        ┌──────────────────┐
                        │   Gateway (Go)   │ ── routing, batching policy,
                        │ - router         │    metering, fallback,
                        │ - admission ctl  │    SSE proxy
                        └────────┬─────────┘
               ┌─────────────────┼─────────────────┐
               ▼                 ▼                 ▼
        [Pool: 7B vLLM]  [Pool: 13B vLLM]  [Pool: 70B vLLM TP=4]
        - PagedAttention - PagedAttention  - PagedAttention
        - cont. batching - cont. batching  - cont. batching
        - prefix cache   - prefix cache    - prefix cache
               ▲                 ▲                 ▲
               └─────────────────┴─────────────────┘
                                 ▲
                       ┌─────────┴──────────┐
                       │ Control Plane      │
                       │ - service discovery│
                       │ - autoscaler (KPA) │
                       │ - LoRA manager     │
                       └────────────────────┘
Side-cars: Redis (RL/cache) · Kafka (logs/usage) · Prometheus · OTel
5. Deep Dives
5.1 Continuous Batching (the single biggest lever)
- Static batching wastes compute: a batch finishes when its slowest sequence finishes.
- Continuous batching (Orca, vLLM): at every decode step, evict finished sequences and admit new ones.
- Effect: 3-10× throughput at the same latency, depending on output-length variance.
- Knobs: max_num_seqs, max_num_batched_tokens, scheduling policy (FCFS vs prefill-first).
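The gap between the two policies can be seen in a toy simulation. This sketch counts decode iterations only; it ignores prefill, assumes every sequence emits at least one token, and treats `batch_size` as a stand-in for max_num_seqs. It is not how vLLM's scheduler is implemented.

```python
from collections import deque

def static_batching_steps(output_lens, batch_size):
    """Decode steps when each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps

def continuous_batching_steps(output_lens, batch_size):
    """Decode steps when finished sequences free their slot after every step."""
    waiting = deque(output_lens)
    running = []              # tokens still to generate per in-flight sequence
    steps = 0
    while waiting or running:
        # Admit waiting sequences into free slots, first come first served.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1            # one iteration emits one token per sequence
        running = [r - 1 for r in running if r > 1]   # drop finished ones
    return steps
```

With output lengths `[1, 8, 1, 8, 1, 1, 1, 1]` and a batch size of 2, static batching needs 18 steps and continuous batching 11, which is also the lower bound of 22 tokens over 2 slots. The higher the variance in output length, the wider the gap.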
5.2 PagedAttention + KV-Cache Management
- KV cache is paged (16-token blocks), like virtual memory.
- Eliminates external fragmentation and limits internal fragmentation to the last, partially filled block of each sequence; enables sharing blocks across requests with the same prefix.
- Prefix caching: if 80% of system prompts are identical, you save the prefill cost on those tokens.
- Memory pressure → admission control: refuse new request if it can't fit, don't preempt mid-decode (or do, with swap-out to CPU).
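A rough sketch of block-based admission follows. It reserves blocks for the worst case (prompt plus max new tokens) so an admitted request never needs preemption; that is a simplification, since real engines such as vLLM allocate blocks on demand. The class name `BlockPool` and the model dimensions (a Llama-70B-style GQA layout at FP8) are assumptions for illustration.

```python
import math

BLOCK_TOKENS = 16   # tokens per KV block, as in the text

def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=1):
    """Bytes of K and V for one token (defaults: assumed 70B GQA model, FP8)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

class BlockPool:
    """Toy fixed-size KV block pool with all-or-nothing admission."""

    def __init__(self, total_blocks: int):
        self.free = total_blocks
        self.held = {}                        # request_id -> blocks reserved

    def try_admit(self, request_id: str, prompt_tokens: int, max_new_tokens: int) -> bool:
        need = math.ceil((prompt_tokens + max_new_tokens) / BLOCK_TOKENS)
        if need > self.free:
            return False                      # caller queues or rejects
        self.free -= need
        self.held[request_id] = need
        return True

    def release(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool."""
        self.free += self.held.pop(request_id)
```

Under these assumed dimensions each token costs 163,840 bytes of KV cache, so a 1,000-token request reserves 63 blocks, or about 165 MB.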
5.3 Speculative Decoding
- Draft model proposes K tokens, target verifies in one forward pass.
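- Acceptance rate depends on how closely the draft matches the target (EAGLE- or Medusa-style draft heads, or a distilled small model).
- 2-3× speedup on decode for chat-style traffic at small-to-moderate batch sizes; gains shrink as batches grow and decode becomes compute-bound. Doesn't help prefill.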
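To see how acceptance rate and K interact, a common back-of-the-envelope model treats each drafted token as accepted independently with probability α. Under that assumption (a simplification; real acceptance is position- and content-dependent), the expected number of tokens emitted per target forward pass is (1 − α^(K+1)) / (1 − α).

```python
def expected_tokens_per_step(accept_rate: float, k: int) -> float:
    """Expected tokens per target forward pass with k drafted tokens,
    assuming each draft token is accepted independently with `accept_rate`."""
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(k + 1)       # all k drafts accepted plus one bonus token
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)
```

For example, α = 0.8 and K = 4 gives about 3.4 tokens per target pass; the realized speedup is lower once the draft model's own cost is subtracted.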
5.4 Routing & Admission
- Model routing by the request's model field (trivial), with small/big cascade as an option.
- Admission control: drop with 429 if backend pool queue depth > threshold (avoid death spiral).
- Per-tenant token bucket in Redis (Lua script for atomicity); bucket size = burst, refill = sustained QPS.
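The bucket logic that the Redis Lua script would carry out atomically can be sketched in-process. This is an illustrative single-process version, not the Redis implementation; `TokenBucket`, `admit`, and `max_queue_depth` are hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class TokenBucket:
    capacity: float         # burst size
    refill_per_sec: float   # sustained rate
    tokens: float           # current fill level
    updated_at: float       # timestamp of the last refill, in seconds

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

def admit(bucket: TokenBucket, now: float, queue_depth: int, max_queue_depth: int) -> int:
    """Return the HTTP status to use: 200 to proceed, 429 to shed."""
    if queue_depth > max_queue_depth:
        return 429              # pool is saturated; shed before queueing more
    return 200 if bucket.allow(now) else 429
```

Passing `now` in explicitly keeps the logic deterministic and testable; in Redis the refill and spend must happen in one script so concurrent gateways cannot both spend the same token.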
5.5 Quantization Strategy
- Weights: FP8 (or INT8) with per-channel scales — minimal accuracy loss on 70B.
- KV cache: FP8 halves KV memory relative to FP16, which roughly doubles the number of concurrent sequences that fit.
- Activations: stay BF16 to preserve accuracy.
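Per-channel scaling is easiest to illustrate with the INT8 variant, since FP8 needs hardware or library support. The sketch below is plain-Python symmetric INT8 quantization with one scale per output channel (row); it illustrates the idea only and is not a production kernel.

```python
def quantize_per_channel_int8(weight):
    """Symmetric INT8 quantization with one scale per row of `weight`."""
    q_rows, scales = [], []
    for row in weight:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales

def dequantize(q_rows, scales):
    """Recover approximate float weights from quantized rows and scales."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]
```

Because each row gets its own scale, a row with small weights is not crushed by an outlier in another row, which is why per-channel scaling loses less accuracy than a single per-tensor scale.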
6. Bottlenecks & Scaling
| Bottleneck | Symptom | Fix |
|---|---|---|
| GPU memory (KV cache) | OOM under high concurrency | PagedAttention + FP8 KV + smaller max_num_seqs |
| Prefill latency on long contexts | High TTFT | Chunked prefill; prefix cache; speculative prefill |
| Decode bound by memory bandwidth | Low compute utilization (MFU) despite slow decode | FP8 weights; speculative decoding; MoE routing |
| Single backend hot-spotted | Tail latency spikes | Power-of-2-choices load balancing; circuit breaker |
| Gateway CPU on JSON+SSE | High CPU for proxy | Write gateway in Go/Rust; zero-copy stream proxy |
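Power-of-2-choices from the table above is short enough to sketch in full. The `inflight` map of backend name to in-flight request count is an assumed data structure; a real gateway might rank by queue depth or KV-cache usage instead.

```python
import random

def pick_backend(inflight: dict, rng: random.Random) -> str:
    """Sample two distinct backends and return the less loaded one.

    Requires at least two backends in `inflight`.
    """
    a, b = rng.sample(list(inflight), 2)
    return a if inflight[a] <= inflight[b] else b
```

Compared with picking the global minimum, this avoids herding every gateway onto the same backend when load data is slightly stale, while still steering away from hot spots.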
7. Failure Modes
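- Backend crash: health-check at /health every 1s; eject; route to peers; fail in-flight requests with 503 (or, if streaming has already begun, with an error event on the open stream, since the 200 status has already been sent).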
- OOM cascade: admission control with global token-budget; load-shed lowest-priority traffic.
- Slow client (back-pressure): bounded outbound buffer; disconnect if buffer fills (the model keeps generating into the void otherwise).
- Bad input (jailbreak / 1M-token DoS): max-context check at gateway, before reaching GPU.
- Stuck batch (one request never returns): per-request deadline; preempt & evict.
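The slow-client case can be sketched with a bounded queue between the backend reader and the client writer. This is a minimal illustration using `asyncio.Queue`; the names `OutboundStream` and `SlowClientError` are hypothetical, and the `cancelled` flag stands in for whatever cancellation signal the backend protocol provides.

```python
import asyncio

class SlowClientError(Exception):
    """Raised when a client's outbound buffer is full."""

class OutboundStream:
    def __init__(self, max_buffered: int):
        self.queue = asyncio.Queue(maxsize=max_buffered)
        self.cancelled = False

    def push(self, token: str) -> None:
        """Called by the backend reader for every generated token."""
        try:
            self.queue.put_nowait(token)      # never blocks the reader
        except asyncio.QueueFull:
            self.cancelled = True             # tell the backend to stop generating
            raise SlowClientError("outbound buffer full") from None

    async def next_token(self) -> str:
        """Called by the writer that flushes tokens to the client socket."""
        return await self.queue.get()
```

Using `put_nowait` rather than an awaiting `put` is the design choice that matters: the reader never stalls on one slow client, and the overflow becomes an explicit disconnect plus cancellation instead of unbounded memory growth.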
8. Observability
Metrics (every one labeled by model + tenant):
- ttft_seconds_bucket (p50/p95/p99)
- inter_token_latency_seconds_bucket
- tokens_generated_total, tokens_prompt_total
- batch_size, running_seqs, waiting_seqs
- kv_cache_usage_bytes / kv_cache_total_bytes
- gpu_utilization, gpu_memory_utilization
- requests_total{status}, request_duration_seconds_bucket

Logs: structured JSON, sampled (1% success, 100% errors), with request_id.
Traces: OpenTelemetry from edge → gateway → backend; spans for prefill / each decode step.
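The two headline latencies fall out of per-token timestamps. The sketch below shows how TTFT and ITL samples might be derived before being observed into histograms; the function name and return shape are illustrative.

```python
def latency_metrics(request_start: float, token_times: list) -> dict:
    """Derive TTFT and inter-token latencies from token arrival timestamps."""
    if not token_times:
        raise ValueError("no tokens were produced")
    ttft = token_times[0] - request_start
    # One ITL sample per consecutive pair of tokens.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {"ttft_seconds": ttft, "inter_token_latency_seconds": itl}
```

TTFT yields one sample per request, while ITL yields one per generated token after the first, so the ITL histogram receives far more observations.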
9. Cost Model
Per million output tokens served (rough, 7B fp8 on H100):
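- Compute: ~$0.20. This implies roughly $2/H100/hr at ~3000 tok/s; at the $4/hr rate used in Section 2, the same throughput costs ~$0.37.
- Memory bandwidth dominates → quantization is a direct $ savings
- A published price of $0.50/Mtok is ~2.5× the $0.20 compute cost; the markup covers reserved-instance overhead, idle capacity, networking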
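The per-token compute cost is GPU price divided by throughput. A minimal sketch, where the hourly rates and the ~3000 tok/s figure are the rough assumptions used elsewhere in these notes:

```python
def compute_cost_per_mtok(gpu_hour_usd: float, tok_per_sec_per_gpu: float) -> float:
    """Compute cost in USD per million tokens on a single GPU."""
    tokens_per_hour = tok_per_sec_per_gpu * 3600
    return gpu_hour_usd / tokens_per_hour * 1e6

on_demand = compute_cost_per_mtok(4.0, 3000)   # $4/hr
reserved = compute_cost_per_mtok(2.0, 3000)    # assumed ~$2/hr reserved pricing
print(f"${on_demand:.2f} vs ${reserved:.2f} per Mtok")
```

At 3000 tok/s this gives roughly $0.37 per Mtok at $4/hr and $0.19 at $2/hr, so the ~$0.20 figure corresponds to the cheaper rate or to higher per-GPU throughput.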
10. Tradeoffs & Alternatives
| Choice | Alternative | When to switch |
|---|---|---|
| vLLM | TensorRT-LLM | When you need absolute peak throughput on NVIDIA & can pin to specific shapes |
| vLLM | TGI (HuggingFace) | When tighter HF Hub integration matters more than raw perf |
| Self-host | Bedrock / Vertex / Together | When you can't justify the GPU capex / on-call burden |
| FP8 weights | INT4 (AWQ/GPTQ) | When memory is the bottleneck and you accept slight quality loss |
| Speculative decoding | Bigger batch | When aggregate throughput matters more than per-request decode latency (batch/offline use) |
| Tensor parallelism | Pipeline parallelism | When the model no longer fits on one node; within a node, TP has lower latency |
Bonus: 60-Second Pitch
"I'd put an Envoy edge for TLS/auth, a Go gateway for routing and admission, and pools of vLLM backends — one per model size. Continuous batching with PagedAttention gives ~5× throughput vs static; FP8 weights and KV-cache cut memory in half. Per-tenant Redis token-bucket prevents noisy-neighbor problems. Prefix caching eliminates redundant prefill on shared system prompts. Hot LoRA swap for tenant-specific fine-tunes. OTel from end to end, with TTFT and ITL as the headline SLOs. At 100k QPS we're talking ~70k H100s — so the next conversation is about model cascade, speculative decoding, and MoE to bring that number down."