01 — LLM Inference Gateway @ 100k QPS

Roles: LLM Inference Engineer · LLM Infrastructure Engineer · Foundation Model Engineer
Asked at: Anthropic, OpenAI, Together, Fireworks, Anyscale, Databricks, Cohere

1. Clarifying Questions

Functional

What models? (Mix: 1× large 70B-class, 2× medium 7-13B, 3× small 0.5-3B?)
Modality? (Text only, or multimodal?)
Streaming? (Almost always yes — TTFT matters for UX.)
Tool/function calling? Structured outputs (JSON)?
BYO model fine-tunes (LoRA hot-swap), or fixed catalog?

Non-functional

100k QPS — peak or steady? Globally distributed or one region?
SLOs? (Typical: TTFT p99 < 1s, ITL p99 < 50ms, availability 99.9%.)
Max context? (32k? 128k? 1M?) — drives KV-cache memory.
Cost target? ($/Mtok input, $/Mtok output)
Multi-tenant fairness? (Don't let one tenant starve others.)

2. Capacity Estimation

Assumptions: 100k QPS, avg input 800 tok, avg output 200 tok, 70/30 split between 7B and 70B traffic.

Metric	Computation	Value
Tokens/sec (input + output)	100k × 1000	100M tok/s
7B traffic	70k QPS × 1000 tok	70M tok/s
70B traffic	30k QPS × 1000 tok	30M tok/s
7B throughput / H100 (fp8, BS≈128)	~3000 tok/s decode	→ ~23k H100s for 7B
70B throughput / H100 (TP=4, fp8)	~600 tok/s effective per H100	→ ~50k H100s for 70B
Total GPUs		~70k H100s
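KV-cache @ 128k ctx, 70B	2 (K,V) × 80 layers × 8 KV heads × 128 head dim × 2 B × 128k tok (Llama-70B-style GQA config, an assumption)	~40 GB / request at FP16, ~20 GB at FP8; TP+paged required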

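Sanity: at $4/H100/hr that's ~$2.5B/yr just in compute. This is an upper bound, because it charges the 800 prompt tokens at the decode rate and prefill processes tokens much faster than decode. Even so, the number only works if (a) avg context is much lower, (b) the price per token is high, or (c) you push hard on quantization, batching, speculative decoding, MoE.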
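The arithmetic behind the table can be reproduced in a few lines. This is a minimal sketch that uses only the assumptions stated above; the function name `gpus_needed` is illustrative, and per-GPU throughput is the rough figure from the table, not a measured value.

```python
TOKENS_PER_REQUEST = 800 + 200      # avg input + avg output tokens
HOURS_PER_YEAR = 24 * 365
H100_DOLLARS_PER_HOUR = 4           # price assumed in the sanity check


def gpus_needed(qps: int, tok_per_sec_per_gpu: int) -> int:
    """GPUs needed to serve `qps` requests/s at the given per-GPU token rate."""
    tokens_per_sec = qps * TOKENS_PER_REQUEST
    # Integer ceiling division: a fractional GPU still means one more GPU.
    return -(-tokens_per_sec // tok_per_sec_per_gpu)


gpus_7b = gpus_needed(70_000, 3000)   # 70M tok/s over ~3000 tok/s per H100
gpus_70b = gpus_needed(30_000, 600)   # 30M tok/s over ~600 tok/s per H100
total_gpus = gpus_7b + gpus_70b
yearly_cost = total_gpus * H100_DOLLARS_PER_HOUR * HOURS_PER_YEAR

print(gpus_7b, gpus_70b, total_gpus)
print(f"${yearly_cost / 1e9:.2f}B per year")
```

Summing the unrounded rows gives slightly more than the table's rounded ~70k total, which is why the yearly figure lands a little above $2.5B.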

3. API & Data Model

Public API (OpenAI-compatible):

POST /v1/chat/completions
Authorization: Bearer sk-...
Content-Type: application/json

{
  "model": "anthropic/claude-3-haiku",
  "messages": [...],
  "max_tokens": 512,
  "stream": true,
  "temperature": 0.7
}

Streaming response: text/event-stream, one SSE event per token (or token batch).
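To make the wire format concrete, here is a hedged sketch of how text chunks could be framed as SSE events. The payload shape is a simplified stand-in for an OpenAI-style chunk (real chunks carry more fields), and `sse_events` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def sse_events(token_chunks: Iterable[str]) -> Iterator[bytes]:
    """Encode text chunks as Server-Sent Events frames."""
    for chunk in token_chunks:
        # Simplified payload; an assumption, not the full chunk schema.
        payload = json.dumps({"choices": [{"delta": {"content": chunk}}]})
        # An SSE event is a "data: ..." line terminated by a blank line.
        yield f"data: {payload}\n\n".encode("utf-8")
    # OpenAI-compatible streams end with a literal [DONE] event.
    yield b"data: [DONE]\n\n"


frames = list(sse_events(["Hel", "lo"]))
for frame in frames:
    print(frame)
```

Sending one event per token batch rather than per token reduces framing overhead at the cost of slightly coarser streaming.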

Internal protocol (gateway ↔ backend): gRPC with bidirectional streaming, or HTTP/2. Carry: request_id, tenant_id, prompt tokens, sampling params, deadline.

4. High-Level Architecture

                           ┌──────────────────┐
   Client ──TLS──► [ALB] ─►│  Edge (Envoy)    │ ── auth, rate-limit, WAF
                           └────────┬─────────┘
                                    ▼
                           ┌──────────────────┐
                           │  Gateway (Go)    │ ── routing, batching policy,
                           │  - router        │    metering, fallback,
                           │  - admission ctl │    SSE proxy
                           └────────┬─────────┘
                  ┌─────────────────┼─────────────────┐
                  ▼                 ▼                 ▼
          [Pool: 7B vLLM]   [Pool: 13B vLLM]   [Pool: 70B vLLM TP=4]
          - PagedAttention  - PagedAttention   - PagedAttention
          - cont. batching  - cont. batching   - cont. batching
          - prefix cache    - prefix cache     - prefix cache
                  ▲                 ▲                 ▲
                  └─────────────────┴─────────────────┘
                            ▲
                  ┌─────────┴──────────────┐
                  │  Control Plane         │
                  │  - service discovery   │
                  │  - autoscaler (KPA)    │
                  │  - LoRA manager        │
                  └────────────────────────┘
   Shared services: Redis (rate limit/cache) · Kafka (logs/usage) · Prometheus · OTel

5. Deep Dives

5.1 Continuous Batching (the single biggest lever)

Static batching wastes compute: a batch finishes when its slowest sequence finishes.
Continuous batching (Orca, vLLM): at every decode step, evict finished sequences and admit new ones.
Effect: 3-10× throughput at the same latency, depending on output-length variance.
Knobs: max_num_seqs, max_num_batched_tokens, scheduling policy (FCFS vs prefill-first).
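The difference between the two policies shows up even in a toy simulation. The sketch below counts decode steps only (prefill cost and memory limits are ignored), and both function names are illustrative rather than taken from any serving framework.

```python
from collections import deque
from typing import List


def static_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """Decode steps when each batch runs until its slowest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """Decode steps when finished sequences are replaced at every step."""
    waiting = deque(output_lens)
    running: List[int] = []      # remaining tokens per running sequence
    steps = 0
    while waiting or running:
        # Admit waiting sequences into free slots, first come first served.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1               # one decode step for the whole batch
        # Every running sequence emits one token; finished ones leave.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


lens = [2, 10, 2, 10, 2, 2, 2, 2]
static_steps = static_batching_steps(lens, batch_size=2)
continuous_steps = continuous_batching_steps(lens, batch_size=2)
print(static_steps, continuous_steps)  # 24 16
```

With mixed short and long outputs the static schedule idles a slot while the long sequence finishes; the continuous schedule reaches the lower bound of 32 tokens over 2 slots.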

5.2 PagedAttention + KV-Cache Management

KV cache is paged (16-token blocks), like virtual memory.
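Eliminates external fragmentation and limits internal fragmentation to the last, partially filled block of each sequence; enables sharing blocks across requests with the same prefix.
Prefix caching: if 80% of requests share an identical system prompt, those requests skip prefill for the shared tokens and only pay for the suffix.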
Memory pressure → admission control: refuse new request if it can't fit, don't preempt mid-decode (or do, with swap-out to CPU).
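The paging and admission ideas above can be sketched with a toy allocator. This is an illustration under simplifying assumptions, not vLLM's implementation: it reserves blocks for the prompt only, never grows a sequence during decode, and does not share prefix blocks. `BlockAllocator` and its methods are hypothetical names.

```python
import math
from typing import Dict, List

BLOCK_TOKENS = 16  # tokens per KV block, matching the block size in the text


class BlockAllocator:
    """Toy paged KV-cache: fixed-size blocks plus a per-sequence block table."""

    def __init__(self, num_blocks: int) -> None:
        self.free: List[int] = list(range(num_blocks))
        self.tables: Dict[str, List[int]] = {}  # seq_id -> physical block ids

    def admit(self, seq_id: str, num_tokens: int) -> bool:
        """Reserve blocks for a prompt; refuse rather than over-commit memory."""
        needed = math.ceil(num_tokens / BLOCK_TOKENS)
        if needed > len(self.free):
            return False  # admission control: the caller queues or rejects
        self.tables[seq_id] = [self.free.pop() for _ in range(needed)]
        return True

    def release(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id))


alloc = BlockAllocator(num_blocks=4)
first = alloc.admit("a", 40)    # needs 3 blocks, fits
second = alloc.admit("b", 20)   # needs 2 blocks, only 1 free, refused
alloc.release("a")
third = alloc.admit("b", 20)    # fits after "a" is released
print(first, second, third)
```

Because blocks need not be contiguous, any free block can serve any sequence, which is what removes external fragmentation.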

5.3 Speculative Decoding

Draft model proposes K tokens, target verifies in one forward pass.
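Acceptance rate depends on how closely the drafter tracks the target. The drafter can be a distilled small model or extra prediction heads on the target itself (EAGLE, Medusa).
2-3× speedup on decode for chat-style traffic at small-to-moderate batch sizes; the gain shrinks as batches grow and decode becomes compute-bound. Doesn't help prefill.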
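A rough way to reason about the speedup is the expected number of tokens emitted per target forward pass. The sketch below uses the standard simplifying assumption that each drafted token is accepted independently with the same probability; `expected_tokens_per_verify` is an illustrative name.

```python
def expected_tokens_per_verify(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per target forward pass with k drafted tokens.

    Assumes each drafted token is accepted independently with probability
    accept_rate. The target always contributes one token of its own, so the
    result is the geometric sum 1 + a + a^2 + ... + a^k.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(k + 1)
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)


tokens = expected_tokens_per_verify(accept_rate=0.8, k=4)
print(round(tokens, 4))  # 3.3616
```

This counts tokens per target pass, not wall-clock speedup; the drafter's own cost has to be subtracted, which is how ~3.4 tokens per pass ends up as a 2-3× gain in practice.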

5.4 Routing & Admission

Model routing by model field (trivial), with small/big cascade as an option.
Admission control: drop with 429 if backend pool queue depth > threshold (avoid death spiral).
Per-tenant token bucket in Redis (Lua script for atomicity); bucket size = burst, refill = sustained QPS.
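The token-bucket logic itself is small. Below is an in-process sketch with an injectable clock for testing; a Redis deployment would run the same refill-then-decrement step atomically in a Lua script, keyed by tenant. The class and parameter names are illustrative.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: `capacity` bounds bursts, `refill_per_sec` bounds sustained rate."""

    def __init__(self, capacity: float, refill_per_sec: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock
        self.tokens = capacity       # start full so an initial burst is allowed
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False                 # caller responds with 429


fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_sec=1, clock=lambda: fake_now[0])
burst = [bucket.allow(), bucket.allow(), bucket.allow()]
fake_now[0] = 1.0                    # one second later, one token has refilled
after_refill = [bucket.allow(), bucket.allow()]
print(burst, after_refill)
```

For LLM traffic the `cost` argument can be the request's token count instead of 1, so that limits track tokens rather than requests.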

5.5 Quantization Strategy

Weights: FP8 (or INT8) with per-channel scales — minimal accuracy loss on 70B.
KV cache: FP8 halves KV memory per token, so roughly twice as many sequences (or twice the context) fit in the same HBM.
Activations: stay BF16 to preserve accuracy.
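Per-channel scaling is easiest to see with the INT8 variant. The following pure-Python sketch quantizes each output channel (row) with its own scale; FP8 uses the same idea with a floating-point target format. The function names are illustrative, and a real kernel would do this on tensors rather than lists.

```python
from typing import List, Tuple


def quantize_per_channel(weights: List[List[float]]) -> Tuple[List[List[int]], List[float]]:
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    q_rows: List[List[int]] = []
    scales: List[float] = []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        # Round to the nearest integer and clamp to the symmetric INT8 range.
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> List[List[float]]:
    """Recover approximate weights by multiplying each row by its scale."""
    return [[q * s for q in row] for row, s in zip(q_rows, scales)]


weights = [[0.3, -1.0, 0.2], [10.0, -20.0, 5.0]]  # rows differ 20x in magnitude
q, scales = quantize_per_channel(weights)
restored = dequantize(q, scales)
print(q, scales)
```

With a single per-tensor scale the small-magnitude row would be squeezed into a handful of integer levels; a per-row scale lets each row use the full range.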

6. Bottlenecks & Scaling

Bottleneck	Symptom	Fix
GPU memory (KV cache)	OOM under high concurrency	PagedAttention + FP8 KV + lower max_num_seqs
Prefill latency on long contexts	High TTFT	Chunked prefill; prefix cache; speculative prefill
Decode bound by memory bandwidth	Low GPU util but slow	FP8 weights; speculative decoding; MoE routing
Single backend hot-spotted	Tail latency spikes	Power-of-2-choices load balancing; circuit breaker
Gateway CPU on JSON+SSE	High CPU for proxy	Write gateway in Go/Rust; zero-copy stream proxy
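Power-of-2-choices from the hot-spot row is simple enough to sketch: sample two backends at random and send the request to the one with the shorter queue. The helper name `pick_backend` and the queue-depth map are assumptions for illustration.

```python
import random
from typing import Dict, Optional


def pick_backend(queue_depths: Dict[str, int],
                 rng: Optional[random.Random] = None) -> str:
    """Sample two distinct backends at random and return the less loaded one."""
    chooser = rng if rng is not None else random.Random()
    names = list(queue_depths)
    if len(names) == 1:
        return names[0]
    a, b = chooser.sample(names, 2)
    return a if queue_depths[a] <= queue_depths[b] else b


depths = {"gpu-0": 1, "gpu-1": 9, "gpu-2": 3}
choice = pick_backend(depths, random.Random(0))
print(choice)
```

The most loaded backend can never win a comparison, so it receives no new traffic until its queue drains, without the gateway having to scan every backend per request.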

7. Failure Modes

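Backend crash: health-check /health every 1s; eject the backend and route new requests to peers. Fail in-flight requests with 503 if no response bytes have been sent; once a stream has started the status line is already out, so send a terminal SSE error event and close.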
OOM cascade: admission control with global token-budget; load-shed lowest-priority traffic.
Slow client (back-pressure): bounded outbound buffer; disconnect if buffer fills (the model keeps generating into the void otherwise).
Bad input (1M-token DoS): max-context check at the gateway, before the request reaches a GPU. A length check does not stop jailbreaks; those need a separate content filter.
Stuck batch (one request never returns): per-request deadline; preempt & evict.
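The slow-client case can be sketched with a bounded queue between the backend stream and the client socket. This is an illustrative pattern using `asyncio.Queue`; `OutboundBuffer` and `SlowClientError` are hypothetical names, and the actual disconnect and upstream cancellation are left to the caller.

```python
import asyncio


class SlowClientError(Exception):
    """Raised when a client's outbound buffer is full."""


class OutboundBuffer:
    """Bounded per-connection buffer between the backend stream and the client."""

    def __init__(self, max_events: int) -> None:
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=max_events)

    def push(self, event: bytes) -> None:
        """Enqueue one generated chunk without ever blocking the backend reader."""
        try:
            self.queue.put_nowait(event)
        except asyncio.QueueFull:
            # Caller should close the connection and cancel generation upstream.
            raise SlowClientError("client is not draining its stream") from None


async def demo() -> bool:
    buf = OutboundBuffer(max_events=2)
    buf.push(b"tok1")
    buf.push(b"tok2")
    try:
        buf.push(b"tok3")            # buffer full: the client is too slow
    except SlowClientError:
        return True
    return False


overflowed = asyncio.run(demo())
print(overflowed)
```

Failing fast on a full buffer keeps one stalled connection from holding a batch slot and KV-cache blocks indefinitely.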

8. Observability

Metrics (every one labeled by model + tenant):

ttft_seconds_bucket (p50/p95/p99)
inter_token_latency_seconds_bucket
tokens_generated_total, tokens_prompt_total
batch_size, running_seqs, waiting_seqs
kv_cache_usage_bytes / kv_cache_total_bytes
gpu_utilization, gpu_memory_utilization
requests_total{status}, request_duration_seconds_bucket
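The two headline latencies fall out of per-token timestamps. A minimal sketch, assuming the gateway records the request arrival time and the time each token was flushed to the client (the function name is illustrative):

```python
from typing import List, Tuple


def ttft_and_itl(request_start: float, token_times: List[float]) -> Tuple[float, List[float]]:
    """Return time-to-first-token and the inter-token latencies, in seconds."""
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_start
    # Gap between each consecutive pair of token timestamps.
    itl = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, itl


ttft, itl = ttft_and_itl(10.0, [10.5, 10.75, 11.0])
print(ttft, itl)  # 0.5 [0.25, 0.25]
```

Each value would then be observed into the corresponding histogram, labeled by model and tenant.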

Logs: structured JSON, sampled (1% success, 100% errors), with request_id.
Traces: OpenTelemetry from edge → gateway → backend; spans for prefill / each decode step.

9. Cost Model

Per million output tokens served (rough, 7B fp8 on H100):

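Compute: ~$0.20 (at $4/hr this assumes ~5,500 tok/s per H100; at the ~3000 tok/s used in section 2 it is ~$0.37)
Memory bandwidth dominates → quantization is a direct $ savings
Publishing at $0.50/Mtok is a ~2.5× markup over $0.20 compute; it has to cover reserved-instance overhead, idle capacity, networking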
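The compute figure is just the hourly GPU price divided by hourly token throughput. A small sketch makes the sensitivity to throughput explicit, using the $4/hr H100 price from section 2; the function name is illustrative.

```python
def dollars_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_sec: float) -> float:
    """Compute cost per million tokens for one GPU at a given throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1_000_000


at_3000 = dollars_per_million_tokens(4.0, 3000)   # section 2 throughput
at_5500 = dollars_per_million_tokens(4.0, 5500)   # throughput implied by ~$0.20
print(f"${at_3000:.2f}/Mtok at 3000 tok/s, ${at_5500:.2f}/Mtok at 5500 tok/s")
```

At $4/hr, roughly 3000 tok/s gives about $0.37/Mtok and about 5,500 tok/s gives about $0.20/Mtok, so throughput gains from batching and quantization translate almost linearly into cost.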

10. Tradeoffs & Alternatives

Choice	Alternative	When to switch
vLLM	TensorRT-LLM	When you need absolute peak throughput on NVIDIA & can pin to specific shapes
vLLM	TGI (HuggingFace)	When tighter HF Hub integration matters more than raw perf
Self-host	Bedrock / Vertex / Together	When you can't justify the GPU capex / on-call burden
FP8 weights	INT4 (AWQ/GPTQ)	When memory is the bottleneck and you accept slight quality loss
Speculative decoding	Bigger batch	When throughput matters more than per-token latency (batch or offline use)
Tensor parallelism	Pipeline parallelism	When the model no longer fits on one node; within a node, TP has lower latency

Bonus: 60-Second Pitch

"I'd put an Envoy edge for TLS/auth, a Go gateway for routing and admission, and pools of vLLM backends — one per model size. Continuous batching with PagedAttention gives ~5× throughput vs static; FP8 weights and KV-cache cut memory in half. Per-tenant Redis token-bucket prevents noisy-neighbor problems. Prefix caching eliminates redundant prefill on shared system prompts. Hot LoRA swap for tenant-specific fine-tunes. OTel from end to end, with TTFT and ITL as the headline SLOs. At 100k QPS we're talking ~70k H100s — so the next conversation is about model cascade, speculative decoding, and MoE to bring that number down."