01 — LLM Inference Gateway @ 100k QPS

Roles: LLM Inference Engineer · LLM Infrastructure Engineer · Foundation Model Engineer
Asked at: Anthropic, OpenAI, Together, Fireworks, Anyscale, Databricks, Cohere


1. Clarifying Questions

Functional

  • What models? (Mix: 1× large 70B-class, 2× medium 7-13B, 3× small 0.5-3B?)
  • Modality? (Text only, or multimodal?)
  • Streaming? (Almost always yes — TTFT matters for UX.)
  • Tool/function calling? Structured outputs (JSON)?
  • BYO model fine-tunes (LoRA hot-swap), or fixed catalog?

Non-functional

  • 100k QPS — peak or steady? Globally distributed or one region?
  • SLOs? (Typical: TTFT p99 < 1s, ITL p99 < 50ms, availability 99.9%.)
  • Max context? (32k? 128k? 1M?) — drives KV-cache memory.
  • Cost target? ($/Mtok input, $/Mtok output)
  • Multi-tenant fairness? (Don't let one tenant starve others.)
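
To see why max context drives KV-cache memory, here is a rough sizing sketch. The layer count, KV-head count, and head dimension are assumptions chosen to resemble a Llama-style 70B model with grouped-query attention; substitute the real config of whatever model is being served.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int) -> int:
    """KV-cache size for one request: K and V tensors for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * context_tokens


# Assumed architecture (Llama-style 70B with GQA); not taken from the notes above.
ASSUMED_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for context in (32 * 1024, 128 * 1024):
    for dtype, width in (("fp16", 2), ("fp8", 1)):
        size = kv_cache_bytes(context, bytes_per_elem=width, **ASSUMED_70B)
        print(f"{context // 1024:>4}k ctx, {dtype}: {size / 2**30:.0f} GiB per request")
```

Under these assumptions a single 128k-token request needs about 40 GiB of KV cache at FP16 and about 20 GiB at FP8, which is why the context-length answer changes the whole memory plan.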

2. Capacity Estimation

Assumptions: 100k QPS, avg input 800 tok, avg output 200 tok, 70/30 split between 7B and 70B traffic.

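| Metric                             | Computation                             | Value              |
|------------------------------------|-----------------------------------------|--------------------|
| Tokens/sec (input + output)        | 100k × 1000                             | 100M tok/s         |
| 7B traffic                         | 70k QPS × 1000 tok                      | 70M tok/s          |
| 70B traffic                        | 30k QPS × 1000 tok                      | 30M tok/s          |
| 7B throughput / H100 (fp8, BS≈128) | ~3000 tok/s decode                      | ~23k H100s for 7B  |
| 70B throughput / H100 (TP=4, fp8)  | ~600 tok/s effective per H100           | ~50k H100s for 70B |
| Total GPUs                         | 23k + 50k ≈ 73k                         | ~70k H100s         |
| KV-cache @ 128k ctx, 70B           | ~20 GB / request at FP8 (~40 GB at FP16), assuming Llama-style GQA (80 layers, 8 KV heads, head dim 128) | TP+paged required |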

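Sanity: at $4/H100/hr, ~70k GPUs × 8,760 h/yr comes to ~$2.5B/yr in compute alone. This is an upper bound, because it charges prompt tokens at decode throughput, while prefill is compute-bound and processes tokens far faster than decode. Even so, the number only works if (a) avg context is much lower, (b) the price per token is high, or (c) you push hard on quantization, batching, speculative decoding, and MoE.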
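
The estimate can be reproduced in a few lines. This is a back-of-envelope sketch using only the assumptions stated in this section; the constant and function names are illustrative.

```python
# Back-of-envelope GPU count and yearly cost from the Section 2 assumptions.
QPS = 100_000
TOKENS_PER_REQUEST = 800 + 200        # avg input + avg output
SHARE_7B, SHARE_70B = 0.70, 0.30      # traffic split
TOK_S_PER_GPU_7B = 3_000              # per-H100 throughput assumed in the table
TOK_S_PER_GPU_70B = 600
USD_PER_GPU_HOUR = 4.0
HOURS_PER_YEAR = 24 * 365


def gpus_needed(share: float, tok_s_per_gpu: float) -> float:
    """GPUs required to serve `share` of the total token traffic."""
    return QPS * TOKENS_PER_REQUEST * share / tok_s_per_gpu


gpus_7b = gpus_needed(SHARE_7B, TOK_S_PER_GPU_7B)
gpus_70b = gpus_needed(SHARE_70B, TOK_S_PER_GPU_70B)
total_gpus = gpus_7b + gpus_70b
yearly_cost = total_gpus * USD_PER_GPU_HOUR * HOURS_PER_YEAR

print(f"7B pool:  {gpus_7b:,.0f} H100s")
print(f"70B pool: {gpus_70b:,.0f} H100s")
print(f"total:    {total_gpus:,.0f} H100s, ${yearly_cost / 1e9:.2f}B/yr")
```

Changing any single input (average context, per-GPU throughput, traffic split) shows immediately how sensitive the total is, which is the point to make in the interview.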


3. API & Data Model

Public API (OpenAI-compatible):

POST /v1/chat/completions
Authorization: Bearer sk-...
Content-Type: application/json
{
  "model": "anthropic/claude-3-haiku",
  "messages": [...],
  "max_tokens": 512,
  "stream": true,
  "temperature": 0.7
}

Streaming response: text/event-stream, one SSE event per token (or token batch).
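
A minimal sketch of the SSE framing, assuming a simplified chunk shape modeled on OpenAI-style streaming (`choices[0].delta.content`, terminated by `data: [DONE]`). The helper names are hypothetical, and real chunks carry more fields.

```python
import json
from typing import Iterable, Iterator


def sse_event(payload: dict) -> bytes:
    """Encode one JSON payload as a Server-Sent Events frame."""
    return f"data: {json.dumps(payload)}\n\n".encode("utf-8")


def stream_tokens(request_id: str, tokens: Iterable[str]) -> Iterator[bytes]:
    """Yield one SSE frame per token, then the end-of-stream sentinel."""
    for token in tokens:
        yield sse_event({"id": request_id,
                         "choices": [{"delta": {"content": token}}]})
    yield b"data: [DONE]\n\n"


def parse_sse(stream: bytes) -> str:
    """Client side: reassemble the generated text from the raw SSE bytes."""
    pieces = []
    for frame in stream.decode("utf-8").split("\n\n"):
        if not frame.startswith("data: "):
            continue
        body = frame[len("data: "):]
        if body == "[DONE]":
            break
        pieces.append(json.loads(body)["choices"][0]["delta"]["content"])
    return "".join(pieces)


frames = list(stream_tokens("req-1", ["Hel", "lo"]))
print(parse_sse(b"".join(frames)))
```

Because `json.dumps` escapes newlines inside the token text, the blank-line frame delimiter stays unambiguous.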

Internal protocol (gateway ↔ backend): gRPC with bidirectional streaming, or HTTP/2. Carry: request_id, tenant_id, prompt tokens, sampling params, deadline.


4. High-Level Architecture

                           ┌──────────────────┐
   Client ──TLS──► [ALB] ─►│  Edge (Envoy)    │ ── auth, rate-limit, WAF
                           └────────┬─────────┘
                                    ▼
                           ┌──────────────────┐
                           │  Gateway (Go)    │ ── routing, batching policy,
                           │  - router        │    metering, fallback,
                           │  - admission ctl │    SSE proxy
                           └────────┬─────────┘
                  ┌─────────────────┼─────────────────┐
                  ▼                 ▼                 ▼
          [Pool: 7B vLLM]   [Pool: 13B vLLM]   [Pool: 70B vLLM TP=4]
          - PagedAttention  - PagedAttention   - PagedAttention
          - cont. batching  - cont. batching   - cont. batching
          - prefix cache    - prefix cache     - prefix cache
                  ▲                 ▲                 ▲
                  └─────────────────┴─────────────────┘
                            ▲
                  ┌─────────┴──────────┐
                  │  Control Plane     │
                  │  - service discovery
                  │  - autoscaler (KPA)
                  │  - LoRA manager
                  └────────────────────┘
   Side-cars: Redis (RL/cache) · Kafka (logs/usage) · Prometheus · OTel

5. Deep Dives

5.1 Continuous Batching (the single biggest lever)

  • Static batching wastes compute: a batch finishes when its slowest sequence finishes.
  • Continuous batching (Orca, vLLM): at every decode step, evict finished sequences and admit new ones.
  • Effect: 3-10× throughput at the same latency, depending on output-length variance.
  • Knobs: max_num_seqs, max_num_batched_tokens, scheduling policy (FCFS vs prefill-first).
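
The difference between the two policies can be illustrated with a toy simulation that counts decode steps only. It ignores prefill and assumes every running sequence produces one token per step; the request lengths are made up.

```python
from collections import deque
from typing import List


def static_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(output_lens), batch_size):
        steps += max(output_lens[i:i + batch_size])
    return steps


def continuous_batching_steps(output_lens: List[int], batch_size: int) -> int:
    """At every decode step, evict finished sequences and admit waiting ones."""
    waiting = deque(output_lens)
    running: List[int] = []              # tokens still to generate per sequence
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())        # admit into free slots
        running = [remaining - 1 for remaining in running]   # one decode step
        steps += 1
        running = [remaining for remaining in running if remaining > 0]  # evict
    return steps


lens = [10, 200, 10, 10, 200, 10, 10, 10]   # high output-length variance
print(static_batching_steps(lens, batch_size=4))       # 400
print(continuous_batching_steps(lens, batch_size=4))   # 210
```

The gap widens as output-length variance grows, since short requests no longer wait on the longest member of their batch.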

5.2 PagedAttention + KV-Cache Management

  • KV cache is paged (16-token blocks), like virtual memory.
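  • Eliminates external fragmentation and limits internal fragmentation to the last, partially filled block of each sequence; enables block sharing across requests with the same prefix.
  • Prefix caching: if 80% of requests share an identical system prompt, those requests reuse its cached KV blocks and skip prefill for the shared tokens.
  • Memory pressure → admission control: refuse a new request if its KV blocks can't fit, rather than preempting sequences mid-decode (or do preempt, and swap the victim's blocks out to CPU memory).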
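
A toy block allocator makes the three ideas concrete: paging, prefix sharing, and refusing requests that don't fit. This is a sketch, not vLLM's implementation; the class and method names are hypothetical, it tracks block ids rather than tensors, and it keys shared blocks by the full token prefix where a real engine would use a chained hash.

```python
from typing import Dict, List, Optional, Tuple

BLOCK_TOKENS = 16  # tokens per KV block, as in the text


class PagedKVCache:
    """Toy block allocator with prefix sharing."""

    def __init__(self, num_blocks: int) -> None:
        self.free_blocks: List[int] = list(range(num_blocks))
        self.refcount: Dict[int, int] = {}
        self.prefix_to_block: Dict[Tuple[int, ...], int] = {}
        self.block_to_prefix: Dict[int, Tuple[int, ...]] = {}

    def admit(self, token_ids: List[int]) -> Optional[List[int]]:
        """Return the request's block table, or None if it cannot fit."""
        plan = []  # one (prefix_key, shared_block) pair per block
        for start in range(0, len(token_ids), BLOCK_TOKENS):
            end = min(start + BLOCK_TOKENS, len(token_ids))
            # Only full blocks are shareable; the key is the whole prefix so far.
            key = tuple(token_ids[:end]) if end - start == BLOCK_TOKENS else None
            plan.append((key, self.prefix_to_block.get(key) if key else None))
        fresh_needed = sum(1 for _, shared in plan if shared is None)
        if fresh_needed > len(self.free_blocks):
            return None  # admission control: refuse instead of preempting
        table = []
        for key, block in plan:
            if block is None:
                block = self.free_blocks.pop()
                self.refcount[block] = 0
                if key is not None:
                    self.prefix_to_block[key] = block
                    self.block_to_prefix[block] = key
            self.refcount[block] += 1
            table.append(block)
        return table

    def release(self, table: List[int]) -> None:
        """Drop one reference per block; free blocks nobody uses any more."""
        for block in table:
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                key = self.block_to_prefix.pop(block, None)
                if key is not None:
                    del self.prefix_to_block[key]
                self.free_blocks.append(block)


cache = PagedKVCache(num_blocks=8)
system_prompt = list(range(32))                  # two full blocks
req_a = cache.admit(system_prompt + [100, 101])
req_b = cache.admit(system_prompt + [200])
print(req_a[:2] == req_b[:2], len(cache.free_blocks))   # shared prefix blocks
```

The two requests share the system prompt's two blocks and each hold one private block for their own suffix, so four of the eight blocks remain free. A production cache would also keep unreferenced prefix blocks around for later reuse instead of freeing them immediately.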

5.3 Speculative Decoding

  • Draft model proposes K tokens, target verifies in one forward pass.
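  • Acceptance rate depends on how closely the draft matches the target; draft options include EAGLE, Medusa, or a distilled small model.
  • 2-3× speedup on decode for chat-style traffic at low-to-moderate batch sizes; gains shrink as large batches make decode compute-bound. Doesn't help prefill.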
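
The propose-and-verify loop can be sketched for the greedy case, where the output is guaranteed to equal target-only greedy decoding. The two "models" below are toy next-token functions, not real networks, and the verification loop stands in for what a real engine does in one batched forward pass. Sampling-based variants use a rejection-sampling rule instead of exact match.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]   # greedy next-token function (toy stand-in)


def speculative_step(target: Model, draft: Model,
                     context: List[int], k: int) -> List[int]:
    """Draft proposes k tokens; keep the longest prefix the target agrees with."""
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(context + proposal))
    accepted: List[int] = []
    for i in range(k):
        expected = target(context + proposal[:i])
        if proposal[i] != expected:
            accepted.append(expected)     # target's correction ends the step
            return accepted
        accepted.append(proposal[i])
    accepted.append(target(context + proposal))   # all accepted: bonus token
    return accepted


def generate(target: Model, draft: Model, prompt: List[int],
             max_new: int, k: int) -> Tuple[List[int], int]:
    """Return (new tokens, number of target verification passes)."""
    out, passes = list(prompt), 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(target, draft, out, k))
        passes += 1
    return out[len(prompt):len(prompt) + max_new], passes


def toy_target(ctx: List[int]) -> int:
    return (ctx[-1] + 1) % 10


def toy_draft(ctx: List[int]) -> int:
    return 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 10   # wrong after a 7


tokens, passes = generate(toy_target, toy_draft, [0], max_new=12, k=4)
print(tokens, passes)
```

Here 12 tokens are produced with 3 target passes instead of 12, and a single draft mistake costs only the rest of that step's proposal.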

5.4 Routing & Admission

  • Model routing by model field (trivial), with small/big cascade as an option.
  • Admission control: drop with 429 if backend pool queue depth > threshold (avoid death spiral).
  • Per-tenant token bucket in Redis (Lua script for atomicity); bucket size = burst, refill = sustained QPS.
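
The admission path can be sketched in-process. In production the bucket state would live in Redis and the refill-then-consume logic would run inside a Lua script so it is atomic; this sketch only shows the arithmetic. The class, the function, and the choice to check queue depth before spending tenant tokens are assumptions.

```python
class TokenBucket:
    """Per-tenant bucket: `capacity` is the burst, `refill_per_sec` the sustained rate."""

    def __init__(self, capacity: float, refill_per_sec: float, now: float) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)
        self.updated_at = now

    def try_acquire(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then spend `cost` tokens if available."""
        elapsed = max(0.0, now - self.updated_at)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated_at = max(self.updated_at, now)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


def admit(bucket: TokenBucket, queue_depth: int, max_queue_depth: int,
          now: float) -> int:
    """Return the HTTP status the gateway would send: 200 to proceed, 429 to shed."""
    if queue_depth > max_queue_depth:
        return 429          # pool is saturated: shed before spending tenant tokens
    if not bucket.try_acquire(now):
        return 429          # tenant exceeded its own rate
    return 200


bucket = TokenBucket(capacity=2, refill_per_sec=1.0, now=0.0)
print([admit(bucket, queue_depth=0, max_queue_depth=10, now=0.0) for _ in range(3)])
```

Passing `now` in explicitly keeps the logic testable and mirrors how a Redis script would receive the timestamp as an argument.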

5.5 Quantization Strategy

  • Weights: FP8 (or INT8) with per-channel scales — minimal accuracy loss on 70B.
  • KV cache: FP8 halves KV memory per token relative to FP16/BF16, which roughly doubles the batch size (or context length) that fits in the same memory.
  • Activations: stay BF16 to preserve accuracy.
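
To show what "per-channel scales" means, here is a pure-Python sketch of symmetric INT8 weight quantization with one scale per output channel (row). It illustrates the INT8 option only; FP8 uses a floating-point grid rather than integers, and real kernels operate on tensors, not nested lists. The function names are illustrative.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def quantize_per_channel(weights: Matrix) -> Tuple[List[List[int]], List[float]]:
    """Quantize each row to int8 range [-127, 127] with its own scale."""
    q_rows, scales = [], []
    for row in weights:
        max_abs = max(abs(w) for w in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(w / scale))) for w in row])
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows: List[List[int]], scales: List[float]) -> Matrix:
    """Recover approximate float weights from int8 values and per-row scales."""
    return [[q * scale for q in row] for row, scale in zip(q_rows, scales)]


weights = [[0.5, -1.0, 0.25],      # small-magnitude channel
           [10.0, -20.0, 5.0]]     # large-magnitude channel
q_rows, scales = quantize_per_channel(weights)
restored = dequantize(q_rows, scales)
for row, back in zip(weights, restored):
    print(max(abs(w - r) for w, r in zip(row, back)))
```

With one scale per row, the small-magnitude channel keeps its own fine-grained grid instead of inheriting the coarse step size that the large channel would force under a single per-tensor scale.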

6. Bottlenecks & Scaling

| Bottleneck                       | Symptom                    | Fix                                                  |
|----------------------------------|----------------------------|------------------------------------------------------|
| GPU memory (KV cache)            | OOM under high concurrency | PagedAttention + FP8 KV + smaller max_seqs           |
| Prefill latency on long contexts | High TTFT                  | Chunked prefill; prefix cache; speculative prefill   |
| Decode bound by memory bandwidth | Low GPU util but slow      | FP8 weights; speculative decoding; MoE routing       |
| Single backend hot-spotted       | Tail latency spikes        | Power-of-2-choices load balancing; circuit breaker   |
| Gateway CPU on JSON+SSE          | High CPU for proxy         | Write gateway in Go/Rust; zero-copy stream proxy     |
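
Power-of-2-choices load balancing, named in the table above, fits in a few lines. This sketch assumes the gateway tracks a queue depth per backend; the backend names are hypothetical, and a real balancer would update the depths as requests are dispatched and completed.

```python
import random
from typing import Dict


def pick_backend(queue_depths: Dict[str, int], rng: random.Random) -> str:
    """Sample two distinct backends at random and return the less loaded one."""
    names = sorted(queue_depths)
    if len(names) == 1:
        return names[0]
    first, second = rng.sample(names, 2)
    return first if queue_depths[first] <= queue_depths[second] else second


depths = {"gpu-a": 1, "gpu-b": 2, "gpu-c": 9}   # hypothetical backends
rng = random.Random(0)
picks = [pick_backend(depths, rng) for _ in range(1000)]
print({name: picks.count(name) for name in depths})
```

The hot backend is never chosen because it always loses its pairwise comparison, yet the gateway only inspects two queue depths per request instead of scanning the whole pool.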

7. Failure Modes

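  • Backend crash: health-check at /health every 1s; eject the backend and route new requests to peers; fail in-flight requests with 503 if nothing has been streamed yet, otherwise with an error event on the open stream (the HTTP status line has already been sent).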
  • OOM cascade: admission control with global token-budget; load-shed lowest-priority traffic.
  • Slow client (back-pressure): bounded outbound buffer; disconnect if buffer fills (the model keeps generating into the void otherwise).
  • Bad input (jailbreak / 1M-token DoS): max-context check at gateway, before reaching GPU.
  • Stuck batch (one request never returns): per-request deadline; preempt & evict.
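
The slow-client and stuck-request cases can share one small guard per stream. This is a sketch with hypothetical names; a real gateway would wire `on_token` to the backend reader, `drain` to the socket writer, and send a cancel to the backend when `on_token` returns False.

```python
from collections import deque
from typing import Deque, List, Optional


class StreamSession:
    """Per-request guard: bounded outbound buffer plus an absolute deadline."""

    def __init__(self, max_buffered_events: int, deadline: float) -> None:
        self.max_buffered_events = max_buffered_events
        self.deadline = deadline              # absolute time, same clock as `now`
        self.buffer: Deque[bytes] = deque()
        self.cancel_reason: Optional[str] = None

    def on_token(self, event: bytes, now: float) -> bool:
        """Buffer one event from the backend; False means cancel generation."""
        if self.cancel_reason is None:
            if now >= self.deadline:
                self.cancel_reason = "deadline_exceeded"
            elif len(self.buffer) >= self.max_buffered_events:
                self.cancel_reason = "slow_client"
            else:
                self.buffer.append(event)
        return self.cancel_reason is None

    def drain(self, max_events: int) -> List[bytes]:
        """Hand up to `max_events` buffered events to the client socket writer."""
        count = min(max_events, len(self.buffer))
        return [self.buffer.popleft() for _ in range(count)]


session = StreamSession(max_buffered_events=2, deadline=10.0)
results = [session.on_token(b"tok", now=0.1 * i) for i in range(3)]
print(results, session.cancel_reason)
```

Once either limit trips, the session stays cancelled, so the backend stops generating tokens that nobody will read.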

8. Observability

Metrics (every one labeled by model + tenant):

  • ttft_seconds_bucket (p50/p95/p99)
  • inter_token_latency_seconds_bucket
  • tokens_generated_total, tokens_prompt_total
  • batch_size, running_seqs, waiting_seqs
  • kv_cache_usage_bytes / kv_cache_total_bytes
  • gpu_utilization, gpu_memory_utilization
  • requests_total{status}, request_duration_seconds_bucket

Logs: structured JSON, sampled (1% success, 100% errors), with request_id.

Traces: OpenTelemetry from edge → gateway → backend; spans for prefill / each decode step.


9. Cost Model

Per million output tokens served (rough, 7B fp8 on H100):

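  • Compute: ~$0.20. This implies ~5.5k tok/s per H100 at $4/hr, or roughly $2/hr at the 3k tok/s assumed in Section 2; at $4/hr and 3k tok/s it would be ~$0.37.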
  • Memory bandwidth dominates → quantization is a direct $ savings
  • A $0.50/Mtok list price is a ≈2.5× markup over that compute cost; the margin covers reserved-instance overhead, idle capacity, networking
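
The compute figure can be sanity-checked from GPU price and per-GPU throughput. This sketch reuses the $4/hr and 3,000 tok/s assumptions from Section 2; the 5,556 tok/s input is simply the throughput at which $4/hr works out to about $0.20, not a measured number.

```python
def usd_per_million_tokens(usd_per_gpu_hour: float,
                           tokens_per_sec_per_gpu: float,
                           utilization: float = 1.0) -> float:
    """Compute cost per million tokens for one GPU at a given utilization."""
    tokens_per_hour = tokens_per_sec_per_gpu * 3600 * utilization
    return usd_per_gpu_hour / tokens_per_hour * 1_000_000


for tok_s in (3_000, 5_556):
    cost = usd_per_million_tokens(4.0, tok_s)
    print(f"{tok_s} tok/s at $4/hr: ${cost:.2f}/Mtok, "
          f"markup at $0.50 list price: {0.50 / cost:.1f}x")
```

At Section 2's 3,000 tok/s the cost is about $0.37/Mtok, so a ~$0.20 figure implies higher per-GPU throughput or a cheaper GPU-hour. The `utilization` parameter shows how idle capacity raises the effective cost.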

10. Tradeoffs & Alternatives

| Choice               | Alternative                 | When to switch                                                                |
|----------------------|-----------------------------|-------------------------------------------------------------------------------|
| vLLM                 | TensorRT-LLM                | When you need absolute peak throughput on NVIDIA & can pin to specific shapes |
| vLLM                 | TGI (Hugging Face)          | When tighter HF Hub integration matters more than raw perf                    |
| Self-host            | Bedrock / Vertex / Together | When you can't justify the GPU capex / on-call burden                         |
| FP8 weights          | INT4 (AWQ/GPTQ)             | When memory is the bottleneck and you accept slight quality loss              |
| Speculative decoding | Bigger batch                | When throughput per GPU matters more than per-request latency (offline or batch use) |
| Tensor parallelism   | Pipeline parallelism        | When the model spans multiple nodes; within a single node, TP has lower latency |

Bonus: 60-Second Pitch

"I'd put an Envoy edge for TLS/auth, a Go gateway for routing and admission, and pools of vLLM backends — one per model size. Continuous batching with PagedAttention gives ~5× throughput vs static; FP8 weights and KV-cache cut memory in half. Per-tenant Redis token-bucket prevents noisy-neighbor problems. Prefix caching eliminates redundant prefill on shared system prompts. Hot LoRA swap for tenant-specific fine-tunes. OTel from end to end, with TTFT and ITL as the headline SLOs. At 100k QPS we're talking ~70k H100s — so the next conversation is about model cascade, speculative decoding, and MoE to bring that number down."