🛸 Hitchhiker's Guide — Phase 9: Inference Optimization & Serving

Read this if: You can train a 7B model but you don't yet know what a KV cache is, why batch size matters at inference, what continuous batching is doing, why FlashAttention matters, or how vLLM achieves 5–20× the throughput of naïve model.generate(). This is the most economically valuable phase of the curriculum: a 30% throughput gain on 1000 H100s saves millions per year.


0. The 30-second mental model

LLM inference has two fundamentally different phases per request:

  1. Prefill (compute-bound): process the entire prompt in one pass. Computes the KV cache for every prompt token. FLOPs scale with prompt_length × params.
  2. Decode (memory-bandwidth-bound): generate one token at a time. Reads the entire model weights + KV cache from HBM each step. Memory bandwidth, not FLOPs, is the bottleneck.

Throughput-optimal serving is about (a) keeping the GPU busy with continuous batching so decode steps aggregate many requests' work, (b) shrinking the KV cache with PagedAttention / quantization / GQA so you can fit a bigger batch, and (c) reducing the steps per request via speculative decoding.

By the end of Phase 9 you should:

  • Compute KV-cache memory and prefill/decode FLOPs by hand for any model.
  • Implement a KV cache from scratch in PyTorch (the lab does this).
  • Explain PagedAttention, continuous batching, prefix caching, chunked prefill.
  • Explain FlashAttention's online softmax.
  • Compare quantization formats (INT8, FP8 E4M3/E5M2, INT4 AWQ/GPTQ).
  • Explain speculative decoding's accept-reject math.
  • Be able to operate vLLM, TGI, or SGLang in production.

1. The two phases of LLM inference

1.1 Prefill

For a prompt of L tokens, run a forward pass over all L positions in parallel (just like training). Output: logits at the last position (for the next-token sampler) and the full KV cache for all L positions.

  • FLOPs ≈ 2 × L × N_params. Linear in prompt length.
  • Compute-bound on modern GPUs.
  • Latency: ~50ms for 1k-token prompt on 7B BF16 / H100.

1.2 Decode

For each subsequent token, run a forward pass on just one position (the new token), reading the cached K and V from previous positions. Output: one set of logits.

  • FLOPs ≈ 2 × N_params per token. Tiny per token.
  • Memory-bandwidth-bound: must read all N_params weight bytes and the entire KV cache from HBM each step.
  • Latency: ~30ms per token on 7B BF16 / H100 batch=1.

1.3 The arithmetic-intensity argument

Arithmetic intensity = FLOPs / bytes-read. H100 SXM:

  • Compute: ~989 TFLOP/s (BF16).
  • HBM bandwidth: ~3.35 TB/s.
  • Crossover intensity: 989e12 / 3.35e12 ≈ 295 FLOP/byte to be compute-bound.

For batch=1 decode: each weight byte produces 2 FLOPs. Intensity = 2. We're 100× memory-bound.

For batch=128 decode: each weight byte is reused across 128 requests. Intensity = 256. Now we're approaching compute-bound. This is the entire reason continuous batching exists.


2. The KV cache

2.1 What it stores

For each layer, each request, each prior token: the K and V projections. Shape per request:

(n_layers, 2, n_kv_heads, seq_len, head_dim)

With 2 for K and V. With GQA, n_kv_heads < n_query_heads, shrinking the cache.

2.2 The math you must know cold

KV cache size in bytes per request:

$$ \text{bytes} = 2 \cdot n_{\text{layers}} \cdot n_{\text{kv heads}} \cdot d_{\text{head}} \cdot \text{seq len} \cdot \text{bytes per element} $$

Worked example: Llama-3 8B (32 layers, 8 KV heads, 128 head dim, BF16 = 2 bytes), seq_len = 8192:

$$ 2 \cdot 32 \cdot 8 \cdot 128 \cdot 8192 \cdot 2 \approx 1.07 \text{ GB per request} $$

One request at 8k context costs over a gig of HBM beyond the model weights. You will be quizzed on this calculation.

2.3 Implementation pattern

Each Block keeps a LayerCache with K and V tensors that grow per step:

class LayerCache:
    def __init__(self, max_seq, n_kv_heads, head_dim, dtype, device):
        self.K = torch.zeros(max_seq, n_kv_heads, head_dim, dtype=dtype, device=device)
        self.V = torch.zeros(max_seq, n_kv_heads, head_dim, dtype=dtype, device=device)
        self.length = 0

In attention:

  • Prefill (T > 1): write all K, V for positions [0, T). Compute attention over [0, T).
  • Decode (T = 1): append one K, V at position length. Compute Q against K[:length+1], V[:length+1].

Lab 01 implements this end-to-end. Compare generate_no_cache (recomputes everything every step — O(T²) work per step) vs generate_kv_cache (O(T) work per step). The speedup at length 256 is ~50×.

2.4 PagedAttention (vLLM, Kwon et al., 2023)

The naïve KV cache pre-allocates (max_seq, ...) per request. This wastes memory: short requests waste their tail. PagedAttention treats the KV cache like virtual memory:

  • Split into fixed-size blocks (e.g., 16 tokens each).
  • Allocate blocks on demand from a pool.
  • A block table maps logical token positions to physical block addresses.
  • Attention kernel reads via the block table (one extra indirection).

Wins:

  1. No internal fragmentation — only allocate what's used.
  2. Prefix sharing — multiple requests sharing a system prompt share the same physical blocks (copy-on-write semantics).
  3. 2–4× throughput vs naïve, because you can fit more concurrent requests.

2.5 Prefix caching

Common in chat APIs: many requests share a long system prompt. Hash the prompt's KV cache; reuse on the next request. vLLM's --enable-prefix-caching. Massive win for chat assistants with long instructions.


3. Continuous batching

3.1 The problem with static batching

Static batching: collect N requests, batch them, run prefill+decode together until all N finish. The slowest request blocks the entire batch from completing. GPU goes idle as faster requests finish but slots can't be refilled.

3.2 The continuous-batching solution (Yu et al., Orca, 2022)

Schedule at the token level. After every decode step, decisions are remade:

  • A request that just completed (hit EOS) frees its slot.
  • A new request waiting in the queue can be inserted mid-batch by adding its prefill on the side.
  • Different requests in the batch can be at different sequence positions — masking handles correctness.

Throughput improvement vs static batching: typically 3–10×. This is the algorithm that makes modern serving viable.

3.3 Chunked prefill

Long prompts have expensive prefills (compute-bound) that block decodes (memory-bound) from the rest of the batch. Chunked prefill splits a long prefill into chunks of chunk_size (e.g., 512 tokens) and interleaves them with decode steps from other requests, keeping both compute and memory utilized. Used by SGLang, recent vLLM versions.

3.4 Configuring vLLM

Two knobs that matter most:

  • max_num_seqs: max concurrent requests in the batch.
  • max_num_batched_tokens: total token budget per scheduling step (sum of chunk-prefill tokens + decode tokens).

Tune for your workload — chat (mostly decode) wants high max_num_seqs; batch-translation (mostly prefill) wants high max_num_batched_tokens.


4. FlashAttention

4.1 The problem

Standard attention materializes the (T, T) score matrix in HBM, then softmaxes, then matmuls with V. For T = 8192, that's a 256 MB tensor read+written per layer per head. Memory bandwidth is the bottleneck even when FLOPs would fit.

4.2 The trick — tiled, fused, online softmax

FlashAttention (Dao et al., 2022) tiles Q and K, V into blocks that fit in SRAM, and computes the softmax incrementally using the online softmax algorithm:

For each Q tile:
  initialize running max m = -inf, running denominator l = 0, running output o = 0
  For each K, V tile:
    s = q · k          # block of scores
    m_new = max(m, max(s))
    correction = exp(m - m_new)
    l = l * correction + sum(exp(s - m_new))
    o = o * correction + exp(s - m_new) @ v
    m = m_new
  o = o / l

The full (T, T) matrix is never materialized. HBM reads/writes drop ~10×. Wall-clock attention drops 2–4×. Memory drops from O(T²) to O(T).

4.3 v1 → v2 → v3

  • v1 (2022): the original. Fused softmax + matmul.
  • v2 (2023): better work partitioning across SMs; ~2× faster than v1.
  • v3 (2024): Hopper-specific (TMA, FP8 paths, asynchronous copy). ~1.5× faster than v2 on H100.

4.4 PyTorch integration

torch.nn.functional.scaled_dot_product_attention dispatches to FlashAttention v2 when shapes/dtype permit. Always use this in modern code rather than hand-rolling.


5. Quantization

5.1 Why quantize

Smaller weights → less HBM bandwidth required per decode step → faster decode. Also lets you fit bigger models on smaller GPUs.

5.2 Datatypes

FormatBitsNotes
FP1616Training default for older models
BF1616Training default modern
INT8 (W8A8)8Weights and activations both quantized
FP8 E4M38H100+; small range, more precision; for weights/activations
FP8 E5M28Wider range, lower precision; for gradients
INT4 (W4A16)4Weights only; activations stay BF16. AWQ/GPTQ
NF44Information-optimal for normal-distributed weights (QLoRA)
INT2/Ternary2Aggressive; quality drops

5.3 PTQ vs QAT

  • Post-Training Quantization (PTQ): quantize a trained model with a small calibration set (~512 samples). Fast. Easy. GPTQ and AWQ are PTQ.
  • Quantization-Aware Training (QAT): simulate quantization during training so weights adapt. More expensive, sometimes higher quality.

5.4 GPTQ vs AWQ

  • GPTQ (Frantar et al., 2022): row-by-row second-order quantization minimizing reconstruction error. Slightly slower to apply, slightly better accuracy.
  • AWQ (Lin et al., 2023): observes that ~1% of weights matter most for output. Scales those weights up before quantization (with a corresponding inverse scale on activations). Good balance.

5.5 What breaks at 4-bit

  • The LM head is sensitive — keep it in BF16/FP16.
  • Embedding layer sometimes too.
  • For very small models (<1B), 4-bit quality degrades faster than for large models.

5.6 Smoothing for FP8

FP8 (especially E4M3 with mantissa precision 3) needs careful per-tensor or per-block scaling. Tools: NVIDIA transformer_engine. Active research area.


6. Speculative decoding

6.1 The trick

Decode is sequential — one token per forward pass. What if a smaller draft model could propose k tokens cheaply, and the big target model verifies them all in one parallel forward pass?

6.2 The accept/reject algorithm (Leviathan et al., 2023; Chen et al., 2023)

Draft proposes tokens t_1, …, t_k with probabilities q(t_i | context). Target evaluates the same positions in parallel, getting p(t_i | context).

For each i, accept t_i with probability:

$$ \min!\left(1, \frac{p(t_i)}{q(t_i)}\right) $$

If rejected: sample a replacement from the residual distribution max(0, p - q)/(1 - sum_acc) and stop. This procedure is provably equivalent to sampling from the target — same distribution, same temperature semantics.

Best case: all k tokens accepted → k tokens generated for ~1 target forward. Speedup ~2–3× wall clock when draft is well-aligned with target.

6.3 Variants

  • Medusa (Cai et al., 2024) — train extra "Medusa heads" on the target model itself to predict multiple future tokens. No separate draft model.
  • EAGLE / EAGLE-2 (Li et al., 2024) — autoregressive draft head trained to mimic target hidden states; better acceptance rates.
  • Lookahead decoding — n-gram heuristics; no extra training.

6.4 When does it help?

  • Most beneficial at batch=1 (single user, low concurrency).
  • Diminishing returns at high batch size — your batch is already amortizing the memory bandwidth cost.
  • Doesn't help at all if draft is poorly aligned (acceptance rate too low).

7. Other inference tricks

7.1 Tensor / Pipeline parallelism for inference

If a model doesn't fit on one GPU, shard it across several:

  • Tensor parallel (TP): split each weight matrix across GPUs (column-parallel for QKV+gate+up; row-parallel for O+down). Requires AllReduce per layer. Use within a single node (NVLink). vLLM --tensor-parallel-size 8.
  • Pipeline parallel (PP): assign different layers to different GPUs. Useful across nodes but introduces bubble overhead.

7.2 Speculative batching, multi-LoRA

  • Serve many LoRA adapters from a single base model, switching per request (Punica, S-LoRA).
  • Useful for per-tenant fine-tunes.

7.3 Streaming output

Always stream tokens to the client. Improves perceived latency dramatically. SSE or WebSockets at the gateway.

7.4 Long-context tricks

  • YaRN / NTK-aware RoPE scaling — extend a 4k-trained model to 32k+ at inference (covered Phase 4).
  • Sliding window attention (Mistral 7B) — only attend to last W tokens; bounded KV cache.
  • Ring Attention — distributed sequence parallelism for million-token contexts.

8. The lab walkthrough (lab-01-kv-cache-from-scratch)

8.1 What you'll build

A self-contained mini-GPT (similar to Phase 4) plus:

  • A LayerCache with K and V buffers and a length counter.
  • Modified CausalSelfAttention that accepts an optional cache; if present, appends new K, V at length, computes attention against the prefix [0, length+1).
  • generate_no_cache(model, prompt, max_new) — naive baseline that re-runs the full forward over the growing context every step.
  • generate_kv_cache(model, prompt, max_new) — uses the cache; prefill once, then incremental decode.
  • A timing benchmark comparing the two.

8.2 Things to read carefully

  • Why is the prefill case T > 1 and incremental decode T = 1? Because at prefill we have the whole prompt, at decode just the new token.
  • The new K, V positions are computed at indices [length, length + T).
  • The Q for position length attends to K at [0, length+1) — a causal mask is not needed at decode (only one query, all keys are by construction prior).
  • Why is batching multiple requests with a unified KV cache hard in pure PyTorch? (Different lengths — you'd need padding or PagedAttention. Why vLLM exists.)

8.3 Expected numbers

For a 6-layer, 384-dim model on a 4090:

  • 256 tokens generated:
  • generate_no_cache: ~3000ms (work scales as O(T²)).
  • generate_kv_cache: ~80ms (O(T)).
  • Speedup: ~40×.

8.4 Optional extensions

  • Add prefix sharing — two requests with the same prompt prefix share the same cache rows.
  • Implement a block-paged KV cache.
  • Implement temperature + top-p sampling.
  • Add a tiny draft model and implement speculative decoding against the target.

9. Production-grade serving stacks

9.1 vLLM (UC Berkeley → community)

The dominant open-source LLM server in 2025. PagedAttention, continuous batching, prefix caching, speculative decoding, multi-LoRA, OpenAI-compatible API. Read the vLLM source code — it's the best free curriculum for inference engineering.

9.2 TGI (Hugging Face)

Text Generation Inference. Production-tested at HF scale. Slightly less feature velocity than vLLM but rock-solid; Rust scheduler with Python model code.

9.3 SGLang (LMSys)

Newer; aggressive on chunked prefill and structured generation (regex/JSON-constrained decoding). RadixAttention for prefix-cache reuse across many short requests.

9.4 TensorRT-LLM (NVIDIA)

NVIDIA's optimized inference engine. Best raw performance on NVIDIA hardware via custom CUDA kernels and FP8 paths. More complex to operate.

9.5 llama.cpp / ggml

CPU and consumer-GPU inference; INT4/INT8 quantized; runs on Macs, phones, edge. The dominant local inference stack.

9.6 vLLM's contribution to the community

vLLM open-sourced the production-grade reference implementation of PagedAttention; this single contribution may be worth tens of millions in industry-wide compute savings.


10. References

Required:

  • Kwon et al. (2023), Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM paper).
  • Yu et al. (2022), Orca: A Distributed Serving System for Transformer-Based Generative Models (continuous batching).
  • Dao et al. (2022), FlashAttention. Dao (2023), FlashAttention-2. Shah et al. (2024), FlashAttention-3.
  • Leviathan et al. (2023), Fast Inference from Transformers via Speculative Decoding.
  • Chen et al. (2023), Accelerating Large Language Model Decoding with Speculative Sampling (DeepMind).
  • Frantar et al. (2022), GPTQ.
  • Lin et al. (2023), AWQ: Activation-aware Weight Quantization.

Important:

  • Pope et al. (2022), Efficiently Scaling Transformer Inference (Google) — the canonical compute/memory analysis.
  • Cai et al. (2024), Medusa.
  • Li et al. (2024), EAGLE / EAGLE-2.
  • The vLLM source code, especially vllm/core/scheduler.py and vllm/attention/.
  • HuggingFace's Optimizing LLM Inference blog series.

11. Common interview questions on Phase 9 material

  1. Compute the KV cache size for Llama-3 8B at 8k context.
  2. Why is decode memory-bandwidth-bound?
  3. Walk through PagedAttention. What problem does it solve?
  4. Walk through continuous batching. Why is it 5×+ vs static batching?
  5. Explain FlashAttention's online softmax.
  6. Implement scaled dot-product attention with KV cache for an autoregressive model.
  7. Difference between FP8 E4M3 and E5M2; when do you use each?
  8. Compare GPTQ vs AWQ.
  9. Speculative decoding's accept-reject probability — derive it.
  10. When does speculative decoding fail to help?
  11. Compare TP vs PP for inference.
  12. Design an LLM gateway for 100k QPS. (Bridges to system-design folder.)
  13. Why is prefix caching huge for chat APIs?
  14. Your decode latency is fine but throughput is low. What do you change?
  15. Sketch how you'd serve 100 different LoRA adapters on a single base model.

12. From solid → exceptional

  • Implement KV cache, then add PagedAttention in pure PyTorch. Batch across two requests of different lengths. Confirm memory savings.
  • Implement online softmax FlashAttention in CUDA / Triton. Triton makes this approachable.
  • Run vLLM on a 7B model; benchmark throughput vs naïve model.generate(). Aim to reproduce ~5× speedup numbers.
  • Quantize a 7B model with AWQ; benchmark BF16 vs INT4 throughput.
  • Implement speculative decoding with a 1B draft + 7B target; measure acceptance rate and wall-clock speedup at batch=1.
  • Read the entire vLLM scheduler.py and write a one-page explanation.
  • Build a gateway (the system-design exercise) — Go or Python — with OpenAI-compatible API, request queuing, multi-replica routing, prefix-aware routing.
  • Profile a real LLM forward pass with NVIDIA Nsight; identify the top 5 kernels by time.

DayActivity
MonRead PagedAttention paper + Pope et al. Efficiently Scaling Inference
TueRead FlashAttention-2 paper
WedLab 01 — implement KV cache; benchmark 40× speedup
ThuRead vLLM scheduler source; install vLLM; serve a 7B model
FriRead speculative decoding papers; implement on toy models
SatQuantize a 7B with AWQ; benchmark throughput
SunMock interview the 15 questions; whiteboard PagedAttention