🛸 Hitchhiker's Guide — Phase 9: Inference Optimization & Serving

Read this if: You can train a 7B model but you don't yet know what a KV cache is, why batch size matters at inference, what continuous batching is doing, why FlashAttention matters, or how vLLM achieves 5–20× the throughput of naïve model.generate(). This is the most economically valuable phase of the curriculum: a 30% throughput gain on 1000 H100s saves millions per year.

0. The 30-second mental model

LLM inference has two fundamentally different phases per request:

Prefill (compute-bound): process the entire prompt in one pass. Computes the KV cache for every prompt token. FLOPs scale with prompt_length × params.
Decode (memory-bandwidth-bound): generate one token at a time. Reads the entire model weights + KV cache from HBM each step. Memory bandwidth, not FLOPs, is the bottleneck.

Throughput-optimal serving is about (a) keeping the GPU busy with continuous batching so decode steps aggregate many requests' work, (b) shrinking the KV cache with PagedAttention / quantization / GQA so you can fit a bigger batch, and (c) reducing the steps per request via speculative decoding.

By the end of Phase 9 you should:

Compute KV-cache memory and prefill/decode FLOPs by hand for any model.
Implement a KV cache from scratch in PyTorch (the lab does this).
Explain PagedAttention, continuous batching, prefix caching, chunked prefill.
Explain FlashAttention's online softmax.
Compare quantization formats (INT8, FP8 E4M3/E5M2, INT4 AWQ/GPTQ).
Explain speculative decoding's accept-reject math.
Be able to operate vLLM, TGI, or SGLang in production.

1. The two phases of LLM inference

1.1 Prefill

For a prompt of L tokens, run a forward pass over all L positions in parallel (just like training). Output: logits at the last position (for the next-token sampler) and the full KV cache for all L positions.

FLOPs ≈ 2 × L × N_params. Linear in prompt length.
Compute-bound on modern GPUs.
Latency: ~50ms for 1k-token prompt on 7B BF16 / H100.

1.2 Decode

For each subsequent token, run a forward pass on just one position (the new token), reading the cached K and V from previous positions. Output: one set of logits.

FLOPs ≈ 2 × N_params per token. Tiny per token.
Memory-bandwidth-bound: must read all N_params weight bytes and the entire KV cache from HBM each step.
Latency: ~30ms per token on 7B BF16 / H100 batch=1.

1.3 The arithmetic-intensity argument

Arithmetic intensity = FLOPs / bytes-read. H100 SXM:

Compute: ~989 TFLOP/s (BF16).
HBM bandwidth: ~3.35 TB/s.
Crossover intensity: 989e12 / 3.35e12 ≈ 295 FLOP/byte to be compute-bound.

For batch=1 decode: each weight byte produces 2 FLOPs. Intensity = 2. We're 100× memory-bound.

For batch=128 decode: each weight byte is reused across 128 requests. Intensity = 256. Now we're approaching compute-bound. This is the entire reason continuous batching exists.

2. The KV cache

2.1 What it stores

For each layer, each request, each prior token: the K and V projections. Shape per request:

(n_layers, 2, n_kv_heads, seq_len, head_dim)

With 2 for K and V. With GQA, n_kv_heads < n_query_heads, shrinking the cache.

2.2 The math you must know cold

KV cache size in bytes per request:

$$ \text{bytes} = 2 \cdot n_{\text{layers}} \cdot n_{\text{kv heads}} \cdot d_{\text{head}} \cdot \text{seq len} \cdot \text{bytes per element} $$

Worked example: Llama-3 8B (32 layers, 8 KV heads, 128 head dim, BF16 = 2 bytes), seq_len = 8192:

$$ 2 \cdot 32 \cdot 8 \cdot 128 \cdot 8192 \cdot 2 \approx 1.07 \text{ GB per request} $$

One request at 8k context costs over a gig of HBM beyond the model weights. You will be quizzed on this calculation.

2.3 Implementation pattern

Each Block keeps a LayerCache with K and V tensors that grow per step:

class LayerCache:
    def __init__(self, max_seq, n_kv_heads, head_dim, dtype, device):
        self.K = torch.zeros(max_seq, n_kv_heads, head_dim, dtype=dtype, device=device)
        self.V = torch.zeros(max_seq, n_kv_heads, head_dim, dtype=dtype, device=device)
        self.length = 0

In attention:

Prefill (T > 1): write all K, V for positions [0, T). Compute attention over [0, T).
Decode (T = 1): append one K, V at position length. Compute Q against K[:length+1], V[:length+1].

Lab 01 implements this end-to-end. Compare generate_no_cache (recomputes everything every step — O(T²) work per step) vs generate_kv_cache (O(T) work per step). The speedup at length 256 is ~50×.

2.4 PagedAttention (vLLM, Kwon et al., 2023)

The naïve KV cache pre-allocates (max_seq, ...) per request. This wastes memory: short requests waste their tail. PagedAttention treats the KV cache like virtual memory:

Split into fixed-size blocks (e.g., 16 tokens each).
Allocate blocks on demand from a pool.
A block table maps logical token positions to physical block addresses.
Attention kernel reads via the block table (one extra indirection).

Wins:

No internal fragmentation — only allocate what's used.
Prefix sharing — multiple requests sharing a system prompt share the same physical blocks (copy-on-write semantics).
2–4× throughput vs naïve, because you can fit more concurrent requests.

2.5 Prefix caching

Common in chat APIs: many requests share a long system prompt. Hash the prompt's KV cache; reuse on the next request. vLLM's --enable-prefix-caching. Massive win for chat assistants with long instructions.

3. Continuous batching

3.1 The problem with static batching

Static batching: collect N requests, batch them, run prefill+decode together until all N finish. The slowest request blocks the entire batch from completing. GPU goes idle as faster requests finish but slots can't be refilled.

3.2 The continuous-batching solution (Yu et al., Orca, 2022)

Schedule at the token level. After every decode step, decisions are remade:

A request that just completed (hit EOS) frees its slot.
A new request waiting in the queue can be inserted mid-batch by adding its prefill on the side.
Different requests in the batch can be at different sequence positions — masking handles correctness.

Throughput improvement vs static batching: typically 3–10×. This is the algorithm that makes modern serving viable.

3.3 Chunked prefill

Long prompts have expensive prefills (compute-bound) that block decodes (memory-bound) from the rest of the batch. Chunked prefill splits a long prefill into chunks of chunk_size (e.g., 512 tokens) and interleaves them with decode steps from other requests, keeping both compute and memory utilized. Used by SGLang, recent vLLM versions.

3.4 Configuring vLLM

Two knobs that matter most:

max_num_seqs: max concurrent requests in the batch.
max_num_batched_tokens: total token budget per scheduling step (sum of chunk-prefill tokens + decode tokens).

Tune for your workload — chat (mostly decode) wants high max_num_seqs; batch-translation (mostly prefill) wants high max_num_batched_tokens.

4. FlashAttention

4.1 The problem

Standard attention materializes the (T, T) score matrix in HBM, then softmaxes, then matmuls with V. For T = 8192, that's a 256 MB tensor read+written per layer per head. Memory bandwidth is the bottleneck even when FLOPs would fit.

4.2 The trick — tiled, fused, online softmax

FlashAttention (Dao et al., 2022) tiles Q and K, V into blocks that fit in SRAM, and computes the softmax incrementally using the online softmax algorithm:

For each Q tile:
  initialize running max m = -inf, running denominator l = 0, running output o = 0
  For each K, V tile:
    s = q · k          # block of scores
    m_new = max(m, max(s))
    correction = exp(m - m_new)
    l = l * correction + sum(exp(s - m_new))
    o = o * correction + exp(s - m_new) @ v
    m = m_new
  o = o / l

The full (T, T) matrix is never materialized. HBM reads/writes drop ~10×. Wall-clock attention drops 2–4×. Memory drops from O(T²) to O(T).

4.3 v1 → v2 → v3

v1 (2022): the original. Fused softmax + matmul.
v2 (2023): better work partitioning across SMs; ~2× faster than v1.
v3 (2024): Hopper-specific (TMA, FP8 paths, asynchronous copy). ~1.5× faster than v2 on H100.

4.4 PyTorch integration

torch.nn.functional.scaled_dot_product_attention dispatches to FlashAttention v2 when shapes/dtype permit. Always use this in modern code rather than hand-rolling.

5. Quantization

5.1 Why quantize

Smaller weights → less HBM bandwidth required per decode step → faster decode. Also lets you fit bigger models on smaller GPUs.

5.2 Datatypes

Format	Bits	Notes
FP16	16	Training default for older models
BF16	16	Training default modern
INT8 (W8A8)	8	Weights and activations both quantized
FP8 E4M3	8	H100+; small range, more precision; for weights/activations
FP8 E5M2	8	Wider range, lower precision; for gradients
INT4 (W4A16)	4	Weights only; activations stay BF16. AWQ/GPTQ
NF4	4	Information-optimal for normal-distributed weights (QLoRA)
INT2/Ternary	2	Aggressive; quality drops

5.3 PTQ vs QAT

Post-Training Quantization (PTQ): quantize a trained model with a small calibration set (~512 samples). Fast. Easy. GPTQ and AWQ are PTQ.
Quantization-Aware Training (QAT): simulate quantization during training so weights adapt. More expensive, sometimes higher quality.

5.4 GPTQ vs AWQ

GPTQ (Frantar et al., 2022): row-by-row second-order quantization minimizing reconstruction error. Slightly slower to apply, slightly better accuracy.
AWQ (Lin et al., 2023): observes that ~1% of weights matter most for output. Scales those weights up before quantization (with a corresponding inverse scale on activations). Good balance.

5.5 What breaks at 4-bit

The LM head is sensitive — keep it in BF16/FP16.
Embedding layer sometimes too.
For very small models (<1B), 4-bit quality degrades faster than for large models.

5.6 Smoothing for FP8

FP8 (especially E4M3 with mantissa precision 3) needs careful per-tensor or per-block scaling. Tools: NVIDIA transformer_engine. Active research area.

6. Speculative decoding

6.1 The trick

Decode is sequential — one token per forward pass. What if a smaller draft model could propose k tokens cheaply, and the big target model verifies them all in one parallel forward pass?

6.2 The accept/reject algorithm (Leviathan et al., 2023; Chen et al., 2023)

Draft proposes tokens t_1, …, t_k with probabilities q(t_i | context). Target evaluates the same positions in parallel, getting p(t_i | context).

For each i, accept t_i with probability:

$$ \min!\left(1, \frac{p(t_i)}{q(t_i)}\right) $$

If rejected: sample a replacement from the residual distribution max(0, p - q)/(1 - sum_acc) and stop. This procedure is provably equivalent to sampling from the target — same distribution, same temperature semantics.

Best case: all k tokens accepted → k tokens generated for ~1 target forward. Speedup ~2–3× wall clock when draft is well-aligned with target.

6.3 Variants

Medusa (Cai et al., 2024) — train extra "Medusa heads" on the target model itself to predict multiple future tokens. No separate draft model.
EAGLE / EAGLE-2 (Li et al., 2024) — autoregressive draft head trained to mimic target hidden states; better acceptance rates.
Lookahead decoding — n-gram heuristics; no extra training.

6.4 When does it help?

Most beneficial at batch=1 (single user, low concurrency).
Diminishing returns at high batch size — your batch is already amortizing the memory bandwidth cost.
Doesn't help at all if draft is poorly aligned (acceptance rate too low).

7. Other inference tricks

7.1 Tensor / Pipeline parallelism for inference

If a model doesn't fit on one GPU, shard it across several:

Tensor parallel (TP): split each weight matrix across GPUs (column-parallel for QKV+gate+up; row-parallel for O+down). Requires AllReduce per layer. Use within a single node (NVLink). vLLM --tensor-parallel-size 8.
Pipeline parallel (PP): assign different layers to different GPUs. Useful across nodes but introduces bubble overhead.

7.2 Speculative batching, multi-LoRA

Serve many LoRA adapters from a single base model, switching per request (Punica, S-LoRA).
Useful for per-tenant fine-tunes.

7.3 Streaming output

Always stream tokens to the client. Improves perceived latency dramatically. SSE or WebSockets at the gateway.

7.4 Long-context tricks

YaRN / NTK-aware RoPE scaling — extend a 4k-trained model to 32k+ at inference (covered Phase 4).
Sliding window attention (Mistral 7B) — only attend to last W tokens; bounded KV cache.
Ring Attention — distributed sequence parallelism for million-token contexts.

8. The lab walkthrough (lab-01-kv-cache-from-scratch)

8.1 What you'll build

A self-contained mini-GPT (similar to Phase 4) plus:

A LayerCache with K and V buffers and a length counter.
Modified CausalSelfAttention that accepts an optional cache; if present, appends new K, V at length, computes attention against the prefix [0, length+1).
generate_no_cache(model, prompt, max_new) — naive baseline that re-runs the full forward over the growing context every step.
generate_kv_cache(model, prompt, max_new) — uses the cache; prefill once, then incremental decode.
A timing benchmark comparing the two.

8.2 Things to read carefully

Why is the prefill case T > 1 and incremental decode T = 1? Because at prefill we have the whole prompt, at decode just the new token.
The new K, V positions are computed at indices [length, length + T).
The Q for position length attends to K at [0, length+1) — a causal mask is not needed at decode (only one query, all keys are by construction prior).
Why is batching multiple requests with a unified KV cache hard in pure PyTorch? (Different lengths — you'd need padding or PagedAttention. Why vLLM exists.)

8.3 Expected numbers

For a 6-layer, 384-dim model on a 4090:

256 tokens generated:
generate_no_cache: ~3000ms (work scales as O(T²)).
generate_kv_cache: ~80ms (O(T)).
Speedup: ~40×.

8.4 Optional extensions

Add prefix sharing — two requests with the same prompt prefix share the same cache rows.
Implement a block-paged KV cache.
Implement temperature + top-p sampling.
Add a tiny draft model and implement speculative decoding against the target.

Kwon et al. (2023), Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM paper).
Yu et al. (2022), Orca: A Distributed Serving System for Transformer-Based Generative Models (continuous batching).
Dao et al. (2022), FlashAttention. Dao (2023), FlashAttention-2. Shah et al. (2024), FlashAttention-3.
Leviathan et al. (2023), Fast Inference from Transformers via Speculative Decoding.
Chen et al. (2023), Accelerating Large Language Model Decoding with Speculative Sampling (DeepMind).
Frantar et al. (2022), GPTQ.
Lin et al. (2023), AWQ: Activation-aware Weight Quantization.

Important:

Pope et al. (2022), Efficiently Scaling Transformer Inference (Google) — the canonical compute/memory analysis.
Cai et al. (2024), Medusa.
Li et al. (2024), EAGLE / EAGLE-2.
The vLLM source code, especially vllm/core/scheduler.py and vllm/attention/.
HuggingFace's Optimizing LLM Inference blog series.

11. Common interview questions on Phase 9 material

Compute the KV cache size for Llama-3 8B at 8k context.
Why is decode memory-bandwidth-bound?
Walk through PagedAttention. What problem does it solve?
Walk through continuous batching. Why is it 5×+ vs static batching?
Explain FlashAttention's online softmax.
Implement scaled dot-product attention with KV cache for an autoregressive model.
Difference between FP8 E4M3 and E5M2; when do you use each?
Compare GPTQ vs AWQ.
Speculative decoding's accept-reject probability — derive it.
When does speculative decoding fail to help?
Compare TP vs PP for inference.
Design an LLM gateway for 100k QPS. (Bridges to system-design folder.)
Why is prefix caching huge for chat APIs?
Your decode latency is fine but throughput is low. What do you change?
Sketch how you'd serve 100 different LoRA adapters on a single base model.

12. From solid → exceptional

Implement KV cache, then add PagedAttention in pure PyTorch. Batch across two requests of different lengths. Confirm memory savings.
Implement online softmax FlashAttention in CUDA / Triton. Triton makes this approachable.
Run vLLM on a 7B model; benchmark throughput vs naïve model.generate(). Aim to reproduce ~5× speedup numbers.
Quantize a 7B model with AWQ; benchmark BF16 vs INT4 throughput.
Implement speculative decoding with a 1B draft + 7B target; measure acceptance rate and wall-clock speedup at batch=1.
Read the entire vLLM scheduler.py and write a one-page explanation.
Build a gateway (the system-design exercise) — Go or Python — with OpenAI-compatible API, request queuing, multi-replica routing, prefix-aware routing.
Profile a real LLM forward pass with NVIDIA Nsight; identify the top 5 kernels by time.

13. Recommended cadence

Day	Activity
Mon	Read PagedAttention paper + Pope et al. Efficiently Scaling Inference
Tue	Read FlashAttention-2 paper
Wed	Lab 01 — implement KV cache; benchmark 40× speedup
Thu	Read vLLM scheduler source; install vLLM; serve a 7B model
Fri	Read speculative decoding papers; implement on toy models
Sat	Quantize a 7B with AWQ; benchmark throughput
Sun	Mock interview the 15 questions; whiteboard PagedAttention

LLM Inference Engineer