Phase 9 — Inference Optimization & Serving

Difficulty: ⭐⭐⭐⭐⭐ | Estimated Time: 2.5 weeks

Roles supported: LLM Inference Engineer, ML Systems Engineer, Performance Engineer. Highest-leverage phase for infrastructure roles.

Why This Phase Exists

The "LLM Inference Engineer" role exists because serving LLMs is fundamentally different from serving classical ML models: KV-cache memory grows with sequence length, requests in the same batch finish at different times, and a single bad scheduling decision can 5× your cost. Companies pay senior salaries for engineers who can move TTFT from 800 ms to 200 ms.

This phase is where your distributed-systems background pays the highest dividend.

Concepts

Decode loop anatomy: prefill vs decode phases
KV-cache memory math: 2 × n_layers × n_kv_heads × d_head × seq_len × batch × dtype_bytes (n_kv_heads equals n_heads for standard multi-head attention and is smaller under GQA/MQA)
Memory layout: contiguous vs paged (vLLM PagedAttention)
Static vs dynamic vs continuous batching
Request scheduling: FCFS, length-based, fairness
Prefix caching (system-prompt sharing)
Quantization:
- INT8 weight-only (bitsandbytes)
- INT4 GPTQ (group-wise, with calibration)
- INT4 AWQ (activation-aware weight quantization)
- NF4 (normal float, used by QLoRA)
- FP8 (requires hardware support, e.g. H100)
Speculative decoding: draft model + verify (math of expected speedup)
Medusa heads, Lookahead decoding (overview)
FlashAttention-2 / FlashAttention-3 — what they fuse and why it wins
CUDA graphs for low-latency decode
TensorRT-LLM (overview)
Streaming via SSE / WebSocket
TTFT (time to first token) vs TPOT (time per output token) vs throughput (and why optimizing one can hurt others)
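
As a worked instance of the KV-cache formula in the list above, the sketch below evaluates it for one model shape. The shape (32 layers, 8 KV heads, head dimension 128, FP16) is an assumption based on the published Llama-3-8B configuration, and the function name `kv_cache_bytes` is illustrative. It counts KV heads rather than query heads, which is what matters for grouped-query models.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, dtype_bytes=2):
    """Bytes held by keys and values across all layers (the leading 2 is K plus V)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * dtype_bytes

# Assumed Llama-3-8B shape: 32 layers, 8 KV heads (GQA), d_head = 128, FP16 cache.
llama3_8b = dict(n_layers=32, n_kv_heads=8, d_head=128)

per_token = kv_cache_bytes(**llama3_8b, seq_len=1, batch=1)
total = kv_cache_bytes(**llama3_8b, seq_len=8192, batch=32)
# Same shape if every one of the 32 query heads kept its own K/V (plain MHA).
mha_total = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=8192, batch=32)

print(f"per token:          {per_token / 2**10:.0f} KiB")   # 128 KiB
print(f"seq=8192, batch=32: {total / 2**30:.0f} GiB")       # 32 GiB
print(f"with 32 KV heads:   {mha_total / 2**30:.0f} GiB")   # 128 GiB
```

Under these assumptions the cache is roughly twice the size of the FP16 weights (about 16 GB for 8B parameters), which is the regime where paged layouts and GQA start to matter.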

Labs

Lab 01 — KV-Cache From Scratch + Memory Math

Field	Value
Goal	Add a KV-cache to your Phase 4 transformer; verify decode speedup; compute memory exactly.
Concepts	Why KV-cache (avoid recomputing past attention), memory budget, when KV-cache > parameters.
Steps	1) Add `past_key_values` to `MultiHeadAttention.forward`. 2) Make decode work step-by-step. 3) Benchmark generation latency with/without cache. 4) Compute KV-cache bytes for Llama-3-8B at seq=8192, batch=32.
Stack	PyTorch (your Phase 4 code)
Output	KV-cached generation function + a memory-math worksheet.
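How to Test	Greedy-decoded tokens are identical with and without the cache (logits match exactly, or within floating-point tolerance if the two paths use different kernels); latency drops by ≥ 10× on long sequences.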
Talking Points	When KV-cache becomes the dominant memory consumer (long context, batch >> 1). The motivation for paged attention.
Resume Bullet	"Implemented KV-cache for a from-scratch decoder transformer; verified bit-equivalent outputs and 14× decode speedup at 1024-token contexts; produced exact memory-budget calculation for production-scale deployments."
Extensions	Implement Grouped-Query Attention (KV-cache shrinks by n_heads / n_kv_heads, e.g. 4× when 32 query heads share 8 KV heads).
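
The equivalence check in this lab can be sketched without any framework. The example below is a minimal single-head attention in pure Python, not the Phase 4 `MultiHeadAttention` code; all names (`decode_no_cache`, `decode_with_cache`, the weight matrices) are hypothetical. It compares recomputing K and V for the whole prefix at every step against appending one K/V pair per step.

```python
import math
import random

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all visible positions."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum(e * v[j] for e, v in zip(exps, values)) / z
            for j in range(len(values[0]))]

def decode_no_cache(xs, Wq, Wk, Wv):
    """Re-project the whole prefix at every step: O(T^2) projections in total."""
    outs = []
    for t in range(len(xs)):
        keys = [matvec(Wk, x) for x in xs[: t + 1]]
        values = [matvec(Wv, x) for x in xs[: t + 1]]
        outs.append(attend(matvec(Wq, xs[t]), keys, values))
    return outs

def decode_with_cache(xs, Wq, Wk, Wv):
    """Project each token once and keep its K/V: O(T) projections in total."""
    keys, values, outs = [], [], []
    for x in xs:
        keys.append(matvec(Wk, x))
        values.append(matvec(Wv, x))
        outs.append(attend(matvec(Wq, x), keys, values))
    return outs

def rand_mat(d):
    return [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]

random.seed(0)
d = 8
Wq, Wk, Wv = rand_mat(d), rand_mat(d), rand_mat(d)
xs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(16)]

full = decode_no_cache(xs, Wq, Wk, Wv)
cached = decode_with_cache(xs, Wq, Wk, Wv)
print("identical outputs:", full == cached)
```

Both paths here run the same arithmetic in the same order, so the outputs compare equal exactly. On a GPU the cached and uncached paths may dispatch to different kernels, so a tolerance-based comparison such as `torch.allclose` is a reasonable fallback.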

Lab 02 — Quantization: INT8 / INT4 / GPTQ / AWQ

Field	Value
Goal	Quantize an 8B model (Llama-3-8B) four ways; measure quality vs memory vs latency.
Concepts	Weight-only vs activation quantization, calibration sets, group-size effects, accuracy degradation.
Steps	1) Load Llama-3-8B. 2) Apply: bitsandbytes INT8, bitsandbytes NF4, GPTQ INT4 (`auto-gptq`), AWQ INT4 (`autoawq`). 3) Measure VRAM, decode tok/s, and MMLU on each.
Stack	`bitsandbytes`, `auto-gptq`, `autoawq`, `transformers`, your Phase 8 eval harness
Output	A 4×3 table: VRAM, throughput, MMLU.
How to Test	INT4 weights should fit in ~5 GB of VRAM; MMLU drop < 2 points for AWQ relative to an unquantized FP16 baseline.
Talking Points	Why AWQ tends to beat GPTQ on instruction-following models. Why activation outliers make naive INT8 hard. The role of calibration data.
Resume Bullet	"Quantized Llama-3-8B four ways (INT8, NF4, GPTQ-INT4, AWQ-INT4), producing a quality/throughput/memory tradeoff table: AWQ achieved 4.8 GB VRAM and 87 tok/s on a 4090 with <1.5-point MMLU drop."
Extensions	Try FP8 on H100; compare against SmoothQuant.
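
To build intuition for the group-size effect before running the real libraries, here is a minimal sketch of group-wise symmetric round-to-nearest INT4 quantization. This is plain round-to-nearest, not GPTQ or AWQ (both add calibration-driven steps on top), and the weights are synthetic Gaussian values with one injected outlier; every name in it is hypothetical.

```python
import random

def quantize_group(ws, n_bits=4):
    """Symmetric round-to-nearest with one scale per group."""
    qmax = 2 ** (n_bits - 1) - 1                    # 7 for INT4
    scale = max(abs(w) for w in ws) / qmax or 1.0   # avoid a zero scale for all-zero groups
    q = [max(-qmax, min(qmax, round(w / scale))) for w in ws]
    return q, scale

def quant_dequant(weights, group_size, n_bits=4):
    """Quantize then dequantize, so the error can be measured in float space."""
    out = []
    for i in range(0, len(weights), group_size):
        q, scale = quantize_group(weights[i:i + group_size], n_bits)
        out.extend(qi * scale for qi in q)
    return out

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
weights = [random.gauss(0, 0.02) for _ in range(4096)]
weights[100] = 0.5   # a single outlier stretches the scale of whichever group contains it

errs = {g: mse(weights, quant_dequant(weights, g)) for g in (4096, 128, 32)}
for g, err in errs.items():
    print(f"group_size={g:5d}  mse={err:.3e}")
```

With one scale for the whole tensor, the outlier forces most ordinary weights to round to zero. Smaller groups confine that damage to the outlier's own group, at the cost of storing more scales.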

Lab 03 — Continuous Batching + Streaming Server

Field	Value
Goal	Build a small inference server with continuous batching and SSE streaming.
Concepts	Static vs dynamic vs continuous batching; per-step admission/eviction; streaming protocol.
Steps	1) FastAPI server with `/v1/completions`. 2) Per-request queue. 3) Async batch worker that, every step, admits new requests and evicts finished ones (continuous batching). 4) Yield tokens via SSE. 5) Benchmark vs naive one-request-at-a-time.
Stack	FastAPI, asyncio, your KV-cached model from Lab 1 (or use HF generate as a starting point)
Output	A working server + a benchmark plot (throughput vs concurrency).
How to Test	Continuous batching delivers ≥ 3× throughput vs sequential at concurrency=16.
Talking Points	Why static batching wastes GPU on long-tail requests. The vLLM scheduling philosophy. The TTFT-vs-throughput tradeoff.
Resume Bullet	"Built an inference server with continuous batching and SSE streaming achieving 3.4× throughput improvement (118 → 401 tok/s aggregate) over naive serial serving at 16 concurrent clients."
Extensions	Add prefix caching for shared system prompts.
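
The scheduling difference between static and continuous batching can be seen in a toy simulation before any server code exists. The sketch below counts decode steps only and assumes every step costs the same regardless of how many slots are occupied, which is a simplification; the request lengths are made up and all names are hypothetical.

```python
import random
from collections import deque

def static_batching_steps(lengths, max_batch):
    """Each batch runs until its longest request finishes; freed slots sit idle."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps

def continuous_batching_steps(lengths, max_batch):
    """Every decode step: admit waiting requests into free slots, then evict finished ones."""
    waiting = deque(lengths)
    running = []          # remaining tokens for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < max_batch:   # admission (FCFS)
            running.append(waiting.popleft())
        steps += 1                                    # one decode step for the whole batch
        running = [r - 1 for r in running if r > 1]   # eviction of finished requests
    return steps

random.seed(0)
lengths = [random.choice([16, 32, 64, 512]) for _ in range(64)]   # long-tailed output lengths
total_tokens = sum(lengths)

s = static_batching_steps(lengths, max_batch=16)
c = continuous_batching_steps(lengths, max_batch=16)
print(f"static:     {s} steps, {total_tokens / s:.1f} tokens/step")
print(f"continuous: {c} steps, {total_tokens / c:.1f} tokens/step")
```

In the real server the same admit/step/evict loop runs inside the async batch worker, with per-request KV-cache state replacing the integer counters.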

Lab 04 — vLLM / TGI Deep Dive

Field	Value
Goal	Deploy a model with vLLM; understand its architecture; benchmark and tune.
Concepts	PagedAttention, scheduler, tensor parallelism, max-num-seqs, gpu-memory-utilization, swap space.
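Steps	1) `vllm serve` with Llama-3-8B-AWQ. 2) Benchmark with vLLM's serving benchmark (`benchmarks/benchmark_serving.py` in the vLLM repo, or `vllm bench serve` in recent releases) against your Lab 3 server. 3) Read the scheduler source (`vllm/core/scheduler.py`; the path differs between versions) and write a 200-word architecture summary. 4) Tune `--max-num-seqs` and `--max-model-len`.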
Stack	vLLM, your Phase 8 eval pipeline
Output	A tuned config + comparison table vs your Lab 3 server.
How to Test	vLLM should beat your Lab 3 server on aggregate throughput at the same concurrency.
Talking Points	Why PagedAttention solves KV fragmentation. Where vLLM's scheduling decisions live. When to use TGI vs vLLM vs TensorRT-LLM.
Resume Bullet	"Deployed Llama-3-8B-AWQ with vLLM PagedAttention; tuned `max-num-seqs` and `gpu-memory-utilization` to achieve 1,420 tok/s sustained throughput at P99 TTFT 230 ms on a single A100-40 GB."
Extensions	Contribute a small fix or doc improvement to vLLM.
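
Before reading the vLLM source, it can help to see the block-table idea behind PagedAttention in isolation. The following is a toy allocator, not vLLM's implementation: the class name, method names, and block size of 16 are all assumptions for illustration. Each sequence owns a list of fixed-size physical blocks, so memory is claimed one block at a time instead of being reserved contiguously for the maximum length.

```python
class BlockAllocator:
    """Toy paged KV-cache: fixed-size blocks handed out on demand from a free list."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}    # seq_id -> list of physical block ids (the block table)
        self.lengths = {}   # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lengths.get(seq_id, 0)
        if n == len(table) * self.block_size:   # last block is full, or no block yet
            if not self.free:
                raise MemoryError("no free KV blocks; a scheduler would preempt or swap here")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the free list."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def slot(self, seq_id, pos):
        """Map a logical token position to (physical block id, offset within block)."""
        return self.tables[seq_id][pos // self.block_size], pos % self.block_size


alloc = BlockAllocator(num_blocks=8, block_size=16)
for _ in range(40):
    alloc.append_token("a")   # 40 tokens need 3 blocks
for _ in range(10):
    alloc.append_token("b")   # 10 tokens need 1 block
print("blocks in use:", 8 - len(alloc.free))
alloc.release("a")
print("blocks in use after release:", 8 - len(alloc.free))
```

Waste is bounded by one partially filled block per sequence, which is the fragmentation argument in the talking points above.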

Lab 05 — Speculative Decoding

Field	Value
Goal	Implement speculative decoding with a small draft + a large verifier; measure speedup.
Concepts	Draft-then-verify, acceptance probability, expected speedup formula.
Steps	1) Pick a small draft (Qwen2-0.5B) and a large verifier (Qwen2-7B). 2) Implement speculative decode: draft K tokens, verify with one parallel forward, accept prefix. 3) Measure tokens/sec vs vanilla decode. 4) Compute acceptance rate.
Stack	`transformers`, custom code
Output	A `spec_decode.py` + a measurement table.
How to Test	Outputs distributionally identical to verifier alone (rejection sampling); speedup 1.5×–2.5×.
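Talking Points	The math: speedup ≈ (1 - α^(K+1)) / ((1 - α)(1 + cK)), where α = per-token acceptance rate, K = number of draft tokens per round, and c = cost of one draft forward pass relative to one verifier pass. Why spec decode preserves the verifier's distribution.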
Resume Bullet	"Implemented speculative decoding using Qwen2-0.5B (draft) + Qwen2-7B (verify) achieving 2.1× decode throughput at 81% acceptance rate while preserving the verifier's exact output distribution."
Extensions	Try Medusa heads (no draft model needed); try lookahead decoding.
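
Both talking points in this lab can be checked numerically. The sketch below evaluates the expected-speedup formula, taking α as the per-token acceptance rate, K as the draft length, and c as the draft-to-verifier cost ratio. It then computes the exact output distribution of a single accept-or-resample step for a toy three-token vocabulary. The probabilities and the cost ratio `c=0.05` are made-up assumptions, and the function names are hypothetical.

```python
def expected_tokens_per_round(alpha, k):
    """Expected tokens emitted per verifier pass: 1 + alpha + ... + alpha^k."""
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)

def expected_speedup(alpha, k, c):
    """Tokens per round divided by round cost (k draft passes at cost c, plus one verifier pass)."""
    return expected_tokens_per_round(alpha, k) / (1 + c * k)

def output_distribution(p, q):
    """Exact distribution of one speculative step.

    Draw x from the draft q, accept with probability min(1, p(x)/q(x));
    on rejection, resample from the normalized residual max(0, p - q).
    """
    accept = [min(pi, qi) for pi, qi in zip(p, q)]        # q(x) * min(1, p(x)/q(x))
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    if z == 0.0:                                           # p == q: nothing is rejected
        return accept
    return [a + reject_mass * r / z for a, r in zip(accept, residual)]

for k in (2, 4, 8):
    print(f"K={k}: speedup ~ {expected_speedup(alpha=0.8, k=k, c=0.05):.2f}x")

p = [0.6, 0.3, 0.1]   # verifier (target) distribution, made up
q = [0.2, 0.5, 0.3]   # draft distribution, made up
print("step output distribution:", output_distribution(p, q))
```

The second function shows why the scheme is lossless: accepted mass `min(p, q)` plus redistributed rejected mass `max(0, p - q)` adds back up to `p` for every token, whatever the draft proposes.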

Deliverables Checklist

KV-cached transformer + memory math
4-way quantization comparison
Continuous-batching streaming server
vLLM deployment + tuning report
Speculative decoding implementation

Interview Relevance

This phase is the direct portfolio for LLM Inference Engineer roles.

"Walk me through KV-cache. What's its memory footprint?"
"Compare static / dynamic / continuous batching"
"Explain PagedAttention"
"Compare GPTQ and AWQ"
"Speculative decoding — derive the speedup"
System design: "Build a 100k-QPS inference gateway" (see system-design/)