Phase 9 — Inference Optimization & Serving

Difficulty: ⭐⭐⭐⭐⭐ | Estimated Time: 2.5 weeks

Roles supported: LLM Inference Engineer, ML Systems Engineer, Performance Engineer. Highest-leverage phase for infrastructure roles.


Why This Phase Exists

The "LLM Inference Engineer" role exists because serving LLMs is fundamentally different from serving classical ML models: KV-cache memory grows with sequence length, requests in a batch finish at different times, and a single bad scheduling decision can multiply your cost by 5×. Companies pay senior salaries for engineers who can move TTFT from 800 ms to 200 ms.

This phase is where your distributed-systems background pays the highest dividend.


Concepts

  • Decode loop anatomy: prefill vs decode phases
  • KV-cache memory math: 2 × n_layers × n_kv_heads × d_head × seq_len × batch × dtype_bytes (n_kv_heads equals n_heads unless the model uses grouped-query or multi-query attention)
  • Memory layout: contiguous vs paged (vLLM PagedAttention)
  • Static vs dynamic vs continuous batching
  • Request scheduling: FCFS, length-based, fairness
  • Prefix caching (system-prompt sharing)
  • Quantization:
    • INT8 weight-only (bitsandbytes)
    • INT4 GPTQ (group-wise, with calibration)
    • INT4 AWQ (activation-aware weight quantization)
    • NF4 (normal float, used by QLoRA)
    • FP8 (needs hardware FP8 support, e.g. H100)
  • Speculative decoding: draft model + verify (math of expected speedup)
  • Medusa heads, Lookahead decoding (overview)
  • FlashAttention-2 / FlashAttention-3 — what they fuse and why it wins
  • CUDA graphs for low-latency decode
  • TensorRT-LLM (overview)
  • Streaming via SSE / WebSocket
  • TTFT vs TPOT vs throughput (and why optimizing one can hurt others)
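
The KV-cache formula in the list above can be turned into a small calculator. The sketch below counts KV heads rather than query heads, which matters for grouped-query models. The Llama-3-8B shape used here (32 layers, 8 KV heads, head dimension 128) and the FP16 cache dtype are assumptions to check against the model config you actually load.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch, dtype_bytes=2):
    """Bytes needed to cache keys and values for every layer, token, and sequence."""
    # The leading 2 accounts for one K tensor plus one V tensor per layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * dtype_bytes


GIB = 1024 ** 3

# Assumed Llama-3-8B shape: 32 layers, 8 KV heads (GQA), d_head = 128, FP16 cache.
gqa = kv_cache_bytes(32, 8, 128, seq_len=8192, batch=32)
# Hypothetical variant with one KV head per query head (no GQA), for comparison.
mha = kv_cache_bytes(32, 32, 128, seq_len=8192, batch=32)
per_token = kv_cache_bytes(32, 8, 128, seq_len=1, batch=1)

print(f"per token:         {per_token // 1024} KiB")   # 128 KiB
print(f"GQA (8 KV heads):  {gqa / GIB:.0f} GiB")       # 32 GiB
print(f"MHA (32 KV heads): {mha / GIB:.0f} GiB")       # 128 GiB
```

At this sequence length and batch size the cache (32 GiB) is already about twice the roughly 16 GB that 8B parameters occupy in FP16, which is the "KV-cache > parameters" regime the labs refer to.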

Labs

Lab 01 — KV-Cache From Scratch + Memory Math

| Field | Value |
| --- | --- |
| Goal | Add a KV-cache to your Phase 4 transformer; verify decode speedup; compute memory exactly. |
| Concepts | Why a KV-cache helps (it avoids recomputing keys and values for past tokens), memory budget, when the KV-cache exceeds parameter memory. |
| Steps | 1) Add past_key_values to MultiHeadAttention.forward. 2) Make decode work step-by-step. 3) Benchmark generation latency with/without cache. 4) Compute KV-cache bytes for Llama-3-8B at seq=8192, batch=32. |
| Stack | PyTorch (your Phase 4 code) |
| Output | KV-cached generation function + a memory-math worksheet. |
| How to Test | Greedy outputs are token-identical with/without cache (logits agree within floating-point tolerance); latency drops by ≥ 10× on long sequences. |
| Talking Points | When KV-cache becomes the dominant memory consumer (long context, batch >> 1). The motivation for paged attention. |
| Resume Bullet | "Implemented KV-cache for a from-scratch decoder transformer; verified token-identical greedy outputs and 14× decode speedup at 1024-token contexts; produced exact memory-budget calculation for production-scale deployments." |
| Extensions | Implement Grouped-Query Attention (e.g. 4× KV-cache memory reduction when 32 query heads share 8 KV heads). |
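
The core idea of this lab can be sketched without PyTorch. The toy below is a single attention head in pure Python, not the Phase 4 model: `KVCache`, `project`, and `attend` are hypothetical names introduced for illustration. It checks that a cached decode step gives the same output for the newest token as recomputing every key and value from scratch.

```python
import math
import random


def project(x, W):
    """Multiply vector x (length d_in) by matrix W (d_in rows, d_out columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]


def attend(q, keys, values):
    """Softmax attention of one query vector over lists of key and value vectors."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def last_output_no_cache(xs, Wq, Wk, Wv):
    """Baseline: re-project K and V for every token in the context on each step."""
    keys = [project(x, Wk) for x in xs]
    values = [project(x, Wv) for x in xs]
    return attend(project(xs[-1], Wq), keys, values)


class KVCache:
    """Append-only store of past keys and values for one sequence."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x, Wq, Wk, Wv):
        # Only the newest token is projected; earlier K/V entries are reused.
        self.keys.append(project(x, Wk))
        self.values.append(project(x, Wv))
        return attend(project(x, Wq), self.keys, self.values)


def rand_matrix(d, rng):
    return [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]


rng = random.Random(0)
d = 4
Wq, Wk, Wv = rand_matrix(d, rng), rand_matrix(d, rng), rand_matrix(d, rng)
xs = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(6)]

cache = KVCache()
for t in range(len(xs)):
    cached = cache.step(xs[t], Wq, Wk, Wv)
    full = last_output_no_cache(xs[: t + 1], Wq, Wk, Wv)
    assert all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
               for a, b in zip(cached, full))
```

The uncached path performs one K and V projection per context token on every step, while the cached path performs exactly one, which is where the decode speedup on long sequences comes from.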

Lab 02 — Quantization: INT8 / INT4 / GPTQ / AWQ

| Field | Value |
| --- | --- |
| Goal | Quantize Llama-3-8B four ways; measure quality vs memory vs latency. |
| Concepts | Weight-only vs activation quantization, calibration sets, group-size effects, accuracy degradation. |
| Steps | 1) Load Llama-3-8B. 2) Apply: bitsandbytes INT8, bitsandbytes NF4, GPTQ INT4 (auto-gptq), AWQ INT4 (autoawq). 3) Measure VRAM, decode tok/s, and MMLU on each. |
| Stack | bitsandbytes, auto-gptq, autoawq, transformers, your Phase 8 eval harness |
| Output | A 4×3 table (four methods × three metrics): VRAM, throughput, MMLU. |
| How to Test | INT4 should fit in ~5 GB; MMLU drop < 2 points for AWQ. |
| Talking Points | Why AWQ tends to beat GPTQ on instruction-following models. Why activation outliers make naive INT8 hard. The role of calibration data. |
| Resume Bullet | "Quantized Llama-3-8B four ways (INT8, NF4, GPTQ-INT4, AWQ-INT4), producing a quality/throughput/memory tradeoff table: AWQ achieved 4.8 GB VRAM and 87 tok/s on a 4090 with <1.5-point MMLU drop." |
| Extensions | Try FP8 on H100; compare with SmoothQuant. |
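
To build intuition for group-size effects before reaching for the libraries, here is a minimal sketch of symmetric round-to-nearest INT4 quantization with one scale per group. This is deliberately the naive baseline, not GPTQ or AWQ (both of which use calibration data to choose better roundings or scales). The function names and the synthetic weights are assumptions for illustration.

```python
import random


def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one group sharing a single scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Clamp to the signed integer range [-8, 7] for 4 bits.
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def quantize(weights, group_size=128, bits=4):
    """Split a flat weight list into groups and quantize each independently."""
    return [quantize_group(weights[i:i + group_size], bits)
            for i in range(0, len(weights), group_size)]


def dequantize(groups):
    """Map integer codes back to floats using each group's scale."""
    return [c * scale for codes, scale in groups for c in codes]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


rng = random.Random(0)
w = [rng.gauss(0, 0.02) for _ in range(1024)]   # synthetic weights, not from a real model
w[5] = 1.0                                      # a single outlier inflates its group's scale

errors = {}
for g in (1024, 128, 32):
    errors[g] = mean_abs_error(w, dequantize(quantize(w, group_size=g)))
    print(f"group_size={g:4d}  mean abs error={errors[g]:.5f}")
```

Smaller groups confine the damage from an outlier to fewer weights, at the price of storing more scales: assuming one FP16 scale per group and no zero-point, group size 128 costs 4 + 16/128 = 4.125 bits per weight.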

Lab 03 — Continuous Batching + Streaming Server

| Field | Value |
| --- | --- |
| Goal | Build a small inference server with continuous batching and SSE streaming. |
| Concepts | Static vs dynamic vs continuous batching; per-step admission/eviction; streaming protocol. |
| Steps | 1) FastAPI server with /v1/completions. 2) Per-request queue. 3) Async batch worker that, every step, admits new requests and evicts finished ones (continuous batching). 4) Yield tokens via SSE. 5) Benchmark vs naive one-request-at-a-time. |
| Stack | FastAPI, asyncio, your KV-cached model from Lab 1 (or use HF generate as a starting point) |
| Output | A working server + a benchmark plot (throughput vs concurrency). |
| How to Test | Continuous batching delivers ≥ 3× throughput vs sequential at concurrency=16. |
| Talking Points | Why static batching wastes GPU on long-tail requests. The vLLM scheduling philosophy. The TTFT-vs-throughput tradeoff. |
| Resume Bullet | "Built an inference server with continuous batching and SSE streaming achieving 3.4× throughput improvement (118 → 401 tok/s aggregate) over naive serial serving at 16 concurrent clients." |
| Extensions | Add prefix caching for shared system prompts. |
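
Before wiring up FastAPI, the scheduling difference can be seen in a simulation. The sketch below counts decode steps only and makes strong simplifying assumptions: every step costs the same regardless of batch size, prefill is free, and all requests are queued at time zero. The request lengths are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, max_batch):
    """Each batch runs until its longest request finishes; freed slots sit idle."""
    steps = 0
    for i in range(0, len(lengths), max_batch):
        steps += max(lengths[i:i + max_batch])
    return steps


def continuous_batching_steps(lengths, max_batch):
    """Every decode step: admit waiting requests into free slots, then evict finished ones."""
    waiting = deque(lengths)
    running = []            # remaining tokens to generate for each active request
    steps = 0
    while waiting or running:
        # Admission (FCFS): fill any free slots from the queue.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One decode step emits one token for every running request.
        running = [r - 1 for r in running]
        steps += 1
        # Eviction: drop requests that just produced their last token.
        running = [r for r in running if r > 0]
    return steps


# Hypothetical workload: two long-tail requests among short ones.
lengths = [200, 10, 10, 10, 200, 10, 10, 10]
total_tokens = sum(lengths)

static = static_batching_steps(lengths, max_batch=4)
continuous = continuous_batching_steps(lengths, max_batch=4)

print(f"static:     {static} steps, {total_tokens / static:.2f} tok/step")          # 400 steps
print(f"continuous: {continuous} steps, {total_tokens / continuous:.2f} tok/step")  # 210 steps
```

In the static case the second long request cannot start until the first batch drains; with per-step admission it starts as soon as a short request frees a slot, so the two long requests overlap almost entirely.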

Lab 04 — vLLM / TGI Deep Dive

| Field | Value |
| --- | --- |
| Goal | Deploy a model with vLLM; understand its architecture; benchmark and tune. |
| Concepts | PagedAttention, scheduler, tensor parallelism, max-num-seqs, gpu-memory-utilization, swap space. |
| Steps | 1) vllm serve with Llama-3-8B-AWQ. 2) Benchmark with vLLM's serving benchmark (benchmarks/benchmark_serving.py in the repo, or the vllm bench serve CLI in newer releases) against your Lab 3 server. 3) Read the scheduler source (vllm/core/scheduler.py; the path differs in newer releases) and write a 200-word architecture summary. 4) Tune max-num-seqs, max-model-len. |
| Stack | vLLM, your Phase 8 eval pipeline |
| Output | A tuned config + comparison table vs your Lab 3 server. |
| How to Test | vLLM should clearly beat your hand-rolled Lab 3 server on throughput at equal concurrency. |
| Talking Points | Why PagedAttention solves KV fragmentation. Where vLLM's scheduling decisions live. When to use TGI vs vLLM vs TensorRT-LLM. |
| Resume Bullet | "Deployed Llama-3-8B-AWQ with vLLM PagedAttention; tuned max-num-seqs and gpu-memory-utilization to achieve 1,420 tok/s sustained throughput at P99 TTFT 230 ms on a single A100 40 GB." |
| Extensions | Contribute a small fix or doc improvement to vLLM. |
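
To make the PagedAttention talking point concrete, here is a toy block allocator in the spirit of paged KV memory. It is an illustration only, not vLLM's implementation: the class name, method names, and block size are hypothetical. Each sequence holds a block table mapping its logical positions to fixed-size physical blocks, allocated on demand instead of reserving a contiguous region for the maximum length up front.

```python
class PagedKVAllocator:
    """Toy allocator: KV memory is split into fixed-size blocks handed out on demand."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # seq_id -> list of physical block ids
        self.seq_lens = {}       # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        """Reserve room for one more token; grab a new block only when the last is full."""
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:
            if not self.free_blocks:
                # A real scheduler would preempt or queue here instead of failing.
                raise MemoryError("no free KV blocks")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)

    def wasted_slots(self):
        """Allocated but unused token slots: at most block_size - 1 per sequence."""
        return sum(len(table) * self.block_size - self.seq_lens[seq_id]
                   for seq_id, table in self.block_tables.items())


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(20):
    alloc.append_token("a")          # 20 tokens occupy 2 blocks
for _ in range(5):
    alloc.append_token("b")          # 5 tokens occupy 1 block

blocks_in_use = sum(len(t) for t in alloc.block_tables.values())
waste_before_free = alloc.wasted_slots()
print(blocks_in_use, waste_before_free)   # 3 blocks in use, 23 unused slots

alloc.free("a")                           # blocks return to the pool immediately
print(len(alloc.free_blocks))             # 7
```

Internal waste is bounded by one partially filled block per sequence, and freed blocks are reusable by any other sequence, which is why paging avoids the fragmentation of contiguous per-request reservations.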

Lab 05 — Speculative Decoding

| Field | Value |
| --- | --- |
| Goal | Implement speculative decoding with a small draft + a large verifier; measure speedup. |
| Concepts | Draft-then-verify, acceptance probability, expected speedup formula. |
| Steps | 1) Pick a small draft (Qwen2-0.5B) and a large verifier (Qwen2-7B). 2) Implement speculative decode: draft K tokens, verify with one parallel forward, accept the longest prefix that passes rejection sampling, and resample the first rejected position from the verifier. 3) Measure tokens/sec vs vanilla decode. 4) Compute acceptance rate. |
| Stack | transformers, custom code |
| Output | A spec_decode.py + a measurement table. |
| How to Test | Outputs distributionally identical to verifier alone (rejection sampling); speedup 1.5×–2.5×. |
| Talking Points | The math: speedup ≈ (1 − α^(K+1)) / ((1 − α)(1 + cK)), where α is the per-token acceptance rate, K is the number of drafted tokens per round, and c is the cost of one draft forward pass relative to one verifier pass. Why spec decode preserves the verifier's distribution. |
| Resume Bullet | "Implemented speculative decoding using Qwen2-0.5B (draft) + Qwen2-7B (verify) achieving 2.1× decode throughput at 81% acceptance rate while preserving the verifier's exact output distribution." |
| Extensions | Try Medusa heads (no draft model needed); try lookahead decoding. |
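
The two pieces of math in this lab, the expected-speedup formula and the accept-or-resample rule, can be checked on toy distributions before touching real models. In the sketch below the three-token distributions `p` and `q`, the function names, and the cost ratio c = 0.05 are all assumptions for illustration. The acceptance rule follows the standard speculative-sampling scheme: accept a drafted token with probability min(1, p/q), otherwise resample from the normalized residual max(0, p − q).

```python
import random


def expected_speedup(alpha, k, c):
    """Expected speedup for acceptance rate alpha, k drafted tokens, relative draft cost c."""
    if alpha == 1.0:
        return (k + 1) / (1 + c * k)      # limit of the formula as alpha -> 1
    return (1 - alpha ** (k + 1)) / ((1 - alpha) * (1 + c * k))


def accept_or_resample(token, p_target, q_draft, rng=random):
    """Verify one drafted token; returns (output_token, accepted)."""
    # q_draft[token] > 0 because the token was sampled from q_draft.
    if rng.random() < min(1.0, p_target[token] / q_draft[token]):
        return token, True
    # Rejection implies p < q at this token, so the residual has positive mass.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    return rng.choices(range(len(residual)), weights=residual)[0], False


rng = random.Random(0)
p = [0.6, 0.3, 0.1]    # hypothetical verifier distribution
q = [0.3, 0.3, 0.4]    # hypothetical draft distribution

n = 100_000
counts = [0, 0, 0]
accepted = 0
for _ in range(n):
    drafted = rng.choices(range(3), weights=q)[0]
    token, ok = accept_or_resample(drafted, p, q, rng)
    counts[token] += 1
    accepted += ok

freqs = [c / n for c in counts]
print("empirical output distribution:", [round(f, 3) for f in freqs])
print("empirical acceptance rate:", round(accepted / n, 3))
print("speedup at alpha=0.8, K=4, c=0.05:", round(expected_speedup(0.8, 4, 0.05), 2))  # 2.8
```

The empirical output frequencies should approach `p` even though every proposal came from `q`, and the acceptance rate should approach the sum of min(p, q) over tokens (0.7 for these toy values), which is the α that feeds the speedup formula.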

Deliverables Checklist

  • KV-cached transformer + memory math
  • 4-way quantization comparison
  • Continuous-batching streaming server
  • vLLM deployment + tuning report
  • Speculative decoding implementation

Interview Relevance

The labs in this phase serve directly as a portfolio for LLM Inference Engineer roles.

  • "Walk me through KV-cache. What's its memory footprint?"
  • "Compare static / dynamic / continuous batching"
  • "Explain PagedAttention"
  • "Compare GPTQ and AWQ"
  • "Speculative decoding — derive the speedup"
  • System design: "Build a 100k-QPS inference gateway" (see system-design/)