The "LLM Inference Engineer" role exists because serving LLMs is fundamentally different from serving classical ML models: KV-cache memory grows with sequence length, requests in the same batch finish at different times, and a single bad scheduling decision can multiply your cost by 5×. Companies pay senior salaries for engineers who can move time to first token (TTFT) from 800 ms to 200 ms.
This phase is where your distributed-systems background pays the highest dividend.
Add a KV-cache to your Phase 4 transformer; verify decode speedup; compute memory exactly.
Concepts
Why a KV-cache helps (it avoids recomputing the keys and values of past tokens at every decode step), the memory budget, and when KV-cache memory exceeds parameter memory.
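To make the recomputation argument concrete, here is a minimal framework-free sketch of single-head causal attention during decode. It uses plain Python instead of your Phase 4 PyTorch code so it runs anywhere. The names `attend`, `project`, `decode_no_cache`, and `decode_with_cache` are illustrative, and the "projection" is a stand-in elementwise scaling rather than a learned linear layer.

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all keys/values so far."""
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                                  # subtract max for stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def project(x, scale):
    # Stand-in for a learned linear layer (assumption: elementwise scaling).
    return [scale * xi for xi in x]

def decode_no_cache(tokens):
    """Recompute K and V for every past token at every step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        keys = [project(x, 0.5) for x in tokens[:t]]
        values = [project(x, 2.0) for x in tokens[:t]]
        projections += 2 * t
        outputs.append(attend(project(tokens[t - 1], 1.0), keys, values))
    return outputs, projections

def decode_with_cache(tokens):
    """Project only the newest token and append it to the cached K and V."""
    keys, values, outputs, projections = [], [], [], 0
    for x in tokens:
        keys.append(project(x, 0.5))
        values.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(project(x, 1.0), keys, values))
    return outputs, projections

tokens = [[0.1 * (i + j) for j in range(4)] for i in range(8)]
out_a, work_a = decode_no_cache(tokens)
out_b, work_b = decode_with_cache(tokens)
print(work_a, work_b)  # K/V projections performed by each variant
```

The uncached path performs a number of projections that grows quadratically with sequence length, while the cached path grows linearly; both produce the same attention outputs. In the real lab the cache would be the `past_key_values` tensors you concatenate to along the sequence dimension.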
Steps
1) Add past_key_values to MultiHeadAttention.forward. 2) Make decode work step-by-step. 3) Benchmark generation latency with/without cache. 4) Compute KV-cache bytes for Llama-3-8B at seq=8192, batch=32.
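Step 4 is plain arithmetic. A minimal sketch, assuming the published Llama-3-8B shape (32 layers, 8 KV heads under grouped-query attention, head dimension 128), an fp16/bf16 cache, and roughly 8.03B parameters; the helper name `kv_cache_bytes` is illustrative:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_elem=2):
    # Factor 2: one K tensor and one V tensor per layer.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)

# Assumed Llama-3-8B attention shape (check against the model's config.json).
LLAMA3_8B = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(**LLAMA3_8B, seq_len=1, batch_size=1)
total = kv_cache_bytes(**LLAMA3_8B, seq_len=8192, batch_size=32)
weights = 8.03e9 * 2  # assumed ~8.03B parameters stored in fp16/bf16

print(f"per token : {per_token / 2**10:.0f} KiB")
print(f"KV-cache  : {total / 2**30:.1f} GiB")
print(f"weights   : {weights / 2**30:.1f} GiB")
```

Under these assumptions the cache costs 128 KiB per token and 32 GiB in total, about twice the roughly 15 GiB of fp16 weights. That is the "KV-cache > parameters" regime from the Concepts list.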
Stack
PyTorch (your Phase 4 code)
Output
KV-cached generation function + a memory-math worksheet.
How to Test
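Outputs are bit-equivalent with and without the cache (if kernels or reduction order differ between the two paths, require identical greedy tokens and logits within floating-point tolerance instead); generation latency drops by ≥ 10× on long sequences.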
Talking Points
When KV-cache becomes the dominant memory consumer (long context, batch >> 1). The motivation for paged attention.
Resume Bullet
"Implemented KV-cache for a from-scratch decoder transformer; verified bit-equivalent outputs and 14× decode speedup at 1024-token contexts; produced exact memory-budget calculation for production-scale deployments."
Stack
bitsandbytes, auto-gptq, autoawq, transformers, your Phase 8 eval harness
Output
A 4×3 table: four quantization methods (INT8, NF4, GPTQ-INT4, AWQ-INT4) by three metrics (VRAM, throughput, MMLU).
How to Test
The INT4 variants should fit in ~5 GB of VRAM; the MMLU drop for AWQ should be < 2 points relative to the unquantized baseline.
Talking Points
Why AWQ tends to beat GPTQ on instruction-following models. Why activation outliers make naive INT8 hard. The role of calibration data.
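The outlier point is easy to demonstrate numerically. A minimal sketch, assuming naive per-tensor absmax INT8 quantization (one scale for the whole tensor); the helper names and the toy activation values are made up for illustration:

```python
def quantize_absmax_int8(xs):
    """Per-tensor absmax quantization: one scale maps max |x| to 127."""
    scale = max(abs(x) for x in xs) / 127.0
    q = [max(-127, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

def mean_abs_error(xs):
    q, scale = quantize_absmax_int8(xs)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(xs, restored)) / len(xs)

typical = [0.01 * i for i in range(-50, 51)]  # values in [-0.5, 0.5]
with_outlier = typical + [60.0]               # one large activation

err_typical = mean_abs_error(typical)
err_outlier = mean_abs_error(with_outlier)
print(f"no outlier  : {err_typical:.5f}")
print(f"one outlier : {err_outlier:.5f}")
```

A single large value stretches the scale so far that most ordinary values round to 0 or ±1 quantization steps, and the mean error grows by well over an order of magnitude. That is why practical schemes treat outlier channels separately or pick scales from calibration data.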
Resume Bullet
"Quantized Llama-3-8B four ways (INT8, NF4, GPTQ-INT4, AWQ-INT4), producing a quality/throughput/memory tradeoff table: AWQ achieved 4.8 GB VRAM and 87 tok/s on a 4090 with <1.5-point MMLU drop."
Build a small inference server with continuous batching and SSE streaming.
Concepts
Static vs dynamic vs continuous batching; per-step admission/eviction; streaming protocol.
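A toy simulation shows why per-step admission and eviction matter. This sketch assumes every decode step costs the same regardless of how many slots are occupied and ignores prefill entirely; the function names and request lengths are illustrative.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """Admit waiting requests into free slots before every decode step."""
    waiting = deque(lengths)
    running = []  # remaining tokens to generate for each active request
    steps = 0
    while waiting or running:
        # Admission: fill any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One decode step emits one token for every running request.
        running = [r - 1 for r in running]
        steps += 1
        # Eviction: drop requests that just finished.
        running = [r for r in running if r > 0]
    return steps

lengths = [10, 200, 15, 12, 180, 8, 20, 11]  # output tokens per request
static_steps = static_batching_steps(lengths, batch_size=4)
continuous_steps = continuous_batching_steps(lengths, batch_size=4)
print(static_steps, continuous_steps)
print(sum(lengths) / static_steps, sum(lengths) / continuous_steps)
```

With this workload static batching needs 380 steps and continuous batching needs 200, because short requests no longer hold a slot hostage while the 200-token request finishes.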
Steps
1) FastAPI server with /v1/completions. 2) Per-request queue. 3) Async batch worker that, every step, admits new requests and evicts finished ones (continuous batching). 4) Yield tokens via SSE. 5) Benchmark vs naive one-request-at-a-time.
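For step 4, the wire format is simple: each SSE event is a `data: ...` line followed by a blank line. A minimal sketch of the token-to-event generator using only the standard library; the `{"text": ...}` payload shape, the `None` sentinel, and the closing `[DONE]` event are assumptions modeled on common OpenAI-style streams, not something this lab prescribes.

```python
import asyncio
import json

def sse_event(payload):
    """Frame one JSON payload as a Server-Sent Event."""
    return f"data: {json.dumps(payload)}\n\n"

async def token_stream(token_queue):
    """Yield SSE chunks as the batch worker pushes tokens for one request."""
    while True:
        token = await token_queue.get()
        if token is None:  # sentinel (assumed): generation finished
            yield "data: [DONE]\n\n"
            return
        yield sse_event({"text": token})

async def demo():
    # Stand-in for the batch worker: pre-fill the per-request queue.
    queue = asyncio.Queue()
    for tok in ["Hello", ",", " world", None]:
        queue.put_nowait(tok)
    return [chunk async for chunk in token_stream(queue)]

chunks = asyncio.run(demo())
for chunk in chunks:
    print(repr(chunk))
```

In the FastAPI handler, a generator like `token_stream` can be wrapped in `StreamingResponse(..., media_type="text/event-stream")`, with the batch worker pushing into the request's queue after every decode step.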
Stack
FastAPI, asyncio, your KV-cached model from Lab 1 (or use HF generate as a starting point)
Output
A working server + a benchmark plot (throughput vs concurrency).
How to Test
Continuous batching delivers ≥ 3× throughput vs sequential at concurrency=16.
Talking Points
Why static batching wastes GPU on long-tail requests. The vLLM scheduling philosophy. The TTFT-vs-throughput tradeoff.
Resume Bullet
"Built an inference server with continuous batching and SSE streaming achieving 3.4× throughput improvement (118 → 401 tok/s aggregate) over naive serial serving at 16 concurrent clients."
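Steps
1) Run vllm serve with Llama-3-8B-AWQ. 2) Benchmark it against your Lab 3 server using vLLM's serving benchmark (benchmarks/benchmark_serving.py in the vLLM repository; newer releases expose it as vllm bench serve). 3) Read vllm/core/scheduler.py and write a 200-word architecture summary. 4) Tune max-num-seqs and max-model-len.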
Stack
vLLM, your Phase 8 eval pipeline
Output
A tuned config + comparison table vs your Lab 3 server.
How to Test
vLLM should clearly beat your hand-rolled Lab 3 server on aggregate throughput at the same concurrency; record the ratio in the comparison table.
Talking Points
Why PagedAttention solves KV fragmentation. Where vLLM's scheduling decisions live. When to use TGI vs vLLM vs TensorRT-LLM.
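The fragmentation argument can be checked with a few lines of arithmetic. This is a toy model, not vLLM's implementation: it assumes a block size of 16 tokens, compares it against a server that reserves `max_model_len` slots per request, and uses a hypothetical `BlockAllocator` to show how a sequence maps to non-contiguous physical blocks.

```python
import math

BLOCK_SIZE = 16  # tokens per KV block (assumed for this sketch)

def contiguous_reserved(seq_lens, max_model_len):
    """Slots reserved when every request pre-allocates max_model_len tokens."""
    return len(seq_lens) * max_model_len

def paged_reserved(seq_lens, block_size=BLOCK_SIZE):
    """Slots reserved when each request holds only the blocks it has touched."""
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)

class BlockAllocator:
    """Hypothetical block table: sequence id -> list of physical block ids."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.tables = {}

    def append_token(self, seq_id, position):
        table = self.tables.setdefault(seq_id, [])
        if position // BLOCK_SIZE >= len(table):  # crossed into a new block
            table.append(self.free.pop())

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id))

seq_lens = [37, 512, 90, 1300, 64]
used = sum(seq_lens)
contiguous = contiguous_reserved(seq_lens, max_model_len=2048)
paged = paged_reserved(seq_lens)
print(f"used={used} contiguous={contiguous} paged={paged}")

allocator = BlockAllocator(num_blocks=8)
for pos in range(20):  # 20 tokens need two 16-token blocks
    allocator.append_token("req-a", pos)
print(allocator.tables["req-a"], len(allocator.free))
```

In this example contiguous reservation holds 10,240 slots for 2,003 live tokens, while paging holds 2,032; the only waste is the partially filled last block of each sequence.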
Resume Bullet
"Deployed Llama-3-8B-AWQ with vLLM PagedAttention; tuned max-num-seqs and gpu-memory-utilization to achieve 1,420 tok/s sustained throughput at P99 TTFT 230 ms on a single A100-40 GB."
Extensions
Contribute a small fix or doc improvement to vLLM.
Steps
1) Pick a small draft (Qwen2-0.5B) and a large verifier (Qwen2-7B). 2) Implement speculative decode: draft K tokens, verify with one parallel forward, accept prefix. 3) Measure tokens/sec vs vanilla decode. 4) Compute acceptance rate.
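Step 2 can be prototyped on toy distributions before you touch real models. In this sketch `draft_dist` and `target_dist` are hypothetical stand-ins that map a context to a probability vector over a 3-token vocabulary; in `spec_decode.py` they would come from the draft and verifier logits, and the verifier's K+1 distributions would come from one batched forward pass rather than a Python loop.

```python
import random

def sample(dist, rng):
    return rng.choices(range(len(dist)), weights=dist)[0]

def residual(p, q):
    """Normalized max(0, p - q): the distribution to resample from on rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    return [d / total for d in diff] if total > 0 else list(p)

def speculative_step(prefix, draft_dist, target_dist, k, rng):
    # 1) Draft k tokens autoregressively with the small model.
    drafted, q_dists, ctx = [], [], list(prefix)
    for _ in range(k):
        q = draft_dist(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_dists.append(q)
        ctx.append(tok)
    # 2) Verify: target distributions at all k+1 positions.
    p_dists = [target_dist(list(prefix) + drafted[:i]) for i in range(k + 1)]
    # 3) Accept the longest prefix; on the first rejection, resample and stop.
    out = []
    for tok, q, p in zip(drafted, q_dists, p_dists):
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            out.append(sample(residual(p, q), rng))
            return out
    out.append(sample(p_dists[k], rng))  # all accepted: one bonus token
    return out

def exact_output_distribution(p, q):
    """Analytic distribution of one accept-or-resample decision."""
    accepted = [qi * min(1.0, pi / qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)
    return [a + reject_mass * r for a, r in zip(accepted, residual(p, q))]

def target_dist(ctx):  # toy verifier (assumption)
    return [0.6, 0.3, 0.1] if len(ctx) % 2 == 0 else [0.2, 0.5, 0.3]

def draft_dist(ctx):   # toy draft model (assumption)
    return [0.3, 0.5, 0.2]

rng = random.Random(0)
tokens = speculative_step([0], draft_dist, target_dist, k=4, rng=rng)
check = exact_output_distribution([0.6, 0.3, 0.1], [0.3, 0.5, 0.2])
print(tokens, check)
```

`exact_output_distribution` works through the accept-or-resample rule analytically for one position and recovers the target distribution, which is the property the How to Test criterion relies on. Each step emits between 1 and K+1 tokens.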
Stack
transformers, custom code
Output
A spec_decode.py + a measurement table.
How to Test
Outputs are distributionally identical to the verifier alone (the rejection-sampling acceptance rule guarantees this); speedup over vanilla decode is 1.5×–2.5×.
Talking Points
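The math: expected speedup ≈ (1 - α^(K+1)) / ((1-α)(1 + cK)), where α is the per-token acceptance rate, K is the number of drafted tokens per step, and c is the cost of one draft forward pass relative to one verifier forward pass. Why spec decode preserves the verifier's distribution.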
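The formula is easy to explore numerically. A small sketch, reading α as the per-token acceptance rate, K as the number of drafted tokens per step, and c as the cost of one draft forward pass relative to one verifier forward pass (that reading of c is an assumption, since the text leaves it undefined); the values 0.8 and 0.05 are illustrative, not measurements.

```python
def expected_speedup(alpha, k, c):
    """(1 - alpha^(k+1)) / ((1 - alpha) * (1 + c*k))."""
    if alpha == 1.0:
        expected_tokens = k + 1  # limit of the geometric sum
    else:
        expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    relative_cost = 1 + c * k    # k draft passes plus one verifier pass
    return expected_tokens / relative_cost

alpha, c = 0.8, 0.05
for k in range(1, 9):
    print(k, round(expected_speedup(alpha, k, c), 3))

best_k = max(range(1, 33), key=lambda k: expected_speedup(alpha, k, c))
print("best K:", best_k)
```

With α = 0.8, c = 0.05, and K = 4 the formula gives about 2.80×. The numerator saturates as K grows while the denominator keeps growing, so there is a finite best K for any α < 1 and c > 0.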
Resume Bullet
"Implemented speculative decoding using Qwen2-0.5B (draft) + Qwen2-7B (verify) achieving 2.1× decode throughput at 81% acceptance rate while preserving the verifier's exact output distribution."
Extensions
Try Medusa heads (no draft model needed); try lookahead decoding.