Capstone 05 — Mini-vLLM: Build Your Own Inference Engine
Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 3–5 weeks
Real-world parallel: vLLM, NVIDIA TensorRT-LLM, Hugging Face TGI, Together AI's serving stack, Anthropic's internal inference. The single most impactful capstone for Inference Engineer / Performance Engineer roles at frontier labs.
Goals
Build a production-grade LLM inference engine from scratch that can serve a 7B model with throughput within 2× of vLLM on a single GPU. Implement:
- PagedAttention — block-based KV cache, no fragmentation, prefix sharing.
- Continuous batching — new requests join the running batch at decode-step boundaries.
- A scheduler — admission control, priority, preemption, recompute-on-evict.
- OpenAI-compatible HTTP API: `/v1/chat/completions` (streaming + non-streaming).
- Speculative decoding: small draft model verified by the target model.
- Quantized weights — INT8/INT4 GPTQ or AWQ loader.
- Benchmarks — throughput, p50/p95/p99 TTFT and ITL, vs vLLM as a reference.
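To make the continuous-batching goal concrete, here is a toy simulation with no model behind it. It is a minimal sketch: `Request`, `run`, and `max_batch` are hypothetical names, and a "decode step" just decrements a counter. The point is only that admission happens at every step boundary, so a waiting request takes a slot as soon as any running request finishes.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: str
    tokens_left: int  # decode steps this request still needs (hypothetical field)


def run(requests, max_batch):
    """Simulate iteration-level scheduling; return the step at which each request finishes."""
    waiting = deque(requests)
    running = []
    finished = {}
    step = 0
    while waiting or running:
        # Admit new requests at the step boundary, up to the batch limit.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        step += 1
        # One decode step: every running request emits one token.
        for req in running:
            req.tokens_left -= 1
            if req.tokens_left == 0:
                finished[req.rid] = step
        # Finished requests leave immediately, freeing slots for the next step.
        running = [r for r in running if r.tokens_left > 0]
    return finished


finish = run([Request("a", 2), Request("b", 5), Request("c", 3)], max_batch=2)
print(finish)
```

With a static batch, request `c` could not start until both `a` and `b` had finished; here it takes over `a`'s slot one step after `a` completes.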
Architecture
┌──────────────────────────────────────────────┐
│ HTTP Server (FastAPI / uvicorn) │
│ - OpenAI-compatible /v1/chat/completions │
│ - SSE streaming │
│ - Request validation, auth, rate-limit │
└─────────────────────┬────────────────────────┘
▼
┌──────────────────────────────────────────────┐
│ Scheduler (the brain) │
│ - Waiting / Running / Swapped queues │
│ - Per-step: prefill batch + decode batch │
│ - Preemption + recompute on cache pressure │
│ - Prefix-cache lookup │
└─────────────────────┬────────────────────────┘
▼
┌──────────────────────────────────────────────────────────────────┐
│ Model Runner │
│ ┌──────────────────┐ ┌────────────────────┐ ┌──────────────┐ │
│ │ Block Manager │ │ Paged KV Cache │ │ Sampler │ │
│ │ - free list │ │ - phys blocks: 16 │ │ - greedy │ │
│ │ - block table │ │ tokens each │ │ - top-k/p │ │
│ │ - ref counts │ │ - per-layer K, V │ │ - temp │ │
│ │ - copy-on-write │ │ - INT8 optional │ │ - logit bias│ │
│ └──────────────────┘ └────────────────────┘ └──────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Forward (custom CUDA / FlashAttention-2 + paged attention) │ │
│ │ - Prefill kernel (compute-bound, big tile) │ │
│ │ - Decode kernel (memory-bound, small batch) │ │
│ └────────────────────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────────────────┐
│ Speculative Decoding (optional layer) │
│ - Draft model proposes K tokens │
│ - Target model verifies in one parallel pass │
│ - Accept longest matching prefix │
└──────────────────────────────────────────────────────────────────┘
Observability: per-request trace, /metrics (Prometheus), GPU utilization
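The "accept longest matching prefix" step in the speculative decoding box can be sketched for the greedy case as follows. This is an illustration under assumptions, not the engine's actual code: `accept_longest_prefix` is a hypothetical helper, tokens are plain integers, and the target model is assumed to return K+1 greedy predictions from its single verification pass (one per draft position, plus one extra). Sampling-based verification needs a rejection-sampling rule instead of exact matching.

```python
def accept_longest_prefix(draft_tokens, target_tokens):
    """Greedy verification of K draft tokens.

    draft_tokens:  K tokens proposed by the draft model.
    target_tokens: K+1 tokens, the target model's greedy choice at each of the
                   K draft positions plus one extra position after the last draft.
    Returns the tokens to append to the sequence this step.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("expected K+1 target tokens for K draft tokens")
    accepted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            # First mismatch: keep the target's token and discard the rest of the draft.
            accepted.append(verified)
            return accepted
        accepted.append(drafted)
    # Every draft token matched, so the target's extra prediction comes for free.
    accepted.append(target_tokens[-1])
    return accepted


print(accept_longest_prefix([5, 7, 9], [5, 7, 2, 4]))
print(accept_longest_prefix([5, 7, 9], [5, 7, 9, 4]))
```

Each call yields at least one token, so the output always matches what plain greedy decoding with the target model would have produced.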
Suggested Stack
| Component | Choice | Why |
|---|---|---|
| Language | Python + CUDA (or Triton) | Mirrors vLLM's stack |
| Model loader | safetensors + Hugging Face configs | Industry standard |
| Attention kernel | flash-attn (paged) or your own Triton kernel | FA2 is the realistic choice |
| Quantization | GPTQ (auto-gptq) or AWQ (autoawq) | Both common |
| Draft model | TinyLlama-1.1B for Llama-7B target | 5–7× smaller is the sweet spot |
| HTTP | FastAPI + uvicorn (or aiohttp) | OpenAI-compatible bindings already exist |
| Metrics | prometheus_client | Standard for serving infra |
| Reference | vLLM v0.5+ for benchmarking | The bar |
Deliverables Checklist
- `engine/block_manager.py`: physical block pool, allocate/free, ref-counts, copy-on-write for prefix sharing
- `engine/paged_attention.py`: paged attention forward (prefill + decode kernels via FA2 or Triton)
- `engine/scheduler.py`: request queues, batching policy, preemption, prefix-cache hits
- `engine/model_runner.py`: Llama / Qwen forward with paged KV
- `engine/sampler.py`: greedy, top-k, top-p, temperature, repetition penalty, logit bias
- `engine/spec_decode.py`: draft + verify with longest-prefix accept
- `server/api.py`: FastAPI OpenAI-compatible endpoints (chat, completions, models, health)
- `server/streaming.py`: SSE token streaming
- `bench/throughput.py`: sweep batch sizes, sequence lengths; output CSV + plot
- `bench/latency.py`: p50/p95/p99 TTFT + ITL under concurrent load
- `bench/vs_vllm.md`: head-to-head comparison report
- `Dockerfile` + `docker-compose.yml`: one-command deploy
- `ARCHITECTURE.md`: block diagram + scheduler state machine
- `WRITEUP.md`: what each optimization bought you (in numbers)
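As a starting point for the block manager deliverable, the bookkeeping (free list, per-request block table, ref counts, copy-on-write) fits in plain Python before any kernel work. The sketch below is one possible shape, not vLLM's implementation. The class and method names are hypothetical, and it only tracks block ids; copying the actual KV tensors is left to the caller.

```python
class BlockManager:
    """Tracks physical KV block ids only; tensor storage lives elsewhere."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # free list of physical block ids
        self.ref_count = [0] * num_blocks
        self.tables = {}                     # request id -> list of physical block ids

    def _alloc(self):
        if not self.free:
            raise MemoryError("out of KV blocks; the scheduler should preempt")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def allocate(self, rid, num_tokens):
        needed = -(-num_tokens // self.block_size)  # ceiling division
        if needed > len(self.free):
            raise MemoryError("not enough free KV blocks for this request")
        self.tables[rid] = [self._alloc() for _ in range(needed)]

    def fork(self, parent, child):
        """Share every block of `parent` with `child` (prefix sharing)."""
        table = list(self.tables[parent])
        for block in table:
            self.ref_count[block] += 1
        self.tables[child] = table

    def writable_last_block(self, rid):
        """Copy-on-write: return (block, copy_from). copy_from is None if no copy is needed."""
        table = self.tables[rid]
        old = table[-1]
        if self.ref_count[old] == 1:
            return old, None
        new = self._alloc()
        self.ref_count[old] -= 1
        table[-1] = new
        return new, old  # caller must copy KV data from `old` into `new`

    def free_request(self, rid):
        for block in self.tables.pop(rid):
            self.ref_count[block] -= 1
            if self.ref_count[block] == 0:
                self.free.append(block)


bm = BlockManager(num_blocks=8)
bm.allocate("a", num_tokens=40)        # 40 tokens -> 3 blocks of 16
bm.fork("a", "b")                      # "b" shares all three blocks
new_block, copy_from = bm.writable_last_block("b")
print(bm.tables, new_block, copy_from)
```

Keeping this layer free of GPU code makes it easy to unit-test the ref-count lifecycle before wiring it into the model runner.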
Performance Targets
| Metric | Target (Llama-7B BF16, 1× A100 80GB) |
|---|---|
| Throughput @ batch=64, seq=512 in / 256 out | ≥ 2,000 tok/s |
| p50 TTFT @ 4 concurrent users | ≤ 80 ms |
| p95 ITL @ 64 concurrent users | ≤ 50 ms |
| KV memory utilization | ≥ 85% (vs ~40% naive) |
| Spec decoding speedup (draft TinyLlama) | 1.7–2.2× on chat workloads |
| Throughput vs vLLM v0.5 baseline | ≥ 0.5× (within 2×) |
Hitting all of these means you've earned interview signal at any inference team in the industry.
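The KV memory utilization target is easier to reason about with numbers. The arithmetic below is a rough sketch assuming the Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128) and 2-byte BF16 entries. The sequence lengths and the 2048-token preallocation are made-up illustrative values, and the helper names are hypothetical.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    # K and V each store num_kv_heads * head_dim values per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def naive_utilization(seq_lens, max_len):
    # Naive cache: every sequence reserves max_len slots up front.
    return sum(seq_lens) / (len(seq_lens) * max_len)


def paged_utilization(seq_lens, block_size=16):
    # Paged cache: only the last block of each sequence can be partly empty.
    slots = sum(-(-n // block_size) * block_size for n in seq_lens)
    return sum(seq_lens) / slots


per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128, dtype_bytes=2)
per_block = per_token * 16
seq_lens = [700, 900, 300, 1200]  # illustrative only

print(f"KV bytes per token: {per_token} ({per_token / 2**20:.2f} MiB)")
print(f"KV bytes per 16-token block: {per_block / 2**20:.0f} MiB")
print(f"naive utilization: {naive_utilization(seq_lens, max_len=2048):.1%}")
print(f"paged utilization: {paged_utilization(seq_lens):.1%}")
```

Under these assumptions each token costs half a MiB of KV cache, which is why reserving slots for tokens that never arrive is so expensive. A real engine will land below the idealized paged figure because of preemption, in-flight allocation, and reserved headroom.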
Resume Bullet Pattern
Built a production-grade LLM inference engine from scratch implementing PagedAttention, continuous batching, prefix caching, and speculative decoding; achieved 2,400 tok/s throughput on Llama-7B (1× A100) — 0.7× of vLLM v0.5 — with OpenAI-compatible HTTP API and Prometheus observability. [repo + benchmarks]
Interview Talking Points
- PagedAttention math: virtual block tables, physical blocks, why 16-token blocks (compromise between fragmentation and metadata overhead).
- Continuous batching: contrast with static / dynamic batching; how new prefills splice into a running decode batch every step.
- Memory-bound decode: arithmetic intensity, why small batches are wasteful, why FlashAttention-2 helps (IO-aware tiling).
- Prefix caching: copy-on-write semantics, ref-count lifecycle, when it's a 100× speedup (system-prompt-heavy workloads).
- Preemption strategies: swap-to-CPU vs recompute-on-evict; vLLM defaults to recompute for single-sequence requests (re-running prefill is often cheaper than swapping KV blocks over PCIe).
- Speculative decoding: acceptance probability $\alpha$, expected speedup $(1-\alpha^{K+1})/((1-\alpha)(1+c \cdot K))$ where $c$ is draft cost ratio.
- Scheduling fairness: head-of-line blocking, how iteration-level scheduling avoids it.
- Quantization tradeoffs: GPTQ (post-hoc, small calibration set) vs AWQ (activation-aware, slightly better) vs SmoothQuant (W8A8); INT4 perplexity tax is ~1–3%.
- The roofline: when you're compute-bound (prefill, large-batch decode) vs memory-bound (small-batch decode); how to recognize each from `nsys` profiles.
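The speculative-decoding speedup formula from the talking points is worth being able to evaluate on the spot. The sketch below implements it directly, assuming each draft token is accepted independently with probability $\alpha$; the function name and the example values of $\alpha$, $K$, and $c$ are illustrative, not measurements.

```python
def expected_speedup(alpha, k, c):
    """Expected speedup (1 - alpha^(K+1)) / ((1 - alpha) * (1 + c*K)).

    alpha: per-token acceptance probability, 0 <= alpha < 1
    k:     number of draft tokens proposed per step
    c:     cost of one draft forward pass relative to one target forward pass
    """
    if not 0 <= alpha < 1:
        raise ValueError("alpha must be in [0, 1)")
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)  # tokens produced per target pass
    relative_cost = 1 + c * k                               # one target pass plus k draft passes
    return expected_tokens / relative_cost


# Illustrative values only: sweep the draft length K for a fixed alpha and c.
for k in range(1, 9):
    print(k, round(expected_speedup(alpha=0.8, k=k, c=0.15), 3))
```

The numerator saturates as $K$ grows while the denominator keeps growing linearly, which is why tuning $K$ matters: past some draft length, extra draft passes cost more than the tokens they are expected to add.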
Getting Started
- Build Phase-9 lab-01 first end-to-end. You need a working KV cache to extend.
- Add a block manager (no kernel changes yet): split the cache into 16-token blocks; track free list + per-request block table.
- Wire the scheduler: maintain `waiting` and `running` queues; per step, fill the batch up to max-batched-tokens.
- Drop in FlashAttention-2 paged kernel (`flash_attn.flash_attn_with_kvcache`). Verify correctness against your naive path.
- Implement OpenAI-compatible API. Run the `openai-python` SDK against your server with a `base_url` change; must work zero-mods.
- Add prefix caching: hash the prompt prefix in 16-token windows; share blocks via copy-on-write.
- Benchmark vs vLLM: same model, same inputs, same hardware. Document the gap honestly.
- Add speculative decoding. Easiest win: TinyLlama-1.1B drafts for Llama-7B target. Tune K (draft length).
- Add INT4 (GPTQ). Verify quality: perplexity within 5% of BF16 on WikiText.
- Write the report. Plot every optimization's marginal improvement. This is what hiring managers read.
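For the prefix-caching step above, one way to hash the prompt in 16-token windows is to chain the hashes, so that a block's hash identifies the entire prefix up to and including that block. This is a minimal sketch under that assumption: the helper names are hypothetical, SHA-256 over the `repr` of the token list is chosen for simplicity rather than speed, and only full blocks are hashed.

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Return one hash per full block; each hash covers the whole prefix up to that block."""
    hashes = []
    prev = b""
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail block
    for start in range(0, full, block_size):
        chunk = token_ids[start:start + block_size]
        digest = hashlib.sha256(prev + repr(chunk).encode()).digest()
        hashes.append(digest)
        prev = digest  # chain: the next hash depends on everything before it
    return hashes


def shared_prefix_blocks(tokens_a, tokens_b):
    """Count the leading blocks two prompts could share in the KV cache."""
    shared = 0
    for hash_a, hash_b in zip(block_hashes(tokens_a), block_hashes(tokens_b)):
        if hash_a != hash_b:
            break
        shared += 1
    return shared


system_prompt = list(range(40))  # stand-in token ids for a common system prompt
print(shared_prefix_blocks(system_prompt + [100, 101], system_prompt + [200]))
```

In an engine, the hash would be the key into a table of physical block ids, and a hit would bump the block's ref count instead of recomputing its KV entries.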
Stretch Goals
- Multi-GPU: tensor parallelism (Megatron-style) across 2 GPUs.
- Multi-LoRA serving: load N adapters on top of one base; route per request (S-LoRA paper).
- FP8 (Hopper): H100/H200 only, but the highest-leverage modern optimization.
- Chunked prefill: split very long prompts to keep TTFT bounded for other users.
- Disaggregated prefill / decode: separate processes (or GPUs) per phase — the 2024 frontier (DistServe, Mooncake).
- Custom Triton kernel: write your own paged attention from scratch in Triton; benchmark vs FA2.
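The chunked-prefill idea can be prototyped as pure scheduling arithmetic before touching kernels. The sketch below assumes a fixed per-step token budget, with running decodes (one token each) scheduled first and the leftover budget given to one long prompt. `chunk_schedule` and the numbers are hypothetical, and the decode count is held constant for simplicity.

```python
def chunk_schedule(prompt_len, decode_count, token_budget):
    """Split a long prompt into per-step prefill chunks.

    Each step spends `decode_count` tokens of the budget on running decodes,
    so the prefill chunk is whatever budget is left over.
    """
    chunk = token_budget - decode_count
    if chunk <= 0:
        raise ValueError("token budget leaves no room for prefill")
    chunks = []
    remaining = prompt_len
    while remaining > 0:
        take = min(chunk, remaining)
        chunks.append(take)
        remaining -= take
    return chunks


chunks = chunk_schedule(prompt_len=4096, decode_count=64, token_budget=512)
print(len(chunks), chunks)
```

The long prompt now takes several steps to prefill, but every one of those steps still advances the running decodes, which is what keeps ITL bounded for other users.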
What This Capstone Proves About You
You can read the vLLM source code and not feel intimidated — you wrote it. You can debug a serving bottleneck by reading an nsys trace. You can defend every design choice from first principles. You understand the difference between using an inference engine and building one.
This is the single most asked-about portfolio project for Inference Engineer, GPU Performance Engineer, and Foundation Model Infra roles at Anthropic, OpenAI, Mistral, Together, Fireworks, Modal, NVIDIA, and any AI-first startup that runs its own models.