Capstone 05 — Mini-vLLM: Build Your Own Inference Engine

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 3–5 weeks

Real-world parallel: vLLM, NVIDIA TensorRT-LLM, Hugging Face TGI, Together AI's serving stack, Anthropic's internal inference. The single most impactful capstone for Inference Engineer / Performance Engineer roles at frontier labs.

Goals

Build a production-grade LLM inference engine from scratch that can serve a 7B model with throughput within 2× of vLLM on a single GPU. Implement:

PagedAttention — block-based KV cache, no external fragmentation (internal waste is limited to each sequence's last partially filled block), prefix sharing.
Continuous batching — new requests join the running batch at decode-step boundaries.
A scheduler — admission control, priority, preemption, recompute-on-evict.
OpenAI-compatible HTTP API — /v1/chat/completions (streaming + non-streaming).
Speculative decoding — small draft model verified by the target model.
Quantized weights — INT8/INT4 GPTQ or AWQ loader.
Benchmarks — throughput, p50/p95/p99 TTFT and ITL, vs vLLM as a reference.
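As a rough illustration of continuous batching, the sketch below runs one engine iteration: waiting requests are admitted while their prompts fit a per-step token budget, and every request in the batch then advances by one token. The `Request` fields, the `step` function, and the `max_batched_tokens` parameter name are assumptions made for this example, not the vLLM API. The model forward pass is replaced by a counter, and KV-capacity checks and preemption are left out.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are illustrative only."""
    rid: str
    prompt_len: int
    max_new_tokens: int
    generated: int = 0


def step(waiting: deque, running: list, max_batched_tokens: int) -> list:
    """Run one engine iteration and return the ids of requests that finished."""
    # Each already-running request consumes one token of budget for its decode.
    budget = max_batched_tokens - len(running)
    # Admit waiting requests in FIFO order while their whole prompt still fits.
    while waiting and waiting[0].prompt_len <= budget:
        req = waiting.popleft()
        budget -= req.prompt_len
        running.append(req)
    # Stand-in for the forward pass: a prefill emits the first token and a
    # decode emits the next one, so every request in the batch gains a token.
    for req in running:
        req.generated += 1
    finished = [r.rid for r in running if r.generated >= r.max_new_tokens]
    # Finished requests leave at the step boundary, freeing room immediately.
    running[:] = [r for r in running if r.generated < r.max_new_tokens]
    return finished
```

Note that a prompt longer than the budget is never admitted in this sketch and blocks everything queued behind it; a real scheduler needs chunked prefill or a rejection path for that case.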

Architecture

                ┌──────────────────────────────────────────────┐
                │ HTTP Server (FastAPI / uvicorn)              │
                │  - OpenAI-compatible /v1/chat/completions    │
                │  - SSE streaming                             │
                │  - Request validation, auth, rate-limit      │
                └─────────────────────┬────────────────────────┘
                                      ▼
                ┌──────────────────────────────────────────────┐
                │ Scheduler (the brain)                        │
                │  - Waiting / Running / Swapped queues        │
                │  - Per-step: prefill batch + decode batch    │
                │  - Preemption + recompute on cache pressure  │
                │  - Prefix-cache lookup                       │
                └─────────────────────┬────────────────────────┘
                                      ▼
   ┌──────────────────────────────────────────────────────────────────┐
   │ Model Runner                                                     │
   │  ┌──────────────────┐  ┌────────────────────┐  ┌──────────────┐ │
   │  │ Block Manager    │  │ Paged KV Cache     │  │ Sampler      │ │
   │  │  - free list     │  │  - phys blocks: 16 │  │  - greedy    │ │
   │  │  - block table   │  │    tokens each     │  │  - top-k/p   │ │
   │  │  - ref counts    │  │  - per-layer K, V  │  │  - temp      │ │
   │  │  - copy-on-write │  │  - INT8 optional   │  │  - logit bias│ │
   │  └──────────────────┘  └────────────────────┘  └──────────────┘ │
   │                                                                   │
   │  ┌────────────────────────────────────────────────────────────┐  │
   │  │ Forward (custom CUDA / FlashAttention-2 + paged attention) │  │
   │  │  - Prefill kernel (compute-bound, big tile)                │  │
   │  │  - Decode kernel (memory-bound, small batch)               │  │
   │  └────────────────────────────────────────────────────────────┘  │
   └──────────────────────────────────────────────────────────────────┘
                                      │
                                      ▼
   ┌──────────────────────────────────────────────────────────────────┐
   │ Speculative Decoding (optional layer)                            │
   │  - Draft model proposes K tokens                                 │
   │  - Target model verifies in one parallel pass                    │
   │  - Accept longest matching prefix                                │
   └──────────────────────────────────────────────────────────────────┘

  Observability: per-request trace, /metrics (Prometheus), GPU utilization
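The "accept longest matching prefix" step in the speculative decoding layer can be sketched for the greedy case as follows. This assumes greedy decoding on both models; with sampling, the usual approach is a rejection-sampling acceptance rule, which is not shown here. The function name `accept_greedy` and its argument layout are hypothetical.

```python
def accept_greedy(draft_tokens: list[int], target_tokens: list[int]) -> list[int]:
    """Return the tokens to emit for one speculative step (greedy decoding).

    draft_tokens:  the K tokens proposed by the draft model.
    target_tokens: K + 1 greedy picks from the target model's single parallel
                   pass, where target_tokens[i] is its choice given the
                   context followed by draft_tokens[:i].
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target must score K + 1 positions")
    emitted = []
    for drafted, verified in zip(draft_tokens, target_tokens):
        if drafted != verified:
            # First mismatch: emit the target's own token and stop.
            emitted.append(verified)
            return emitted
        emitted.append(drafted)
    # Every draft token matched, so the extra target position is a bonus token.
    emitted.append(target_tokens[-1])
    return emitted
```

Each call emits between 1 and K + 1 tokens, and the output is identical to what the target model would have produced on its own under greedy decoding.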

Suggested Stack

Component	Choice	Why
Language	Python + CUDA (or Triton)	Mirrors vLLM's stack
Model loader	safetensors + Hugging Face configs	Industry standard
Attention kernel	`flash-attn` (paged) or your own Triton kernel	FA2 is the realistic choice
Quantization	GPTQ (`auto-gptq`) or AWQ (`autoawq`)	Both common
Draft model	TinyLlama-1.1B for Llama-7B target	5–7× smaller is the sweet spot
HTTP	FastAPI + uvicorn (or aiohttp)	OpenAI-compatible bindings already exist
Metrics	`prometheus_client`	Standard for serving infra
Reference	vLLM v0.5+ for benchmarking	The bar

Deliverables Checklist

engine/block_manager.py — physical block pool, allocate/free, ref-counts, copy-on-write for prefix sharing
engine/paged_attention.py — paged attention forward (prefill + decode kernels via FA2 or Triton)
engine/scheduler.py — request queues, batching policy, preemption, prefix-cache hits
engine/model_runner.py — Llama / Qwen forward with paged KV
engine/sampler.py — greedy, top-k, top-p, temperature, repetition penalty, logit bias
engine/spec_decode.py — draft + verify with longest-prefix accept
server/api.py — FastAPI OpenAI-compatible endpoints (chat, completions, models, health)
server/streaming.py — SSE token streaming
bench/throughput.py — sweep batch sizes, sequence lengths; output CSV + plot
bench/latency.py — p50/p95/p99 TTFT + ITL under concurrent load
bench/vs_vllm.md — head-to-head comparison report
Dockerfile + docker-compose.yml — one-command deploy
ARCHITECTURE.md — block diagram + scheduler state machine
WRITEUP.md — what each optimization bought you (in numbers)
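For the first deliverable, a minimal block manager might look like the sketch below. It tracks only block ids and reference counts; the actual K/V tensors and the per-request block tables live elsewhere. The class and method names (`BlockManager`, `share`, `writable`) are illustrative assumptions, not vLLM's interface.

```python
class BlockManager:
    """Toy physical-block pool with reference counts (names are illustrative)."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # ids of unused physical blocks
        self.ref_count = [0] * num_blocks    # sequences referencing each block

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks; the scheduler should preempt")
        block = self.free.pop()
        self.ref_count[block] = 1
        return block

    def share(self, block: int) -> int:
        """Let another sequence reference an existing block (prefix sharing)."""
        self.ref_count[block] += 1
        return block

    def release(self, block: int) -> None:
        """Drop one reference; return the block to the pool when unused."""
        self.ref_count[block] -= 1
        if self.ref_count[block] == 0:
            self.free.append(block)

    def writable(self, block: int) -> tuple[int, bool]:
        """Return (block_to_write, needs_copy), applying copy-on-write."""
        if self.ref_count[block] == 1:
            return block, False  # sole owner can write in place
        # Shared block: hand the writer a private block and drop its old
        # reference. The caller must copy the KV data across before writing.
        new_block = self.allocate()
        self.ref_count[block] -= 1
        return new_block, True
```

A sequence calls `writable` on its last block before appending a token; when `needs_copy` is true, it updates its block table to point at the new block after the copy.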

Performance Targets

Metric	Target (Llama-7B BF16, 1× A100 80GB)
Throughput @ batch=64, seq=512 in / 256 out	≥ 2,000 tok/s
p50 TTFT @ 4 concurrent users	≤ 80 ms
p95 ITL @ 64 concurrent users	≤ 50 ms
KV memory utilization	≥ 85% (vs ~40% naive)
Spec decoding speedup (draft TinyLlama)	1.7–2.2× on chat workloads
Throughput vs vLLM v0.5 baseline	≥ 0.5× (within 2×)

Hitting all of these targets gives you concrete, benchmarked results to discuss with any inference team.
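To see why the KV memory utilization target matters, it helps to work out the cache size for the throughput workload. The arithmetic below assumes a Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128) with a BF16 cache; check your model's config for the actual values. The variable names are illustrative.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    # Factor of 2: one K vector and one V vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed Llama-2-7B-style shape with a 2-byte (BF16) cache.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=32,
                               head_dim=128, dtype_bytes=2)
block_bytes = 16 * per_token              # one 16-token physical block
workload_tokens = 64 * (512 + 256)        # batch=64, 512 in + 256 out
workload_gib = workload_tokens * per_token / 2**30

print(f"KV per token : {per_token / 2**10:.0f} KiB")
print(f"KV per block : {block_bytes / 2**20:.0f} MiB")
print(f"Workload KV  : {workload_gib:.1f} GiB for {workload_tokens} tokens")
```

Under these assumptions each token costs 512 KiB of cache and the full batch needs 24 GiB at the end of generation, so any memory lost to fragmentation translates directly into fewer concurrent sequences.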

Resume Bullet Pattern

Built a production-grade LLM inference engine from scratch implementing PagedAttention, continuous batching, prefix caching, and speculative decoding; achieved 2,400 tok/s throughput on Llama-7B (1× A100) — 0.7× of vLLM v0.5 — with OpenAI-compatible HTTP API and Prometheus observability. [repo + benchmarks]

Interview Talking Points

PagedAttention math: virtual block tables, physical blocks, why 16-token blocks (compromise between fragmentation and metadata overhead).
Continuous batching: contrast with static / dynamic batching; how new prefills splice into a running decode batch every step.
Memory-bound decode: arithmetic intensity, why small batches leave compute idle (every step re-reads all weights from HBM regardless of batch size), why FlashAttention-2 helps (IO-aware tiling).
Prefix caching: copy-on-write semantics, ref-count lifecycle, when it pays off most (system-prompt-heavy workloads, where most prefill work is skipped).
Preemption strategies: swap-to-CPU vs recompute-on-evict; vLLM defaults to recompute for single-sequence requests (often cheaper than moving KV blocks over PCIe).
Speculative decoding: per-token acceptance probability $\alpha$, expected speedup $(1-\alpha^{K+1})/((1-\alpha)(1+c \cdot K))$ where $K$ is the draft length and $c$ is the draft-to-target cost ratio per forward pass.
Scheduling fairness: head-of-line blocking, how iteration-level scheduling avoids it.
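Quantization tradeoffs: GPTQ (post-training, layer-wise error compensation with a small calibration set) vs AWQ (activation-aware scaling, often slightly better) vs SmoothQuant (W8A8); INT4 perplexity tax is typically ~1–3%, but measure it on your own model.
The roofline: when you're compute-bound (prefill, large-batch decode) vs memory-bound (small-batch decode); how to recognize each regime from Nsight Systems (nsys) timelines and Nsight Compute (ncu) roofline charts.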
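The speculative decoding speedup formula above is easy to evaluate numerically, which helps when choosing a draft length. The sketch below assumes that each draft token is accepted independently with probability `alpha`, which is the simplification behind the formula; real acceptance rates vary by position and workload. The function names are illustrative.

```python
def expected_speedup(alpha: float, k: int, c: float) -> float:
    """Expected speedup for draft length k, acceptance rate alpha, and
    draft-to-target cost ratio c, under i.i.d. acceptance."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if alpha == 1.0:
        expected_tokens = k + 1  # limit of the geometric sum
    else:
        expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    # One target pass plus k draft passes, in units of a target pass.
    return expected_tokens / (1 + c * k)


def best_draft_length(alpha: float, c: float, max_k: int = 16) -> int:
    """Draft length in 1..max_k that maximizes the expected speedup."""
    return max(range(1, max_k + 1), key=lambda k: expected_speedup(alpha, k, c))


# Example values; alpha = 0.8 and c = 0.15 are assumptions, not measurements.
print(round(expected_speedup(0.8, 4, 0.15), 3))
print(best_draft_length(0.8, 0.15))
```

With an acceptance rate near zero the result drops below 1, which is the case where speculation makes serving slower and should be disabled.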

Getting Started

Build Phase-9 lab-01 first end-to-end. You need a working KV cache to extend.
Add a block manager (no kernel changes yet): split the cache into 16-token blocks; track free list + per-request block table.
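Wire the scheduler: maintain waiting and running queues; each step, fill the batch up to the max-batched-tokens budget.
Drop in the FlashAttention-2 paged kernel (flash_attn.flash_attn_with_kvcache with a block_table argument). Verify correctness against your naive path. Check which page sizes your installed flash-attn version supports: some upstream releases require the paged block size to be a multiple of 256, which conflicts with 16-token blocks.
Implement the OpenAI-compatible API. Point the openai-python SDK at your server by changing only base_url; existing client code must work with no other modifications.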
Add prefix caching: hash the prompt prefix in 16-token windows; share blocks via copy-on-write.
Benchmark vs vLLM: same model, same inputs, same hardware. Document the gap honestly.
Add speculative decoding. Easiest win: TinyLlama-1.1B drafts for Llama-7B target. Tune K (draft length).
Add INT4 (GPTQ). Verify quality: perplexity within 5% of BF16 on WikiText.
Write the report. Plot every optimization's marginal improvement. This is what hiring managers read.
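For the prefix-caching step, one possible way to hash the prompt in 16-token windows is to chain each block's hash with the hash of the block before it, so that a block's key identifies the entire prefix and not just its own 16 tokens. The sketch below assumes SHA-256 over a simple text encoding of the token ids; the function names and the encoding are illustrative choices, not a prescribed format.

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one chained hash per full block of the prompt.

    The trailing partial block is skipped: it is still being filled, so it
    cannot be shared yet.
    """
    hashes = []
    parent = ""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        chunk = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, chunk))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def shared_block_count(a: list[int], b: list[int]) -> int:
    """Number of leading blocks two prompts could share in the cache."""
    count = 0
    for ha, hb in zip(block_hashes(a), block_hashes(b)):
        if ha != hb:
            break
        count += 1
    return count
```

In an engine, these hashes would key a table from hash to physical block id; a hit bumps the block's reference count instead of recomputing its KV entries.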

Stretch Goals

Multi-GPU: tensor parallelism (Megatron-style) across 2 GPUs.
Multi-LoRA serving: load N adapters on top of one base; route per request (S-LoRA paper).
FP8: requires FP8 tensor cores (Hopper H100/H200 or Ada-generation GPUs; not available on A100), but among the highest-leverage modern optimizations.
Chunked prefill: split very long prompts to keep TTFT bounded for other users.
Disaggregated prefill / decode: separate processes (or GPUs) per phase — the 2024 frontier (DistServe, Mooncake).
Custom Triton kernel: write your own paged attention from scratch in Triton; benchmark vs FA2.
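As a starting point for the chunked-prefill stretch goal, the sketch below plans how one long prompt would be split across engine steps while a fixed number of other requests keep decoding. It assumes a per-step token budget shared between decodes and prefill, with decodes served first; `plan_prefill_chunks` and its parameters are hypothetical names.

```python
def plan_prefill_chunks(prompt_len: int, num_decoding: int,
                        max_batched_tokens: int) -> list[tuple[int, int]]:
    """Return [start, end) token ranges of the prompt to prefill in each step.

    Every step reserves one token of budget per decoding request, and the
    long prompt takes whatever budget is left.
    """
    budget = max_batched_tokens - num_decoding
    if budget <= 0:
        raise ValueError("no budget left for prefill after decodes")
    chunks = []
    done = 0
    while done < prompt_len:
        take = min(budget, prompt_len - done)
        chunks.append((done, done + take))
        done += take
    return chunks


# Example: a 1000-token prompt, 64 active decodes, 512-token step budget.
print(plan_prefill_chunks(1000, 64, 512))
```

The other users see a bounded per-step cost instead of one long stall, at the price of a higher TTFT for the long prompt itself.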

What This Capstone Proves About You

You can read the vLLM source code and not feel intimidated — you wrote it. You can debug a serving bottleneck by reading an nsys trace. You can defend every design choice from first principles. You understand the difference between using an inference engine and building one.

This is the single most asked-about portfolio project for Inference Engineer, GPU Performance Engineer, and Foundation Model Infra roles at Anthropic, OpenAI, Mistral, Together, Fireworks, Modal, NVIDIA, and any AI-first startup that runs its own models.
