🛸 Hitchhiker's Guide — Phase 11: Capstone

Read this if: You finished Phases 1–10 and now need to prove to a hiring committee (in 60 seconds, in a one-page README, and in a 30-minute deep-dive interview) that you actually understand all of it. The capstone is the artifact you'll point to for the next 5 years of your career.

0. The 30-second mental model

A capstone project is not a tutorial reproduction. It's a complete system that:

Uses every layer of the stack you learned (data → train/fine-tune → eval → serve) end-to-end.
Has measurable, defensible numbers — throughput, perplexity, eval scores, latency percentiles — that you can cite in any interview.
Is shippable: someone clones the repo, runs make, and gets a working system.
Tells a story: the README opens with a clear problem, your tradeoffs, your numbers, and one architectural diagram.
Is honestly yours — when interviewers grill you on a design choice, you can defend every line.

By the end of Phase 11 you should have:

Picked one capstone path and shipped it.
A README.md that earns "let's interview them" from a senior+ AI engineer in <2 minutes of reading.
A 1-paragraph version, a 1-page version, and a 30-minute deep-dive version of the project, all rehearsed.

1. The four canonical capstone paths

Pick one. Don't try two. A finished single project crushes two half-baked ones.

Path A — "I built a 1B-parameter LLM from scratch"

The Karpathy-disciple play. Highest compounding learning, biggest interview impression because almost nobody has done it.

Scope:

Data: 50–100GB filtered text (your Phase 10 pipeline output).
Model: ~350M to 1B params, GQA, RoPE, SwiGLU, RMSNorm, weight-tied LM head.
Train: 50–200B tokens with WSD or cosine schedule, BF16, FSDP across 4–8 GPUs.
Eval: lm-evaluation-harness on HellaSwag, ARC-easy, PIQA, WinoGrande. Compare to Pythia at matched param count.
Serve: vLLM-compatible weights export.
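
Before you rent a single GPU, it helps to check that your config actually lands in the 350M to 1B range. The sketch below counts parameters for the architecture listed above (GQA, SwiGLU, RMSNorm, weight-tied LM head; RoPE adds no parameters). It assumes no bias terms, and the ModelConfig fields and example values are illustrative choices, not a prescribed configuration.

```python
from dataclasses import dataclass


@dataclass
class ModelConfig:
    vocab_size: int
    d_model: int
    n_layers: int
    n_heads: int
    n_kv_heads: int          # GQA: fewer key/value heads than query heads
    d_ff: int                # SwiGLU hidden width
    tied_embeddings: bool = True


def count_params(cfg: ModelConfig) -> int:
    """Parameter count for a bias-free decoder-only transformer (assumption: no biases)."""
    head_dim = cfg.d_model // cfg.n_heads
    # Q and output projections are d_model x d_model;
    # K and V shrink to d_model x (n_kv_heads * head_dim) under GQA.
    attn = 2 * cfg.d_model * cfg.d_model + 2 * cfg.d_model * cfg.n_kv_heads * head_dim
    # SwiGLU uses three matrices: gate, up, and down.
    mlp = 3 * cfg.d_model * cfg.d_ff
    # Two RMSNorm gain vectors per block.
    norms = 2 * cfg.d_model
    embed = cfg.vocab_size * cfg.d_model
    lm_head = 0 if cfg.tied_embeddings else embed
    final_norm = cfg.d_model
    return cfg.n_layers * (attn + mlp + norms) + embed + lm_head + final_norm


if __name__ == "__main__":
    # Illustrative values only.
    cfg = ModelConfig(vocab_size=32_000, d_model=2048, n_layers=16,
                      n_heads=16, n_kv_heads=4, d_ff=5632)
    print(f"{count_params(cfg) / 1e9:.2f}B parameters")
```

Put the resulting count in your README next to the Pythia size you compare against, so "matched param count" is a number and not a claim.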

Realistic compute: ~$2–8k of cloud compute (8× A100/H100 spot for ~3–7 days). Or use the Together / Lambda / Vast.ai discount tracks. Document this honestly — most reviewers respect the cost discipline.
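
A back-of-envelope estimate makes the cost conversation concrete. The sketch below uses the common approximation that training a dense transformer costs about 6 × parameters × tokens FLOPs. The 40% model FLOPs utilization (MFU) and the $2 per GPU-hour price are assumptions you should replace with your own measurements and quotes; 312 TFLOP/s is the A100's dense BF16 peak.

```python
def training_days(n_params: float, n_tokens: float, n_gpus: int,
                  peak_flops_per_gpu: float, mfu: float) -> float:
    """Wall-clock days for one training run, using FLOPs ~= 6 * N * D."""
    total_flops = 6 * n_params * n_tokens
    cluster_flops_per_s = n_gpus * peak_flops_per_gpu * mfu
    return total_flops / cluster_flops_per_s / 86_400


def training_cost_usd(days: float, n_gpus: int, usd_per_gpu_hour: float) -> float:
    """Rental cost for the run; excludes storage, egress, and failed attempts."""
    return days * 24 * n_gpus * usd_per_gpu_hour


if __name__ == "__main__":
    # Assumed inputs: 1B params, 50B tokens, 8 GPUs, 40% MFU, $2/GPU-hour.
    days = training_days(1e9, 50e9, n_gpus=8, peak_flops_per_gpu=312e12, mfu=0.40)
    cost = training_cost_usd(days, n_gpus=8, usd_per_gpu_hour=2.0)
    print(f"{days:.1f} days, about ${cost:,.0f}")
```

This covers only one clean run. Restarts, ablations, and evaluation passes add to it, so budget headroom above whatever this prints.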

What stands out: matching or beating a published model at equal compute. Reproducing a known result (e.g., Pythia-410M's HellaSwag) within 1% is enough.

Path B — "I built a production-grade inference gateway"

The systems engineer play. Safest, most legibly valuable to product teams.

Scope:

Frontend: OpenAI-compatible HTTP/SSE endpoint (/v1/chat/completions, /v1/completions, /v1/embeddings).
Backend: vLLM (or your own KV-cache server from Phase 9).
Features: continuous batching (observed and measured at the backend), prefix caching, multi-replica routing with prefix-aware load balancing, per-tenant rate limiting, structured-output (JSON-schema) constrained decoding.
Observability: Prometheus metrics, latency histograms (TTFT, ITL, total), GPU utilization, prefix-cache hit rate.
Eval: published throughput numbers (req/sec, tokens/sec) at multiple QPS; latency percentiles.
Stretch: K8s manifests, autoscaling, blue/green deploy.
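
The three latency numbers named above (TTFT, ITL, total) all fall out of per-token arrival timestamps. Here is a minimal sketch of how you might derive them and summarize with percentiles. The function names are hypothetical, and the nearest-rank percentile is one of several valid definitions, so state which one you used when you publish numbers.

```python
import math
from typing import List, Sequence, Tuple


def request_latencies(sent_at: float,
                      token_times: Sequence[float]) -> Tuple[float, List[float], float]:
    """Return (ttft, inter_token_latencies, total) for one streamed response.

    sent_at and token_times are timestamps in seconds from the same clock.
    """
    if not token_times:
        raise ValueError("response produced no tokens")
    ttft = token_times[0] - sent_at                      # time to first token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]
    total = token_times[-1] - sent_at
    return ttft, itl, total


def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


if __name__ == "__main__":
    ttft, itl, total = request_latencies(0.0, [0.5, 0.6, 0.8])
    print(f"ttft={ttft:.3f}s total={total:.3f}s max_itl={max(itl):.3f}s")
```

In a real gateway you would feed the same values into Prometheus histograms; the offline version is still worth keeping because it lets you recompute percentiles from raw logs when someone questions a number.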

Realistic compute: 1× cheap GPU (4090, A10) for the demo. A load simulator drives production-like traffic.

What stands out: real benchmark numbers for your gateway vs naive model.generate(), with a graph showing the throughput cliff being smoothed by continuous batching.
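
To produce that graph you need a load generator that holds the request rate fixed while you sweep QPS. The following is a rough sketch of an open-loop generator using asyncio. The send callable and fake_backend are placeholders (assumptions, not part of any library); in practice send would be an HTTP call to your gateway or a wrapper around the naive baseline.

```python
import asyncio
import time
from typing import Awaitable, Callable, List


async def run_load(send: Callable[[int], Awaitable[None]],
                   qps: float, duration_s: float) -> List[float]:
    """Fire requests on a fixed schedule; return per-request latencies in seconds."""
    latencies: List[float] = []

    async def timed(i: int) -> None:
        start = time.perf_counter()
        await send(i)
        latencies.append(time.perf_counter() - start)

    n_requests = round(qps * duration_s)
    t0 = time.perf_counter()
    tasks = []
    for i in range(n_requests):
        # Open loop: request i is due at t0 + i/qps whether or not earlier ones finished.
        delay = t0 + i / qps - time.perf_counter()
        if delay > 0:
            await asyncio.sleep(delay)
        tasks.append(asyncio.create_task(timed(i)))
    await asyncio.gather(*tasks)
    return latencies


async def fake_backend(i: int) -> None:
    # Stand-in for a real request; swap in your gateway client here.
    await asyncio.sleep(0.005)


if __name__ == "__main__":
    for qps in (50, 100, 200):
        lat = asyncio.run(run_load(fake_backend, qps=qps, duration_s=0.5))
        print(f"qps={qps} requests={len(lat)} mean_latency_s={sum(lat) / len(lat):.4f}")
```

Open-loop matters here: a closed-loop client that waits for each response before sending the next one slows down with the server and hides exactly the overload behavior you are trying to show.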

Path C — "I built a fine-tuning + serving platform"

The MLOps play. Useful for staff/principal roles.

Scope:

UI / CLI to upload (prompt, response) JSONL.
Backend: queues a QLoRA job on a GPU pool; monitors loss; saves checkpoints.
Eval gate: runs MT-Bench-style LLM-judge eval after each checkpoint; promotes best.
Serve: hot-swap LoRA adapters per tenant; serve from a single base model.
Observability + cost accounting per tenant.
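
The eval gate is the part interviewers will ask about, so make the promotion rule explicit in code. Below is one possible rule, sketched under assumptions: each checkpoint carries a single mean LLM-judge score (higher is better), and a candidate must beat the currently served checkpoint by a margin before it is promoted. The Checkpoint type, pick_promotion name, and min_gain value are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Checkpoint:
    name: str
    judge_score: float  # mean LLM-judge score on a held-out prompt set


def pick_promotion(candidates: Sequence[Checkpoint],
                   incumbent: Optional[Checkpoint],
                   min_gain: float = 0.1) -> Optional[Checkpoint]:
    """Return the checkpoint to promote, or None to keep serving the incumbent."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c.judge_score)
    # Require a margin so judge noise alone cannot trigger a swap.
    if incumbent is None or best.judge_score >= incumbent.judge_score + min_gain:
        return best
    return None


if __name__ == "__main__":
    incumbent = Checkpoint("step-1000", 7.0)
    candidates = [Checkpoint("step-2000", 7.05), Checkpoint("step-3000", 7.4)]
    print(pick_promotion(candidates, incumbent))
```

Whatever margin you choose, log the decision together with both scores. That log is the "eval-gate decision" you will want to show in the README.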

Realistic compute: 1× A100/H100 (rented per session).

What stands out: showing a complete, documented loop including the eval-gate decision and a per-tenant cost report.

Path D — "I built a real RAG product"

The applied-AI / startup play. Easiest to demo to non-technical interviewers.

Scope:

Ingestion: real corpus (your company's docs, a Wikipedia subset, arXiv abstracts).
Pipeline: structural chunker → embed (BGE / E5) → Qdrant.
Retrieval: BM25 + dense + RRF + cross-encoder reranker.
Generation: streaming SSE with citations.
Eval: RAGAS suite on a 100-item golden set; published numbers.
Frontend: a real React/Next.js UI (3 hours of work, hugely improves demo).
Stretch: agent loop with tool calling (search + calculator + code-exec).
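
The RRF step in the retrieval line is small enough to write yourself, and being able to explain it is worth more than importing it. A minimal sketch of reciprocal rank fusion follows: each document scores the sum of 1 / (k + rank) over the rankings it appears in. The constant k = 60 is the value commonly used since the original RRF paper; the function name and the alphabetical tie-break are illustrative choices.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]    # lexical ranking
    dense_hits = ["b", "c", "d"]   # embedding ranking
    print(rrf_fuse([bm25_hits, dense_hits]))
```

The fused list is what you would hand to the cross-encoder reranker. Because RRF only looks at ranks, you never have to calibrate BM25 scores against cosine similarities.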

Realistic compute: about $0 in GPU spend (CPU embed-then-cache, plus a small LLM via the Together API or Anthropic API, billed per token).

What stands out: actual user-quality demos, RAGAS deltas before/after each pipeline addition (e.g., "+5.2% faithfulness from adding the cross-encoder").

2. Picking your path

If you want to interview at...	Pick
Frontier lab research (Anthropic, OpenAI, DeepMind, Meta FAIR)	A or B
Inference startup (Together, Anyscale, Anthropic engineering)	B
Hyperscaler ML platform team (Google, AWS, Azure ML)	C
Applied AI / startup engineer	D (with B as supporting work)
Hedge fund / quant (LLM tooling teams)	B or C

If you can't decide: Path B. It's the broadest, the most economically valuable, and the one with the lowest risk of "infinite scope" failure.

3. The README — your single most important deliverable

A great capstone README is 3–5 pages, in this order:

One-line description: "A vLLM-compatible inference gateway with continuous batching and prefix-aware routing achieving 4.7× the throughput of naive serving on a single A100."
30-second video / GIF demo (Loom screencast or asciicast).
Architecture diagram: hand-drawn or Excalidraw is fine; it must fit on one screen and be readable at a glance.
Quickstart: 5 lines of bash that get a reviewer running locally or in the cloud.
Numbers: a table of the headline benchmark, with conditions documented.
What was hard: 2–3 paragraphs of "the bug that took me a week".
What I'd do next: 1 paragraph showing direction.
Tech stack + References to papers/repos that informed the design.

Common mistakes:

❌ A wall of feature bullets with no metrics.
❌ A "todo" list at the bottom that screams "unfinished".
❌ Placeholder Lorem ipsum or unfilled template sections.
❌ No way for a reviewer to actually run it.
❌ No mention of cost or compute used.

4. The 60-second pitch

Memorize this. Practice it out loud.

"I built X — [one sentence]. The technical challenge was Y — [one sentence on the core constraint]. My approach was Z — [one sentence on the key design choice]. The numbers came out at N — [one sentence with a concrete metric]. The thing I'm proudest of is W — [one sentence showing technical depth]."

Example, Path B:

"I built an OpenAI-compatible inference gateway on top of vLLM that adds prefix-aware routing across replicas. The challenge was that naive round-robin breaks vLLM's prefix cache, hurting throughput on chat workloads. My approach was a stateful router that hashes the system-prompt prefix and pins requests to the same backend. On a 4-replica setup serving Llama-3-8B at 50 QPS, this raised the prefix-cache hit rate from 8% to 71%, lowering p99 TTFT from 1.4s to 290ms. The thing I'm proudest of is the load-balancing tie-breaker that prevents one replica from becoming a hotspot when many users share the same prompt — I documented this with a load-imbalance metric and a chaos test."

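If you give a pitch like the one above, expect to be asked to whiteboard the router. The following is a sketch of one way such a router could work, not the implementation behind the example numbers. It assumes rendezvous (highest-random-weight) hashing to pin a system prompt to a replica, plus a simple tie-breaker that spills to the least-loaded replica when the pinned one gets too far ahead. PrefixRouter, max_imbalance, and the replica names are hypothetical.

```python
import hashlib
from typing import Dict, List


class PrefixRouter:
    def __init__(self, replicas: List[str], max_imbalance: int = 8):
        if not replicas:
            raise ValueError("need at least one replica")
        self.replicas = list(replicas)
        self.max_imbalance = max_imbalance
        self.in_flight: Dict[str, int] = {r: 0 for r in self.replicas}

    @staticmethod
    def _weight(prefix: str, replica: str) -> int:
        # Stable across processes, unlike Python's built-in hash().
        digest = hashlib.sha256(f"{replica}\x00{prefix}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def preferred(self, system_prompt: str) -> str:
        """Replica that owns this prefix under rendezvous hashing."""
        return max(self.replicas, key=lambda r: self._weight(system_prompt, r))

    def route(self, system_prompt: str) -> str:
        target = self.preferred(system_prompt)
        least = min(self.replicas, key=lambda r: self.in_flight[r])
        # Tie-breaker: give up cache locality once the pinned replica is a hotspot.
        if self.in_flight[target] - self.in_flight[least] > self.max_imbalance:
            target = least
        self.in_flight[target] += 1
        return target

    def done(self, replica: str) -> None:
        """Call when a request finishes so load tracking stays accurate."""
        self.in_flight[replica] = max(0, self.in_flight[replica] - 1)


if __name__ == "__main__":
    router = PrefixRouter(["replica-0", "replica-1", "replica-2", "replica-3"])
    print(router.route("You are a helpful assistant."))
```

The design choice to defend is max_imbalance: set it too low and you throw away cache hits, too high and one popular prompt overloads a single replica.
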
5. The 30-minute deep-dive interview

What a senior+ engineer will probe:

Why this design and not the alternative? Have a defensible reason for every choice. ("I picked Qdrant because it has payload filtering and is easier to operate than Vespa for a one-person project.")
Where does it fail? Be honest about limitations. Show you thought about edge cases.
What numbers can you cite? Have your benchmark methodology memorized. Be ready to discuss conditions, statistical noise, error bars.
Walk me through the most interesting bug. This is the one question every senior+ asks. Have a great answer rehearsed.
How does this scale to 100×? Be ready to discuss what would break first (memory, communication, compute-communication overlap, observability, on-call burden).
What's the next thing you'd add? Show product/engineering judgment, not just feature lust.
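
When the conversation turns to statistical noise and error bars, it helps to have actually computed some. Here is a minimal sketch of a percentile bootstrap confidence interval for a mean, using only the standard library. The resample count, confidence level, and seed are arbitrary defaults, and the sample values in the demo are made up for illustration.

```python
import random
import statistics
from typing import Sequence, Tuple


def bootstrap_ci(samples: Sequence[float], n_resamples: int = 2000,
                 confidence: float = 0.95, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    rng = random.Random(seed)            # fixed seed so the interval is reproducible
    data = list(samples)
    n = len(data)
    # Resample with replacement and record the mean of each resample.
    means = sorted(statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples))
    alpha = (1 - confidence) / 2
    lower = means[int(alpha * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha) * n_resamples))]
    return statistics.fmean(data), lower, upper


if __name__ == "__main__":
    # Made-up throughput measurements (tokens/sec) from repeated runs.
    runs = [1480.0, 1512.0, 1495.0, 1530.0, 1467.0, 1501.0]
    mean, lo, hi = bootstrap_ci(runs)
    print(f"mean={mean:.0f} tok/s, 95% CI [{lo:.0f}, {hi:.0f}]")
```

Quoting a number as "mean with a 95% interval over N runs" is usually enough to end the methodology line of questioning on a good note.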

6. Your weekly cadence to a finished capstone

This is intense. Compress as needed.

Week	Goal
1	Pick path. Write README skeleton (yes, write it before coding). Ship a "hello world" version that does the smallest end-to-end thing.
2	Replace placeholders with real components. Get one real query through the whole pipeline.
3	Add the metric harness. Capture initial numbers (they will be bad — that's fine).
4	Optimize the biggest bottleneck. Document before/after numbers.
5	Add the second-biggest improvement. Document.
6	Eval gate, observability, ops polish.
7	Write up README; record demo; rehearse the 60s pitch and the 30-min deep dive with a friend.

7. References for the capstone meta-skill

Karpathy's nanoGPT — the gold standard for "small but complete" LLM projects.
vLLM's project README — gold standard for inference systems README.
Anthropic's blog on building with LLMs — for the prose style of "this is the system, here are the choices, here are the numbers".
Designing Data-Intensive Applications (Kleppmann) — for systems vocabulary you'll be expected to use.
The Pragmatic Programmer — for shipping discipline.
Will Larson's Staff Engineer book — for the storytelling that promotion / staff+ roles demand.
Cal Newport, Deep Work — the meta-skill of doing seven weeks of high-focus output.

8. Common interview questions about your capstone

Walk me through your project end-to-end in 5 minutes.
What's the single biggest design choice you made and why?
Tell me about the hardest bug you fixed.
What numbers did you measure, and how did you measure them rigorously?
If you had 10× the budget, what would you change?
Where does your system fail?
How would you scale this to 1000× the load?
If a junior engineer joined you, what's the first thing you'd hand off?
Compare your approach to [vLLM / LangChain / nanoGPT / etc.]. Why didn't you just use that?
In hindsight, what would you do differently?

9. From solid → exceptional capstones

Open-source it with a permissive license, real CI, real tests, real issues, real PRs.
Write a blog post explaining the most interesting technical choice. Submit it to Hacker News or Reddit's r/LocalLLaMA. A few hundred upvotes is portfolio-defining.
Reproduce a known number: nanoGPT's GPT-2 124M perplexity on OpenWebText, vLLM's published throughput on Llama-3-8B, the Llama paper's HellaSwag. Match within 5%. Cite both your number and the reference number.
Write a one-page architecture decision record (ADR) for each major choice. Hiring managers love these.
Cross-link with the rest of the curriculum: the README should reference the system-design walkthroughs and interview-prep cheatsheets you wrote.
Have a public, working demo URL. Even a $5/month VPS with auth-gated access counts.
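
For the reproduction item above, a tiny helper keeps you honest about what "within 5%" means and produces the line that cites both numbers. This is a sketch assuming relative error against the reference value; the function names are hypothetical, and the demo values are placeholders, not real benchmark results.

```python
def relative_gap(measured: float, reference: float) -> float:
    """Absolute difference as a fraction of the reference value."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return abs(measured - reference) / abs(reference)


def reproduces(measured: float, reference: float, tolerance: float = 0.05) -> bool:
    """True when the measured value is within the relative tolerance."""
    return relative_gap(measured, reference) <= tolerance


def report_line(metric: str, measured: float, reference: float, source: str) -> str:
    """One README-ready line citing both your number and the reference."""
    gap_pct = 100 * relative_gap(measured, reference)
    return f"{metric}: {measured:.3f} (reference {reference:.3f} from {source}, gap {gap_pct:.1f}%)"


if __name__ == "__main__":
    # Placeholder numbers for illustration only.
    print(report_line("validation loss", 3.10, 3.00, "reference repo README"))
    print("reproduced:", reproduces(3.10, 3.00))
```

For accuracy-style metrics, say explicitly whether your tolerance is relative (as here) or in percentage points, since the two can differ a lot at low scores.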

10. Final checklist before saying "done"

One-line description in the README that a non-AI engineer understands.
A diagram on one screen.
Quickstart that runs in <5 minutes.
A headline number with conditions.
An honest "limitations" section.
A requirements.txt / pyproject.toml that pins versions.
A Makefile or shell script for the common commands.
At least one test that proves the system actually works end-to-end.
You have rehearsed the 60-second pitch out loud, three times.
You can answer the 10 interview questions in Section 8 with no prep.

When all 10 are checked: ship it. Add the link to your resume. Begin applying.

11. The meta-message

Phases 1–10 give you the knowledge. The capstone gives you the proof. The interview is just the bridge between the two.

If you've made it this far in the curriculum, you have the technical chops to work alongside engineers at Anthropic, OpenAI, DeepMind, and Meta FAIR. The remaining 20% of the work (the README, the diagram, the rehearsed pitch) is what separates a candidate who can do the job from a candidate who gets the job.

Ship the capstone. Then write the resume bullet:

Built [system] from scratch — [throughput / quality / cost number]. Reproduced [reference benchmark] within X%. Open-source on GitHub: [link].

That's the bullet that puts you in the interview room. Phases 1–10 get you the offer once you're there.

Good luck. 🛸
