04 — System Design Walkthroughs (Interview Prep Index)
Practice prompts mapped to the system-design/ folder.
How to Practice
For each prompt below:
- Read the prompt only — not the linked solution.
- Set a 45-minute timer.
- Whiteboard / type out: clarifying Qs → estimation → architecture → 3 deep dives → tradeoffs.
- Compare to the solution doc.
- Note 3 things you missed in a gaps.md for spaced repetition.
Prompts
P1. "Design an LLM inference service"
Variants you might be asked:
- "...handling 100k QPS across multiple model sizes"
- "...with multi-tenant rate limiting and per-tenant fine-tunes (LoRA)"
- "...with sub-1s TTFT SLO at p99"
➜ See system-design/01-llm-inference-gateway.md
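For the multi-tenant rate limiting variant of P1, one common building block is a per-tenant token bucket. The following is a minimal in-process sketch, not the approach taken in the linked solution doc. The class names TokenBucket and TenantRateLimiter, the default rate and capacity, and the injectable now parameter are all assumptions made for illustration. A real gateway would likely keep this state in a shared store.

```python
import time


class TokenBucket:
    """Refills `rate` tokens per second, up to `capacity` (the burst size)."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full so a new tenant can burst
        self.updated = time.monotonic() if now is None else now

    def allow(self, cost=1.0, now=None):
        # `now` is injectable so the behavior can be tested deterministically.
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantRateLimiter:
    """Hypothetical wrapper: one bucket per tenant, created lazily."""

    def __init__(self, default_rate, default_capacity):
        self.default_rate = default_rate
        self.default_capacity = default_capacity
        self.buckets = {}

    def allow(self, tenant_id, cost=1.0, now=None):
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self.default_rate, self.default_capacity, now)
            self.buckets[tenant_id] = bucket
        return bucket.allow(cost, now)
```

In an interview, `cost` is a useful knob to discuss: charging by requested output tokens instead of by request keeps one tenant's long generations from starving others.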
P2. "Walk me through pretraining a 70B model from scratch"
Variants:
- "...on 1024 H100s, with a 1.5T token budget"
- "...how would you handle a node failure mid-run?"
- "...what numerical precision and why?"
➜ See system-design/02-distributed-pretraining.md
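For the estimation step on P2, a back-of-envelope wall-clock figure can be sketched with the common rule of thumb that training costs roughly 6 × parameters × tokens FLOPs. The per-GPU peak (about 989 TFLOP/s dense BF16 for an H100 SXM) and the 40% model FLOPs utilization (MFU) are assumptions, not numbers from the solution doc. The function name is hypothetical.

```python
def pretraining_days(params, tokens, gpus, peak_flops_per_gpu, mfu):
    """Estimate wall-clock days from the ~6 * N * D training-FLOPs heuristic."""
    total_flops = 6.0 * params * tokens
    effective_flops_per_sec = gpus * peak_flops_per_gpu * mfu
    seconds = total_flops / effective_flops_per_sec
    return seconds / 86_400  # seconds per day


days = pretraining_days(
    params=70e9,                # 70B model
    tokens=1.5e12,              # 1.5T token budget
    gpus=1024,                  # 1024 H100s
    peak_flops_per_gpu=989e12,  # assumed dense BF16 peak per GPU
    mfu=0.40,                   # assumed utilization
)
print(f"{days:.1f} days")
```

Under these assumptions the run lands at roughly 18 days, before accounting for restarts, evaluation pauses, or failed nodes. The estimate scales inversely with MFU, so it is worth stating the MFU assumption out loud.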
P3. "Design a RAG system over 100M documents at 1k QPS"
Variants:
- "...with multi-tenant ACLs"
- "...with hourly document updates"
- "...how do you continuously evaluate it?"
➜ See system-design/03-rag-at-scale.md
P4. "Build a self-serve fine-tuning platform for internal users"
Variants:
- "...support SFT, LoRA, and DPO methods"
- "...with automatic eval gating"
- "...how do you bin-pack jobs across a heterogeneous GPU fleet?"
➜ See system-design/04-finetuning-platform.md
P5. "Design a continuous evaluation platform for LLMs"
Variants:
- "...how do you trust LLM-judge results?"
- "...how do you run code evals safely?"
- "...how do you detect benchmark contamination?"
➜ See system-design/05-eval-platform.md
P6. "Build a pretraining data pipeline from raw CommonCrawl"
Variants:
- "...10TB of input, deduped + filtered + tokenized"
- "...with PII scrubbing and lineage tracking"
- "...how do you tune the data mix?"
➜ See system-design/06-pretraining-data-pipeline.md
Bonus / Less-Common Prompts
- Long-context serving (1M tokens): KV cache management, paged attention, ring attention, sequence parallelism.
- Edge-device LLM: 4-bit quant, GGUF/llama.cpp, on-device privacy.
- Multi-modal serving: image+text inputs, vision encoder caching, modality routing.
- Agentic system at scale: tool sandboxing, parallel tool calls, cost control, loop limits.
- Cost-optimal cascading: small-model triage → big-model fallback; routing classifier.
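The cascading idea in the last bullet can be sketched as a confidence-gated fallback. This is an illustrative skeleton only: the callables small_model and big_model, the (answer, confidence) return shape, and the 0.8 threshold are assumptions. A production router might use a separate trained classifier in place of the small model's own confidence.

```python
from typing import Callable, Tuple


def cascade(
    prompt: str,
    small_model: Callable[[str], Tuple[str, float]],
    big_model: Callable[[str], str],
    threshold: float = 0.8,
) -> Tuple[str, str]:
    """Return (answer, tier). Try the cheap model first and escalate
    to the expensive model only when its confidence is below `threshold`."""
    answer, confidence = small_model(prompt)
    if confidence >= threshold:
        return answer, "small"
    return big_model(prompt), "big"


# Stub models for illustration only.
def stub_small(prompt: str) -> Tuple[str, float]:
    # Pretend short prompts are easy and long ones are uncertain.
    return "small-answer", 0.95 if len(prompt) < 20 else 0.30


def stub_big(prompt: str) -> str:
    return "big-answer"


print(cascade("2 + 2?", stub_small, stub_big))
print(cascade("Summarize this long contract clause by clause", stub_small, stub_big))
```

The threshold sets the cost and quality tradeoff: raising it sends more traffic to the big model, so it is usually tuned against an eval set alongside the per-request cost of each tier.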