04 — System Design Walkthroughs (Interview Prep Index)

Practice prompts, each mapped to a solution document in the system-design/ folder.

How to Practice

For each prompt below:

1. Read the prompt only, not the linked solution.
2. Set a 45-minute timer.
3. Whiteboard or type out: clarifying questions → estimation → architecture → 3 deep dives → tradeoffs.
4. Compare your answer to the solution doc.
5. Record 3 things you missed in gaps.md for spaced repetition.

Prompts

P1. "Design an LLM inference service"

Variants you might be asked:

"...handling 100k QPS across multiple model sizes"
"...with multi-tenant rate limiting and per-tenant fine-tunes (LoRA)"
"...with sub-1s TTFT SLO at p99"

➜ See system-design/01-llm-inference-gateway.md

P2. "Walk me through pretraining a 70B model from scratch"

Variants:

"...on 1024 H100s, with a 1.5T token budget"
"...how would you handle a node failure mid-run?"
"...what numerical precision and why?"

➜ See system-design/02-distributed-pretraining.md
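
As a warm-up for the estimation step, the 1024-H100 variant of P2 can be sized with the common approximation that training a dense transformer costs about 6 FLOPs per parameter per token. The sketch below is a back-of-envelope calculation, not the linked solution. The per-GPU peak (about 989 TFLOP/s dense BF16 for an H100 SXM) and the 40% model FLOPs utilization (MFU) are assumptions; substitute your own figures.

```python
# Back-of-envelope training time for a dense 70B model.
# Values marked "assumption" are not given in the prompt.

def training_days(params, tokens, gpus, peak_flops_per_gpu, mfu):
    """Estimate wall-clock days using the ~6 * N * D FLOPs approximation."""
    total_flops = 6 * params * tokens                  # forward + backward pass
    cluster_flops_per_s = gpus * peak_flops_per_gpu * mfu
    return total_flops / cluster_flops_per_s / 86_400  # 86,400 seconds per day

days = training_days(
    params=70e9,                # from the prompt
    tokens=1.5e12,              # from the prompt variant
    gpus=1024,                  # from the prompt variant
    peak_flops_per_gpu=989e12,  # assumption: H100 SXM dense BF16 peak
    mfu=0.40,                   # assumption: achieved fraction of peak
)
print(f"{days:.1f} days")
```

Under these assumptions the run takes roughly 18 days, before accounting for restarts, evaluation pauses, or checkpoint overhead. Halving the MFU doubles the estimate, which is why utilization is usually worth a deep dive.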

P3. "Design a RAG system over 100M documents at 1k QPS"

Variants:

"...with multi-tenant ACLs"
"...with hourly document updates"
"...how do you continuously evaluate it?"

➜ See system-design/03-rag-at-scale.md
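
For the estimation step of P3, a useful first number is the raw size of the embedding store, since it decides whether the index fits in memory on one node or must be sharded. The sketch below is illustrative only: the chunk count per document (5), the embedding dimension (768), and the two storage precisions are all assumptions, and the figure excludes index overhead (graph links, inverted lists) and metadata.

```python
# Rough embedding-storage estimate for a 100M-document corpus.
# chunks_per_doc, dim, and bytes_per_value are assumptions, not from the prompt.

def index_bytes(num_docs, chunks_per_doc, dim, bytes_per_value):
    """Raw vector storage only; excludes index overhead and metadata."""
    return num_docs * chunks_per_doc * dim * bytes_per_value

fp32 = index_bytes(100_000_000, 5, 768, 4)  # float32 vectors
int8 = index_bytes(100_000_000, 5, 768, 1)  # 8-bit quantized vectors
print(f"float32: {fp32 / 1e9:.0f} GB, int8: {int8 / 1e9:.0f} GB")
# prints "float32: 1536 GB, int8: 384 GB"
```

With these inputs the float32 store is about 1.5 TB, which motivates discussing sharding, quantization, or disk-backed indexes in the deep dives.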

P4. "Build a self-serve fine-tuning platform for internal users"

Variants:

"...support SFT, LoRA, and DPO methods"
"...with automatic eval gating"
"...how do you bin-pack jobs across a heterogeneous GPU fleet?"

➜ See system-design/04-finetuning-platform.md

P5. "Design a continuous evaluation platform for LLMs"

Variants:

"...how do you trust LLM-judge results?"
"...how do you run code evals safely?"
"...how do you detect benchmark contamination?"

➜ See system-design/05-eval-platform.md

P6. "Build a pretraining data pipeline from raw CommonCrawl"

Variants:

"...10TB of input, deduped + filtered + tokenized"
"...with PII scrubbing and lineage tracking"
"...how do you tune the data mix?"

➜ See system-design/06-pretraining-data-pipeline.md

Bonus / Less-Common Prompts

Long-context serving (1M tokens): KV cache management, paged attention, ring attention, sequence parallelism.
Edge-device LLM: 4-bit quantization, GGUF/llama.cpp, on-device privacy.
Multi-modal serving: image+text inputs, vision encoder caching, modality routing.
Agentic system at scale: tool sandboxing, parallel tool calls, cost control, loop limits.
Cost-optimal cascading: small-model triage → big-model fallback; routing classifier.
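
The cascading idea in the last item can be sketched in a few lines. This is a minimal illustration, not a reference design: it uses a confidence threshold on the small model's own output in place of a separately trained routing classifier, and `small_model`, `large_model`, and the 0.8 threshold are hypothetical stand-ins.

```python
# Minimal small-model-first cascade with large-model fallback.
# Both models are hypothetical stubs; swap in real clients.

def small_model(prompt):
    """Stub: returns (answer, confidence); confident only on short prompts."""
    confidence = 0.9 if len(prompt.split()) <= 5 else 0.3
    return f"small:{prompt}", confidence

def large_model(prompt):
    """Stub: returns an answer from the expensive model."""
    return f"large:{prompt}"

def cascade(prompt, small, large, threshold=0.8):
    """Serve from the small model when it is confident, else fall back."""
    answer, confidence = small(prompt)
    if confidence >= threshold:
        return answer, "small"
    return large(prompt), "large"

print(cascade("What is 2 + 2?", small_model, large_model))
print(cascade("Summarize the tradeoffs of ring attention in detail",
              small_model, large_model))
```

Note that a fallback request pays for both models, so the threshold trades cost against quality; a dedicated routing classifier that runs before either model avoids that double cost.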