04 — System Design Walkthroughs (Interview Prep Index)

Practice prompts, each mapped to a solution doc in the system-design/ folder.

How to Practice

For each prompt below:

  1. Read the prompt only — not the linked solution.
  2. Set a 45-minute timer.
  3. Whiteboard / type out: clarifying Qs → estimation → architecture → 3 deep dives → tradeoffs.
  4. Compare to the solution doc.
  5. Note three things you missed in gaps.md and revisit them with spaced repetition.

Prompts

P1. "Design an LLM inference service"

Variants you might be asked:

  • "...handling 100k QPS across multiple model sizes"
  • "...with multi-tenant rate limiting and per-tenant fine-tunes (LoRA)"
  • "...with a sub-1s p99 TTFT (time-to-first-token) SLO"

➜ See system-design/01-llm-inference-gateway.md
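
For the multi-tenant rate limiting variant, one common building block is a per-tenant token bucket. The sketch below is a minimal single-process illustration and is not taken from the linked solution doc. The class names (TokenBucket, TenantLimiter), the rate and capacity values, and the injectable clock are all assumptions made for the example. A production gateway would likely keep this state in a shared store.

```python
import time


class TokenBucket:
    """Hypothetical token bucket: refills at `rate` tokens/s up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate            # tokens added per second
        self.capacity = capacity    # maximum burst size
        self.clock = clock          # injectable so the logic is testable
        self.tokens = capacity      # start full
        self.last = clock()

    def allow(self, cost=1.0):
        # Refill based on elapsed time, capped at capacity.
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """Hypothetical wrapper: one independent bucket per tenant ID."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self._rate = rate
        self._capacity = capacity
        self._clock = clock
        self._buckets = {}

    def allow(self, tenant, cost=1.0):
        bucket = self._buckets.get(tenant)
        if bucket is None:
            bucket = TokenBucket(self._rate, self._capacity, self._clock)
            self._buckets[tenant] = bucket
        return bucket.allow(cost)
```

In an interview, the `cost` argument is a natural hook for charging by prompt or output tokens rather than by request, which is often the fairer unit for LLM traffic.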


P2. "Walk me through pretraining a 70B model from scratch"

Variants:

  • "...on 1024 H100s, with a 1.5T token budget"
  • "...how would you handle a node failure mid-run?"
  • "...what numerical precision and why?"

➜ See system-design/02-distributed-pretraining.md
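
For the estimation step on the "1024 H100s, 1.5T tokens" variant, a rough wall-clock figure can be sketched with the widely used approximation of about 6 FLOPs per parameter per token. The per-GPU peak (989 TFLOPS dense BF16) and the 40% model FLOPs utilization (MFU) below are assumptions for illustration, not numbers from the solution doc. Substitute whatever you would defend in the interview.

```python
def training_days(params, tokens, gpus, peak_flops_per_gpu, mfu):
    """Estimate wall-clock days using total FLOPs ~= 6 * params * tokens."""
    total_flops = 6.0 * params * tokens
    sustained_flops_per_s = gpus * peak_flops_per_gpu * mfu
    return total_flops / sustained_flops_per_s / 86_400  # seconds per day


PARAMS = 70e9        # from the prompt: 70B parameters
TOKENS = 1.5e12      # from the prompt: 1.5T tokens
GPUS = 1024          # from the prompt: 1024 H100s
PEAK_BF16 = 989e12   # ASSUMPTION: dense BF16 peak per GPU, FLOP/s
MFU = 0.40           # ASSUMPTION: achieved fraction of peak

days = training_days(PARAMS, TOKENS, GPUS, PEAK_BF16, MFU)
print(f"{days:.1f} days")  # roughly 18 days under these assumptions
```

The estimate ignores restarts, evaluation pauses, and checkpoint overhead, so it is a lower bound to reason from rather than a schedule.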


P3. "Design a RAG system over 100M documents at 1k QPS"

Variants:

  • "...with multi-tenant ACLs"
  • "...with hourly document updates"
  • "...how do you continuously evaluate it?"

➜ See system-design/03-rag-at-scale.md
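
For the estimation step, a back-of-envelope sizing of the raw vector index helps decide early between in-memory, sharded, and quantized designs. The chunks-per-document count, embedding dimension, and value widths below are assumptions chosen only to make the arithmetic concrete. Only the 100M document count comes from the prompt.

```python
def index_bytes(docs, chunks_per_doc, dim, bytes_per_value):
    """Raw embedding storage, excluding index overhead and metadata."""
    return docs * chunks_per_doc * dim * bytes_per_value


DOCS = 100_000_000     # from the prompt: 100M documents
CHUNKS_PER_DOC = 10    # ASSUMPTION: average chunks per document
DIM = 768              # ASSUMPTION: embedding dimension

fp32_tb = index_bytes(DOCS, CHUNKS_PER_DOC, DIM, 4) / 1e12
int8_tb = index_bytes(DOCS, CHUNKS_PER_DOC, DIM, 1) / 1e12
print(f"float32: {fp32_tb:.3f} TB, int8: {int8_tb:.3f} TB")
```

Under these assumptions the float32 vectors alone come to about 3 TB, which is the kind of number that motivates sharding or quantization in the architecture discussion.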


P4. "Build a self-serve fine-tuning platform for internal users"

Variants:

  • "...support SFT, LoRA, and DPO methods"
  • "...with automatic eval gating"
  • "...how do you bin-pack jobs across a heterogeneous GPU fleet?"

➜ See system-design/04-finetuning-platform.md


P5. "Design a continuous evaluation platform for LLMs"

Variants:

  • "...how do you trust LLM-judge results?"
  • "...how do you run code evals safely?"
  • "...how do you detect benchmark contamination?"

➜ See system-design/05-eval-platform.md


P6. "Build a pretraining data pipeline from raw CommonCrawl"

Variants:

  • "...10TB of input, deduped + filtered + tokenized"
  • "...with PII scrubbing and lineage tracking"
  • "...how do you tune the data mix?"

➜ See system-design/06-pretraining-data-pipeline.md


Bonus / Less-Common Prompts

  • Long-context serving (1M tokens): KV cache management, paged attention, ring attention, sequence parallelism.
  • Edge-device LLM: 4-bit quantization, GGUF/llama.cpp, on-device privacy.
  • Multi-modal serving: image+text inputs, vision encoder caching, modality routing.
  • Agentic system at scale: tool sandboxing, parallel tool calls, cost control, loop limits.
  • Cost-optimal cascading: small-model triage → big-model fallback; routing classifier.
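
The cascading prompt above can be sketched as a simple confidence-gated router. This is a minimal illustration under stated assumptions: the small model is assumed to return a confidence score alongside its answer, the 0.8 threshold is arbitrary, and the two stub models are hypothetical stand-ins. A real system might instead use a separate routing classifier.

```python
def cascade(prompt, small, large, threshold=0.8):
    """Try the small model first; fall back to the large one if unsure.

    `small` returns (answer, confidence in [0, 1]); `large` returns answer.
    Returns (answer, tier) so the caller can track cost per tier.
    """
    answer, confidence = small(prompt)
    if confidence >= threshold:
        return answer, "small"
    return large(prompt), "large"


# Hypothetical stubs: short prompts are treated as "easy" for the demo.
def small_stub(prompt):
    confidence = 0.9 if len(prompt) < 20 else 0.3
    return f"small:{prompt}", confidence


def large_stub(prompt):
    return f"large:{prompt}"


easy = cascade("2 + 2?", small_stub, large_stub)
hard = cascade("Prove the claim in full detail.", small_stub, large_stub)
```

Returning the tier with the answer keeps the cost-control story measurable: the fraction of traffic served by the small tier is the main lever on average cost per request.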