01 — Concepts Cheatsheet (Top 20 Answers)
Crisp answers to the Top 20 Interview Questions from the master README. Each answer is intended to be ~60-90 seconds spoken.
1. Why does scaled dot-product attention divide by √dₖ?
Without scaling, the dot-product q·k has variance dₖ (assuming the components of q and k are independent with mean 0 and variance 1). For dₖ=64, dot-products have stddev ~8, pushing softmax into saturation regions where gradients are near-zero. Dividing by √dₖ brings the variance back to ≈ 1, so softmax stays in its sensitive range. This is purely about gradient flow and numerical stability at init; it does not make the math "more correct" otherwise.
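The variance claim is easy to check empirically. The sketch below is only an illustration: it assumes standard-normal components, and the helper names (sample_dot_products, std) are made up for this example.

```python
import math
import random

def sample_dot_products(d_k, trials=5000, seed=0):
    """Draw q.k for random q, k whose components are i.i.d. N(0, 1)."""
    rng = random.Random(seed)
    return [sum(rng.gauss(0, 1) * rng.gauss(0, 1) for _ in range(d_k))
            for _ in range(trials)]

def std(values):
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

d_k = 64
raw = sample_dot_products(d_k)
scaled = [v / math.sqrt(d_k) for v in raw]  # the sqrt(d_k) scaling from attention
print(f"std(q.k) ~ {std(raw):.2f}; std(q.k / sqrt(d_k)) ~ {std(scaled):.2f}")
```

The first number should land near √64 = 8 and the second near 1.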
2. KV-cache: what's stored, why it speeds inference, memory cost.
Stored: per layer, per KV head, the key and value tensors for all previously generated tokens. Shape per layer: (batch, n_kv_heads, seq_so_far, d_head), where n_kv_heads equals n_heads for plain MHA and is smaller under GQA/MQA. Two tensors (K and V).
Why faster: at decode step t, the new token attends over the K/V of tokens [0..t-1] plus its own. With a cache you reuse the earlier entries and compute K/V only for token t. Without a cache, every step re-runs the model over the whole prefix, so per-step projection and FFN work grows linearly with t and the total over a generation grows at least quadratically.
Memory: 2 (K+V) × n_layers × n_kv_heads × d_head × seq × batch × bytes_per_element. For Llama-3-8B (32 layers, 8 KV heads via GQA, d_head 128) at 8k context in BF16, that is ~1 GB per request; the same model with full MHA (32 KV heads) would need ~4 GB. This is why long contexts are expensive: at long sequence lengths and large batch sizes, the KV cache rather than the weights dominates GPU memory.
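The memory formula is worth having as a one-liner. This sketch assumes Llama-3-8B's published configuration (32 layers, 32 query heads, 8 KV heads, head dimension 128), which is not spelled out above; kv_cache_bytes is a hypothetical helper name.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    """2 tensors (K and V) per layer, each (batch, n_kv_heads, seq_len, d_head)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

GIB = 1024 ** 3
# Counting all 32 heads as KV heads (plain MHA) vs. the 8 KV heads GQA actually stores.
mha_style = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=8192)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=8192)
print(f"MHA-style: {mha_style / GIB:.1f} GiB, GQA: {gqa / GIB:.1f} GiB per request")
```

Multiply by the batch size to see how quickly concurrent long-context requests fill a GPU.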
3. Multi-Head vs Multi-Query vs Grouped-Query Attention.
- MHA: each head has its own K, V projection. Highest quality, biggest KV cache.
- MQA (Shazeer 2019): all heads share one K and V. KV cache shrinks by n_heads× (e.g., 32×). Quality slightly worse on hard tasks.
- GQA (Ainslie 2023): heads grouped; one K/V per group. Tunable middle ground (e.g., 32 query heads, 8 KV groups in Llama-3 → 4× KV reduction with near-MHA quality).
Production large models (Llama 3, Mistral, Qwen) all use GQA — best Pareto point.
4. Pre-norm vs Post-norm — why pre-norm wins for deep transformers.
- Post-norm (original "Attention Is All You Need"): x = LN(x + Attn(x)). The residual stream itself is normalized at every layer, so gradients must pass through every LN and can vanish in deep stacks.
- Pre-norm: x = x + Attn(LN(x)). The residual stream is unnormalized; the norm is applied only to the input of the sublayer. Gradient flows directly through the residual, with no LN in the way.
Pre-norm is much more stable past ~12 layers and is far less sensitive to learning-rate warmup. Nearly every modern LLM is pre-norm, usually with RMSNorm in place of LayerNorm.
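The two update rules can be sketched on plain lists. This is a toy illustration: layer_norm omits the learned gain and bias, and sublayer stands in for attention or the FFN.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_block(x, sublayer):
    # x = LN(x + f(x)): the residual stream passes through LN every layer.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_block(x, sublayer):
    # x = x + f(LN(x)): the residual stream is never normalized.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

zero_sublayer = lambda x: [0.0] * len(x)  # a sublayer that contributes nothing
x = [1.0, 2.0, 4.0]
print(pre_norm_block(x, zero_sublayer))   # residual passes through unchanged
print(post_norm_block(x, zero_sublayer))  # residual is rescaled by LN anyway
```

Even when the sublayer contributes nothing, post-norm still transforms the residual, which is the "LN in the way" of the gradient path.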
5. RoPE vs ALiBi vs absolute positional embeddings.
- Absolute (sinusoidal/learned): added to token embeddings. Doesn't extrapolate beyond trained context.
- ALiBi: adds a position-dependent bias to attention scores. Linear penalty on distance. Extrapolates well, but the bias encodes only how far apart two tokens are, not their order.
- RoPE: rotates Q and K vectors by angles depending on position. The dot-product q_i · k_j then becomes a function of (i - j) (relative position). Extrapolates somewhat with tricks (NTK scaling, YaRN). Used by Llama, Mistral, Qwen, Gemma.
RoPE wins because it's relative and preserves the dot-product structure.
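The relative-position property can be verified numerically. This is a minimal sketch that rotates adjacent pairs of dimensions with the usual base of 10000; production implementations often pair dimensions differently (e.g., first half with second half), and rope is a name chosen for this example.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by the angle pos * base^(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.0, 0.4, -0.7, 0.2]
near = dot(rope(q, 5), rope(k, 3))      # positions 5 and 3, offset 2
far = dot(rope(q, 105), rope(k, 103))   # positions 105 and 103, offset 2
print(near, far)
```

Both scores agree up to floating-point error because only the offset i - j survives in the dot product.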
6. BPE: how training and tokenization work; why byte-level matters.
Training: start with a vocab of single characters (or single bytes). Repeatedly find the most frequent adjacent pair in the corpus → merge into a new token. Add to vocab. Repeat until target vocab size.
Encoding: greedily apply the learned merges (in order) to a string.
Byte-level (GPT-2/3/4): vocab starts at 256 single bytes, not Unicode chars. Guarantees any UTF-8 string can be encoded with no UNK token. It is combined with a regex pre-tokenization step that first splits text into word-like chunks, so merges never span chunk boundaries.
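The training and encoding loops fit in a few lines. This toy sketch works on characters of pre-split words (no byte-level handling, no regex pre-tokenizer) and breaks frequency ties by first occurrence; train_bpe, merge_pair, and encode are names chosen for this example.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    corpus = Counter(tuple(w) for w in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    symbols = tuple(word)
    for pair in merges:  # apply merges in the order they were learned
        symbols = merge_pair(symbols, pair)
    return list(symbols)

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=4)
print(merges)
print(encode("lowest", merges))
```

Note that "lowest" never appears in the training words, yet it encodes into two learned subwords.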
7. Greedy / top-k / top-p / temperature — when to use which.
- Greedy (temp=0, top-1): deterministic; best for math/code/JSON.
- Temperature: divides logits before softmax. T<1 sharpens (more confident), T>1 flattens (more diverse). T=0.7 is a common chat default.
- Top-k: keep the k most-likely tokens, renormalize, sample. Cuts the long tail.
- Top-p (nucleus): keep the smallest set whose cumulative probability ≥ p. Adapts the cutoff to entropy — narrow when the model is confident, wider when not. Generally preferred over top-k.
In practice: temp=0.7, top_p=0.9 is a sane chat default; temp=0 for tasks with a single right answer.
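The knobs compose in a fixed order: temperature first, then truncation, then sampling. The sketch below is a plain-Python illustration with made-up helper names, not any library's API.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:  # smallest prefix whose cumulative mass reaches p
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(logits, temperature=0.7, top_p=0.9, rng=random):
    if temperature == 0:  # greedy: argmax, no randomness
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))    # keeps the two most likely tokens
print(top_p_filter(probs, 0.9))  # keeps three tokens: 0.5 + 0.3 + 0.15 >= 0.9
```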
8. PPO vs DPO vs ORPO vs RLHF vs RLAIF.
- RLHF (PPO): train a reward model from preferences → use PPO to optimize policy against it. Powerful, but unstable; needs careful KL constraint to a reference model.
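- DPO (Rafailov 2023): solves the KL-constrained RLHF objective in closed form, which expresses the reward through policy and reference log-probabilities, then minimizes a logistic (contrastive) loss directly on (chosen, rejected) pairs. No separate reward model, no rollouts. Simpler, and competitive with PPO in many reported comparisons.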
- ORPO: combine SFT and preference loss in a single stage. Even simpler.
- RLAIF: same loop as RLHF but the preference labels come from an LLM judge instead of humans. Cheaper, but quality bounded by judge.
Default for new projects in 2024+: DPO for stability + simplicity, then maybe PPO if you've maxed out DPO.
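For reference, the DPO loss for a single preference pair is short enough to write out. This sketch takes sequence log-probabilities as plain floats; beta=0.1 is a commonly used value and an assumption here, as is the function name.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]) for one pair."""
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    if margin < -30:  # avoid overflow in exp; softplus(x) ~ x for large x
        return -margin
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: margin 0, loss log(2).
print(dpo_loss(-11.0, -11.0, -11.0, -11.0))
# Policy has raised the chosen answer and lowered the rejected one: lower loss.
print(dpo_loss(-10.0, -12.0, -11.0, -11.0))
```

The reference log-probabilities play the role of the KL anchor that PPO-based RLHF enforces explicitly.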
9. LoRA: math, why memory-efficient, what r and α control.
LoRA replaces a weight update ΔW (which would be full-rank) with a low-rank decomposition: ΔW = B A where A ∈ ℝ^{r×k}, B ∈ ℝ^{d×r}, with r << d, k. Forward: y = W x + (α/r) · B (A x). Only A and B train; W is frozen.
Memory savings: instead of d×k trainable params per matrix, you train r(d+k). For r=16, d=k=4096: 16.8M → 131k, 128× fewer trainable params, so gradients and optimizer states for the adapter fit easily.
- r: rank of the update; bigger = more capacity. r=8-32 is typical.
- α: scaling factor; the update is multiplied by α/r, so it acts like a learning-rate multiplier for the adapter. Common convention: α = 2r.
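The parameter arithmetic and the forward pass can be checked with a small sketch. It uses nested lists rather than a tensor library, and the helper names are hypothetical.

```python
def lora_param_counts(d, k, r):
    """Trainable params for a full d x k update vs. a rank-r LoRA update."""
    return d * k, r * (d + k)

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x); W is frozen, A (r x k) and B (d x r) train."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

full, lora = lora_param_counts(d=4096, k=4096, r=16)
print(full, lora, full // lora)

W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (d = k = 2)
A = [[1.0, 1.0]]              # r = 1
B = [[1.0], [0.0]]
print(lora_forward(W, A, B, x=[1.0, 2.0], alpha=2.0))
```

In the LoRA paper B starts at zero, so the adapted model is exactly the base model at step 0.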
10. QLoRA's tricks: NF4, double-quant, paged optimizers.
QLoRA = LoRA on a 4-bit quantized base model.
- NF4 (NormalFloat 4-bit): a 4-bit datatype whose 16 quantization levels sit at quantiles of a standard normal distribution, so each level is used about equally often when weights are approximately N(0, σ²). Information-theoretically near-optimal for normally distributed data.
- Double quantization: the quantization scales themselves are quantized, saving another ~0.4 bits/param.
- Paged optimizers: page Adam's state in/out of GPU memory via NVIDIA Unified Memory, avoiding OOM spikes during gradient checkpointing.
Result: 7B fits in ~6 GB; 70B in ~48 GB → fine-tunable on a single A100 80GB.
11. RAG: chunking strategies, hybrid search, reranking, when RAG beats fine-tuning.
- Chunking: token-aware sliding window (e.g., 400 tokens, 80 overlap) — preserves context across boundaries. Semantic / structural splits when source has structure (markdown headers).
- Hybrid search: BM25 (lexical) + dense embeddings, fused via Reciprocal Rank Fusion. Catches both exact-match queries (names, IDs) and paraphrase queries.
- Reranking: cross-encoder (e.g., bge-reranker) on top-50 → top-5. Single biggest quality lever in RAG; cheap relative to LLM.
- RAG vs fine-tune: RAG when knowledge changes / per-tenant; fine-tune for new style, format, or capabilities. Often: do both.
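Reciprocal Rank Fusion from the hybrid-search bullet is simple enough to show in full. The constant k=60 is the value commonly used in the literature and is an assumption here; the document IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists; each doc scores sum(1 / (k + rank)) over lists."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

bm25_hits = ["a", "b", "c"]   # lexical ranking
dense_hits = ["b", "d", "a"]  # embedding ranking
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

RRF uses only ranks, so BM25 scores and cosine similarities never need to be put on a common scale.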
12. FlashAttention: what makes it fast.
Standard attention materializes the T×T attention matrix in HBM (slow GPU memory). FlashAttention computes attention tile by tile, fusing matmul + softmax + matmul, keeping intermediates in SRAM (fast on-chip memory). Uses an online softmax algorithm so you never need the full row at once.
Result: same math, but ~2-5× faster wall-clock and linear memory in sequence length (vs quadratic). It's a memory I/O optimization, not an algorithmic one.
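The online softmax trick is the part worth being able to write down. The sketch below handles one query row in pure Python and only illustrates the rescaling logic; real kernels do this per tile in SRAM. Function names are chosen for this example.

```python
import math

def attention_row_reference(scores, values):
    """Standard softmax(scores) @ values, needing the full row of scores."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_row_online(scores, values, block_size=2):
    """Same result, seeing keys/values one block at a time."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(scores), block_size):
        block_scores = scores[start:start + block_size]
        block_values = values[start:start + block_size]
        new_max = max(running_max, max(block_scores))
        rescale = math.exp(running_max - new_max)  # 0.0 on the first block
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(block_scores, block_values):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * vi for a, vi in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]

scores = [0.1, 2.0, -1.0, 3.5, 0.7]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.3]]
print(attention_row_reference(scores, values))
print(attention_row_online(scores, values))
```

Whenever a new block raises the running max, the previously accumulated sum and output are rescaled, which is why the full row never has to be materialized.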
13. Continuous batching: vLLM's PagedAttention.
Static batching: pad all sequences to the longest, run them as a batch; the batch finishes when the slowest sequence finishes. Compute is wasted on padding, and GPU slots sit idle while waiting for the longest sequence.
Continuous batching: at each decode step, finished sequences leave and new ones enter. Requires dynamic batch shapes.
PagedAttention makes this efficient: KV cache stored in fixed-size blocks (like virtual memory pages). New requests get blocks from a free list; finished requests return blocks. Almost no fragmentation (waste is limited to the unused tail of each sequence's last block); supports prefix sharing.
Combined effect: 2-5× throughput vs static batching at similar latency.
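The block-table bookkeeping can be sketched as a toy allocator. This is an illustration of the idea only, not vLLM's actual data structures; the class and method names are hypothetical.

```python
class BlockAllocator:
    """Hands out fixed-size KV-cache blocks from a free list."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # request id -> list of physical block ids
        self.lengths = {}       # request id -> number of tokens stored

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a scheduler would queue or preempt")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id):
        """Return a finished request's blocks to the free list."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

allocator = BlockAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    allocator.append_token("req-a")
print(allocator.block_tables["req-a"], len(allocator.free_blocks))
allocator.free("req-a")
print(len(allocator.free_blocks))
```

A 17-token request occupies two blocks, so the waste is bounded by one partially filled block per sequence.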
14. Quantization: PTQ vs QAT, INT8 vs FP8 vs INT4 (AWQ/GPTQ).
- PTQ (post-training): quantize after training, calibrate scales on a small dataset. Fast, no retraining. Default for inference.
- QAT (during training): simulate quantization in forward pass during training. Higher quality, much more expensive.
- INT8: weights+activations 8-bit. Solid baseline. ~2× speedup, ~negligible quality loss.
- FP8 (E4M3 / E5M2): 8-bit float, supported on H100/H200. Better dynamic range than INT8 → more accurate at the same bits.
- INT4 (AWQ / GPTQ): 4-bit weights, BF16 activations. ~4× memory reduction, small but measurable quality drop. AWQ uses per-channel salient-weight protection; GPTQ uses Hessian-aware error compensation.
Modern serving stack: FP8 weights + FP8 KV cache + BF16 activations.
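The simplest form of PTQ is symmetric absmax quantization with one scale per tensor. The sketch below shows only that baseline; AWQ and GPTQ are considerably more involved, and the function names are chosen for this example.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single absmax scale."""
    scale = max(abs(w) for w in weights) / 127.0 or 1.0  # guard against all-zero input
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

The round-trip error per weight is at most half a quantization step (scale / 2), which is why outlier values that inflate the scale hurt accuracy.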
15. Speculative decoding: how it works and when it helps.
A small draft model generates K candidate tokens. The target model runs ONE forward pass that verifies all K in parallel (since attention can compute K logits at once). Accept the longest prefix that matches what target would have sampled (with a probabilistic check that preserves target's distribution).
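Why faster: 1 target forward pass produces ≥1 token instead of exactly 1. With a per-token acceptance rate of ~70% and K≈5 drafted tokens, you get ~3 tokens per target call on average, so up to ~3× fewer target passes on decode; the realized speedup is lower once the draft model's own cost is counted.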
Caveats: doesn't help prefill; requires a good draft (similar to target); gives no net gain, or even a slowdown, if the draft is too slow or the acceptance rate too low. Variants: Medusa (multiple decoding heads on the target itself), EAGLE (drafts in the target's hidden-feature space rather than token space).
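The tokens-per-call figure follows from a geometric series. The sketch below assumes each drafted token is accepted independently with the same probability, which is a simplification, and treats the draft cost as a fixed fraction of a target step; both function names are hypothetical.

```python
def expected_tokens_per_target_call(acceptance_rate, num_draft_tokens):
    """Mean tokens emitted per target pass: 1 + a + a^2 + ... + a^K."""
    a, k = acceptance_rate, num_draft_tokens
    if a >= 1.0:
        return float(k + 1)
    return (1 - a ** (k + 1)) / (1 - a)

def estimated_speedup(acceptance_rate, num_draft_tokens, draft_cost_ratio):
    """draft_cost_ratio: cost of one draft step divided by one target step."""
    tokens = expected_tokens_per_target_call(acceptance_rate, num_draft_tokens)
    return tokens / (num_draft_tokens * draft_cost_ratio + 1)

print(expected_tokens_per_target_call(0.7, 4))  # about 2.77
print(expected_tokens_per_target_call(0.7, 5))  # about 2.94
print(estimated_speedup(0.7, 4, draft_cost_ratio=0.05))
print(estimated_speedup(0.2, 4, draft_cost_ratio=0.5))  # below 1: a slowdown
```

The last line shows the break-even caveat: a slow draft with poor acceptance makes decoding slower than running the target alone.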
16. Distributed training: DDP vs FSDP vs ZeRO vs Tensor Parallelism vs Pipeline Parallelism.
- DDP: each GPU has full model copy; gradients all-reduced after backward. Simple; bound by per-GPU memory.
- ZeRO (DeepSpeed) / FSDP (PyTorch): shard optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3 / FSDP) across data-parallel ranks. Communicate to gather params just-in-time during forward/backward.
- Tensor Parallelism (Megatron): shard a single weight matrix across GPUs (column- or row-parallel). Each GPU holds a slice. Requires fast interconnect (NVLink); typically TP ≤ 8 (within-node).
- Pipeline Parallelism: split model layers across GPUs into stages; mini-batch flows through. Memory savings linear in stages; needs micro-batching to hide bubbles.
Composition for 70B: TP=4 within node, PP=4 across nodes, DP=N replicas with FSDP sharding optimizer state.
17. MoE: routing, load balancing, capacity factor.
Mixture-of-Experts: each layer has E expert FFNs. A router picks top-K (usually 2) experts per token. Only those experts compute → sparse activation, large total params, fast inference per token.
- Router: a linear layer producing E logits → softmax, then keep the K largest (the kept weights are usually renormalized).
- Load balancing: without intervention, the router collapses onto a few experts ("expert collapse") while the rest go untrained. An auxiliary loss penalizes imbalance (e.g., entropy-style or load-coefficient term).
- Capacity factor: each expert handles at most (tokens / E) × C tokens; overflow tokens are dropped or skipped. C=1.25 typical.
Mixtral 8x7B: 47B total params, 13B active per token. Better quality-per-active-FLOP than dense.
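Routing and capacity limits can be sketched together. This toy version processes tokens in order and drops overflow; it uses the capacity formula above as written, whereas some implementations also scale capacity by K. All names are chosen for this example.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_tokens(router_logits, top_k=2, capacity_factor=1.25):
    """Return per-expert (token, weight) assignments, dropped pairs, and capacity."""
    num_tokens, num_experts = len(router_logits), len(router_logits[0])
    capacity = math.ceil(num_tokens / num_experts * capacity_factor)
    assignments = {e: [] for e in range(num_experts)}
    dropped = []
    for t, logits in enumerate(router_logits):
        probs = softmax(logits)
        chosen = sorted(range(num_experts), key=lambda e: probs[e], reverse=True)[:top_k]
        norm = sum(probs[e] for e in chosen)  # renormalize over the chosen experts
        for e in chosen:
            if len(assignments[e]) < capacity:
                assignments[e].append((t, probs[e] / norm))
            else:
                dropped.append((t, e))  # expert full: this token skips it
    return assignments, dropped, capacity

# Four tokens that all prefer expert 0: a collapsed router.
logits = [[5.0, 1.0, 0.0, 0.0] for _ in range(4)]
assignments, dropped, capacity = route_tokens(logits, top_k=1)
print(capacity, assignments[0], dropped)
```

With a collapsed router, half the tokens overflow expert 0 while three experts sit empty, which is the failure the auxiliary loss is meant to prevent.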
18. Eval contamination: detect, prevent.
Risk: benchmark questions appear in the training corpus → inflated scores.
Detection:
- N-gram overlap: search training data for 13-gram (or longer) substrings of eval questions. The GPT-3 paper used 13-gram overlap; later model reports describe similar overlap checks.
- Embedding-similarity scan for near-duplicates.
- Loss-based: trained models tend to have suspiciously low perplexity on memorized test items vs. fresh paraphrases.
Prevention: filter training corpus against eval suites before training; use held-out / private eval sets; run paraphrased / fresh-test variants periodically; track "dynamic" benchmarks (e.g., LiveBench).
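The n-gram overlap check is the easiest detector to sketch. This version splits on whitespace and compares in memory; a real pipeline would use the model's tokenizer and a prebuilt index over the training corpus. Function names are chosen for this example.

```python
def ngrams(text, n):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_item, training_docs, n=13):
    """True if any n-gram of the eval item also appears in a training document."""
    item_grams = ngrams(eval_item, n)
    return any(item_grams & ngrams(doc, n) for doc in training_docs)

training_docs = ["the quick brown fox jumps over the lazy dog"]
# n=3 only to keep the example short; 13 is the threshold mentioned above.
print(is_contaminated("A quick brown fox appears", training_docs, n=3))
print(is_contaminated("an entirely unrelated question", training_docs, n=3))
```

Exact n-gram matching misses paraphrases, which is why the embedding and loss-based checks are listed alongside it.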
19. Hallucinations: causes and reduction.
Causes: (1) training-data noise/contradictions; (2) sampling at decode can pick a low-probability token, and the model then commits to it; (3) context insufficient to answer; (4) RLHF can reward confident-sounding but wrong answers; (5) compression failure: model can't recall low-frequency facts.
Mitigations:
- Retrieval grounding (RAG): condition on retrieved evidence; force citations.
- Self-consistency: sample N answers, take majority — surfaces uncertainty.
- Chain-of-verification: model drafts an answer, generates verification questions about it, answers them independently, then revises the draft.
- Calibration training: teach models to say "I don't know" via DPO with refusal preferences.
- Decoding constraints: structured outputs / JSON mode; constrained-decoding for facts.
- Eval: faithfulness metrics (RAGAS), TruthfulQA, FActScore.
20. Prompt injection — defenses.
Threat: untrusted text in the model's context (a tool result, a web page, an email) contains instructions that hijack the model.
Defenses (layered, no silver bullet):
- Privilege separation: untrusted data goes in clearly-marked sections; the system prompt instructs the model to never follow instructions inside them.
- Tool sandboxing: tools authorize on the user's identity, not the model's claims. Don't let the model exfiltrate via image: <unsafe-url> or fetch arbitrary URLs.
- Output filtering: scan model output for suspicious patterns (URLs to data exfil, prompt-leak markers).
- Input filtering: classifier on incoming docs for obvious "ignore previous instructions" payloads (defeats only naive attacks).
- Human-in-the-loop for destructive actions (file deletion, money movement, sending email).
- Defense-in-depth assumption: assume the model will be jailbroken at some rate; design the surrounding system so a jailbreak can't cause unbounded damage.
Simon Willison's framing: "If you can't tolerate the worst-case behavior of an LLM with full data access, don't give an LLM full data access."