01 — Concepts Cheatsheet (Top 20 Answers)

Crisp answers to the Top 20 Interview Questions from the master README. Each answer is intended to be ~60-90 seconds spoken.


1. Why does scaled dot-product attention divide by √dₖ?

Without scaling, the dot-products q·k have variance proportional to dₖ (assuming q, k components are i.i.d. with variance 1). For dₖ=64, dot-products have stddev ~8, pushing softmax into saturation regions where gradients are near-zero. Dividing by √dₖ keeps the variance ≈ 1, so softmax stays in its sensitive range. This is purely about gradient flow / numerical stability at init — it's not about the math being "more correct" otherwise.
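The variance claim is easy to check empirically. A minimal sketch, assuming q and k components are independent standard normals (the function name and trial count are illustrative choices, not from any library):

```python
import math
import random

def dot_product_std(d_k, scale, trials=4000, seed=0):
    """Empirical std of q.k for unit-variance Gaussian q, k, optionally divided by sqrt(d_k)."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        s = sum(a * b for a, b in zip(q, k))
        values.append(s / math.sqrt(d_k) if scale else s)
    mean = sum(values) / trials
    return math.sqrt(sum((v - mean) ** 2 for v in values) / trials)

raw_std = dot_product_std(64, scale=False)    # should land near sqrt(64) = 8
scaled_std = dot_product_std(64, scale=True)  # should land near 1
```

Feeding logits with stddev near 8 into softmax tends to produce a near one-hot output, which is the saturation the answer refers to.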


2. KV-cache: what's stored, why it speeds inference, memory cost.

Stored: per layer, per attention head, the key and value tensors for all previously generated tokens. Shape per layer: (batch, n_heads, seq_so_far, d_head). Two tensors (K and V).

Why faster: at decode step t, the new token's query attends over the K/V of tokens [0..t-1] plus its own. With a cache, you reuse the stored K/V and compute only the new K/V for token t. Without a cache, every step re-runs the whole prefix through the model, so step t costs t token-forwards and total decode work grows quadratically with length.

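Memory: 2 (K+V) × n_layers × n_kv_heads × d_head × seq × batch × bytes_per_element. For Llama-3-8B (32 layers, 8 KV heads under GQA, d_head=128) at 8k context in BF16: ~1 GB per request; with full MHA (32 KV heads) the same shape would need ~4 GB. This is why long contexts are expensive: at large batch sizes and long sequences, the KV cache, not the weights, dominates GPU memory.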
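The memory formula can be turned into a quick calculator. This is a sketch that assumes a Llama-3-8B-like shape (32 layers, head dimension 128, 32 query heads, 8 KV heads) and 2 bytes per element for BF16; the function name is illustrative.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    # Leading factor 2: one K tensor plus one V tensor per layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

GIB = 1024 ** 3
mha_gib = kv_cache_bytes(32, 32, 128, 8192) / GIB  # 4.0: every head keeps its own K/V
gqa_gib = kv_cache_bytes(32, 8, 128, 8192) / GIB   # 1.0: 8 shared KV heads
```

The count that matters is the number of KV heads, not query heads, which is why grouped-query models are cheaper to serve at long context.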


3. Multi-Head vs Multi-Query vs Grouped-Query Attention.

  • MHA: each head has its own K, V projection. Highest quality, biggest KV cache.
  • MQA (Shazeer 2019): all heads share one K and V. KV cache shrinks by n_heads× (e.g., 32×). Quality slightly worse on hard tasks.
  • GQA (Ainslie 2023): heads grouped; one K/V per group. Tunable middle ground (e.g., 32 query heads, 8 KV groups in Llama-3 → 4× KV reduction with near-MHA quality).

Production large models (Llama 3, Mistral, Qwen) all use GQA — best Pareto point.


4. Pre-norm vs Post-norm — why pre-norm wins for deep transformers.

  • Post-norm (original "Attention Is All You Need"): x = LN(x + Attn(x)). The residual stream is normalized — gradients can vanish through deep stacks.
  • Pre-norm: x = x + Attn(LN(x)). The residual stream is unnormalized; the norm is just on the input to the sublayer. Gradient flows directly through the residual, no LN in the way.

Pre-norm is much more stable past ~12 layers and converges without needing learning-rate warmup gymnastics. Nearly every modern LLM is pre-norm, usually with RMSNorm in place of LayerNorm.
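The two orderings differ only in where the norm sits relative to the residual add. A toy sketch on plain lists, where `toy_sublayer` is a hypothetical stand-in for attention or the FFN and the LayerNorm has no learned gain or bias:

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(x):
    # Hypothetical stand-in for Attn(x) or FFN(x).
    return [0.5 * v + 1.0 for v in x]

def post_norm_block(x, f):
    # x = LN(x + f(x)): the residual stream itself passes through the norm.
    return layer_norm([a + b for a, b in zip(x, f(x))])

def pre_norm_block(x, f):
    # x = x + f(LN(x)): the residual path is an untouched identity.
    return [a + b for a, b in zip(x, f(layer_norm(x)))]

x = [10.0, -4.0, 2.0, 8.0]
post = post_norm_block(x, toy_sublayer)
pre = pre_norm_block(x, toy_sublayer)
```

In the pre-norm block the output minus the input is exactly the sublayer's contribution, which is the "gradient flows directly through the residual" property.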


5. RoPE vs ALiBi vs absolute positional embeddings.

  • Absolute (sinusoidal/learned): added to token embeddings. Doesn't extrapolate beyond trained context.
  • ALiBi: adds a position-dependent bias to attention scores. Linear penalty on distance. Extrapolates well, but no notion of orientation.
  • RoPE: rotates Q and K vectors by angles depending on position. The dot-product q_i · k_j then becomes a function of (i - j) (relative position). Extrapolates somewhat with tricks (NTK scaling, YaRN). Used by Llama, Mistral, Qwen, Gemma.

RoPE wins because it's relative and preserves the dot-product structure.
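The relative-position property of RoPE can be checked numerically. A minimal sketch using the interleaved-pair convention (dimensions 2i and 2i+1 rotate together) and base 10000; production implementations often pair dimensions differently, and the vectors here are arbitrary.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]), i even, by angle pos * base**(-i/d)."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
score_near = dot(rope(q, 3), rope(k, 1))   # positions (3, 1), offset 2
score_far = dot(rope(q, 10), rope(k, 8))   # positions (10, 8), same offset 2
```

Both scores agree up to floating-point error because only the offset i - j survives the dot product.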


6. BPE: how training and tokenization work; why byte-level matters.

Training: start with a vocab of single characters (or single bytes). Repeatedly find the most frequent adjacent pair in the corpus → merge into a new token. Add to vocab. Repeat until target vocab size.

Encoding: greedily apply the learned merges (in order) to a string.

Byte-level (GPT-2/3/4): vocab starts at 256 single bytes, not Unicode chars. Guarantees any UTF-8 string can be encoded with no UNK token. Combined with a regex pre-tokenization step (so merges don't cross word boundaries weirdly).
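The training and encoding loops fit in a few lines. A toy character-level sketch, with no byte-level vocabulary or regex pre-tokenization; ties are broken by highest count and then lexicographically smallest pair, which is an arbitrary choice made here for determinism.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every left-to-right occurrence of `pair` with the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    corpus = Counter(tuple(w) for w in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = min(pairs, key=lambda p: (-pairs[p], p))  # most frequent pair
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    symbols = tuple(word)
    for pair in merges:  # apply learned merges in the order they were learned
        symbols = merge_pair(symbols, pair)
    return list(symbols)

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, 3)
tokens = encode("lowest", merges)
```

On this corpus the first three merges are ('e', 's'), ('es', 't'), ('l', 'o'), so the unseen word "lowest" encodes as ['lo', 'w', 'est'].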


7. Greedy / top-k / top-p / temperature — when to use which.

  • Greedy (temp=0, top-1): deterministic; best for math/code/JSON.
  • Temperature: divides logits before softmax. T<1 sharpens (more confident), T>1 flattens (more diverse). T=0.7 is a common chat default.
  • Top-k: keep the k most-likely tokens, renormalize, sample. Cuts the long tail.
  • Top-p (nucleus): keep the smallest set whose cumulative probability ≥ p. Adapts the cutoff to entropy — narrow when the model is confident, wider when not. Generally preferred over top-k.

In practice: temp=0.7, top_p=0.9 is a sane chat default; temp=0 for tasks with a single right answer.
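The four knobs compose into one small sampling function. A sketch in plain Python, with illustrative function names and the chat defaults quoted above; real inference stacks do this on tensors and may apply the filters in a different order.

```python
import math
import random

def softmax(logits, temperature=1.0):
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def renormalize(probs, keep):
    keep_set = set(keep)
    total = sum(probs[i] for i in keep_set)
    return [probs[i] / total if i in keep_set else 0.0 for i in range(len(probs))]

def top_k_filter(probs, k):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])

def top_p_filter(probs, p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:  # smallest prefix whose cumulative mass reaches p
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)

def sample(logits, temperature=0.7, top_p=0.9, rng=random):
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])  # greedy
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]
nucleus = top_p_filter(probs, 0.9)  # keeps the first three tokens
top2 = top_k_filter(probs, 2)       # keeps the first two tokens
```

With p=0.9 the nucleus needs three tokens here (0.5 + 0.3 + 0.15), while a sharper distribution would need fewer; that is the entropy-adaptive behaviour top-k lacks.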


8. PPO vs DPO vs ORPO vs RLHF vs RLAIF.

  • RLHF (PPO): train a reward model from preferences → use PPO to optimize policy against it. Powerful, but unstable; needs careful KL constraint to a reference model.
  • DPO (Rafailov 2023): solve the KL-constrained RLHF objective analytically for its optimal policy, then minimize a contrastive loss directly on (chosen, rejected) pairs. No reward model, no rollouts. Simpler, very competitive with PPO.
  • ORPO: combine SFT and preference loss in a single stage. Even simpler.
  • RLAIF: same loop as RLHF but the preference labels come from an LLM judge instead of humans. Cheaper, but quality bounded by judge.

Default for new projects in 2024+: DPO for stability + simplicity, then maybe PPO if you've maxed out DPO.
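For reference, the DPO loss for a single preference pair is short enough to write out. A sketch assuming the four inputs are summed sequence log-probabilities from the policy and the frozen reference model; β=0.1 is a commonly used value, not a recommendation from this document.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Implicit rewards: beta * log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) computed as a numerically stable softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)  # policy == reference
loss_better = dpo_loss(-10.0, -18.0, -12.0, -15.0)   # policy favours the chosen answer
```

When the policy equals the reference the margin is zero and the loss is log 2; it falls as the policy raises the chosen answer relative to the rejected one.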


9. LoRA: math, why memory-efficient, what r and α control.

LoRA replaces a weight update ΔW (which would be full-rank) with a low-rank decomposition: ΔW = B A where A ∈ ℝ^{r×k}, B ∈ ℝ^{d×r}, with r << d, k. Forward: y = W x + (α/r) · B (A x). Only A and B train; W is frozen.

Memory savings: instead of d×k trainable params per matrix, you train r(d+k). For r=16, d=k=4096: 16.8M → 131k, 128× fewer trainable params, so gradients and optimizer states for the adapter fit easily.

  • r: rank of the update; bigger = more capacity. r=8-32 is typical.
  • α: scaling factor; the update is multiplied by α/r, so it acts like an adapter-specific learning-rate multiplier. Common convention: α = 2r.
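The parameter arithmetic and the forward pass can be sketched with plain lists. The toy matrices below are illustrative; B starts at zero, as in the LoRA paper, so the adapter is a no-op before training.

```python
def lora_param_counts(d, k, r):
    """Trainable params for a full d x k update vs. a rank-r LoRA update."""
    return d * k, r * (d + k)

def matvec(M, x):
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, r):
    # y = W x + (alpha / r) * B (A x); W is frozen, only A and B would train.
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

full, lora = lora_param_counts(4096, 4096, 16)  # 16777216 and 131072

# Toy shapes: d=2, k=3, r=1.
W = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
A = [[0.1, 0.2, 0.3]]   # r x k
B = [[0.0], [0.0]]      # d x r, zero-initialised
x = [1.0, 2.0, 3.0]
y_init = lora_forward(W, A, B, x, alpha=2, r=1)
y_trained = lora_forward(W, A, [[1.0], [0.0]], x, alpha=2, r=1)
```

Computing B(Ax) rather than (BA)x keeps the extra work at r(d+k) multiply-adds instead of materialising a d×k matrix.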

10. QLoRA's tricks: NF4, double-quant, paged optimizers.

QLoRA = LoRA on a 4-bit quantized base model.

  • NF4 (NormalFloat 4-bit): a 4-bit datatype whose 16 quantization levels are quantiles of a standard normal, so each level covers roughly equal probability mass (pretrained weights are approximately N(0, σ²)). Information-theoretically near-optimal for normally distributed data.
  • Double quantization: the quantization scales themselves are quantized, saving another ~0.4 bits/param.
  • Paged optimizers: page Adam's state in/out of GPU memory via NVIDIA Unified Memory, avoiding OOM spikes during gradient checkpointing.

Result: 7B fits in ~6 GB; 70B in ~48 GB → fine-tunable on a single A100 80GB.
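The idea behind a normal-quantile codebook can be sketched with the standard library. This is a simplified illustration, not the actual NF4 table: the real datatype is built asymmetrically so that it contains an exact zero, and it is applied per block of weights with an absmax scale, as assumed below.

```python
from statistics import NormalDist

def normal_quantile_levels(bits=4):
    """Equal-probability quantiles of N(0, 1), rescaled to the range [-1, 1]."""
    n = 2 ** bits
    raw = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]

def quantize_block(weights, levels):
    scale = max(abs(w) for w in weights) or 1.0  # per-block absmax scale
    codes = [min(range(len(levels)), key=lambda j: abs(w / scale - levels[j]))
             for w in weights]
    return codes, scale

def dequantize_block(codes, scale, levels):
    return [levels[c] * scale for c in codes]

levels = normal_quantile_levels()
block = [0.02, -0.5, 0.31, 0.0]
codes, scale = quantize_block(block, levels)
restored = dequantize_block(codes, scale, levels)
```

The levels are dense near zero and sparse in the tails, matching where normally distributed weights actually fall.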


11. RAG: chunking strategies, hybrid search, reranking, when RAG beats fine-tuning.

  • Chunking: token-aware sliding window (e.g., 400 tokens, 80 overlap) — preserves context across boundaries. Semantic / structural splits when source has structure (markdown headers).
  • Hybrid search: BM25 (lexical) + dense embeddings, fused via Reciprocal Rank Fusion. Catches both exact-match queries (names, IDs) and paraphrase queries.
  • Reranking: cross-encoder (e.g., bge-reranker) on top-50 → top-5. Often the single biggest quality lever in RAG; cheap relative to the LLM call.
  • RAG vs fine-tune: RAG when knowledge changes / per-tenant; fine-tune for new style, format, or capabilities. Often: do both.
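Two of the pieces above, sliding-window chunking and Reciprocal Rank Fusion, are small enough to sketch directly. The constant k=60 is the value commonly used for RRF; the document IDs and function names are illustrative.

```python
def sliding_window_chunks(tokens, size=400, overlap=80):
    """Fixed-size windows where each chunk repeats the last `overlap` tokens of the previous one."""
    if not tokens:
        return []
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))

chunks = sliding_window_chunks(list(range(1000)))
bm25_hits = ["d1", "d2", "d3"]    # lexical ranking
dense_hits = ["d3", "d1", "d4"]   # embedding ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF only uses ranks, so BM25 scores and cosine similarities never need to be put on a common scale.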

12. FlashAttention: what makes it fast.

Standard attention materializes the T×T attention matrix in HBM (slow GPU memory). FlashAttention computes attention tile by tile, fusing matmul + softmax + matmul, keeping intermediates in SRAM (fast on-chip memory). Uses an online softmax algorithm so you never need the full row at once.

Result: same math, but ~2-5× faster wall-clock and memory linear in sequence length (vs quadratic); FLOPs are still quadratic. It's an exact, I/O-aware reimplementation, not an approximation.
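The online softmax trick is the part worth being able to write on a whiteboard. A single-query sketch in plain Python that processes keys and values tile by tile; it omits the 1/√dₖ scaling and everything GPU-specific, and the tile size is arbitrary.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(q, keys, values):
    scores = [dot(q, k) for k in keys]  # full score row materialised
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / z
            for j in range(len(values[0]))]

def tiled_attention(q, keys, values, tile=2):
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        scores = [dot(q, k) for k in keys[start:start + tile]]
        new_max = max(running_max, max(scores))
        correction = math.exp(running_max - new_max)  # 0.0 on the first tile
        denom *= correction                            # rescale earlier partial sums
        acc = [a * correction for a in acc]
        for s, v in zip(scores, values[start:start + tile]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]

q = [0.2, -0.4, 1.0]
keys = [[1.0, 0.0, 0.5], [0.3, 0.8, -0.2], [-1.0, 0.4, 2.0], [0.0, 0.0, 0.1], [2.0, -1.0, 0.7]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.3]]
reference = naive_attention(q, keys, values)
streamed = tiled_attention(q, keys, values, tile=2)
```

Only a running max, a running denominator, and one accumulator vector persist across tiles, which is why the full T×T matrix never has to exist.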


13. Continuous batching: vLLM's PagedAttention.

Static batching: pad all sequences to the longest, run them as a batch; the batch finishes when the slowest sequence finishes. Compute is wasted on padding, and GPU slots sit idle once short sequences finish.

Continuous batching: at each decode step, finished sequences leave and new ones enter. Requires dynamic batch shapes.

PagedAttention makes this efficient: KV cache stored in fixed-size blocks (like virtual memory pages). New requests get blocks from a free list; finished requests return blocks. No external fragmentation (waste is limited to the partly filled last block of each sequence); supports prefix sharing.

Combined effect: 2-5× throughput vs static batching at similar latency.
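The block bookkeeping can be illustrated with a toy allocator. This is a hypothetical sketch of the idea only (free list plus per-request block table); it is not vLLM's API and ignores prefix sharing and preemption.

```python
class PagedKVAllocator:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # physical block ids
        self.block_tables = {}                      # request id -> list of block ids
        self.lengths = {}                           # request id -> tokens stored

    def append_token(self, request_id):
        table = self.block_tables.setdefault(request_id, [])
        length = self.lengths.get(request_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; caller must queue or preempt")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def release(self, request_id):
        # A finished request returns all of its blocks to the free list.
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)

alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("req-a")  # 17 tokens need 2 blocks
alloc.append_token("req-b")      # 1 token needs 1 block
free_while_running = len(alloc.free_blocks)
alloc.release("req-a")
free_after_release = len(alloc.free_blocks)
```

Blocks are allocated only as a sequence grows, so memory is never reserved for a maximum length the request may not reach.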


14. Quantization: PTQ vs QAT, INT8 vs FP8 vs INT4 (AWQ/GPTQ).

  • PTQ (post-training): quantize after training, calibrate scales on a small dataset. Fast, no retraining. Default for inference.
  • QAT (during training): simulate quantization in forward pass during training. Higher quality, much more expensive.
  • INT8: weights+activations 8-bit. Solid baseline. ~2× speedup, ~negligible quality loss.
  • FP8 (E4M3 / E5M2): 8-bit float, supported on H100/H200. Better dynamic range than INT8 → more accurate at the same bits.
  • INT4 (AWQ / GPTQ): 4-bit weights, BF16 activations. ~4× memory reduction, small but measurable quality drop. AWQ uses per-channel salient-weight protection; GPTQ uses Hessian-aware error compensation.

A common modern serving stack on FP8-capable GPUs: FP8 weights + FP8 KV cache + BF16 activations.
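The core PTQ step, picking a scale and rounding, looks like this in its simplest form. A sketch of symmetric per-tensor absmax INT8 quantization; real PTQ pipelines calibrate scales on data and usually work per channel or per group.

```python
def quantize_int8(values):
    """Symmetric absmax quantization: map [-max|v|, +max|v|] onto integers [-127, 127]."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for an all-zero tensor
    codes = [max(-127, min(127, round(v / scale))) for v in values]
    return codes, scale

def dequantize_int8(codes, scale):
    return [c * scale for c in codes]

weights = [0.12, -1.27, 0.5, 0.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The round-trip error is bounded by half a quantization step (scale / 2), so a single outlier that inflates the scale hurts every other weight in the tensor; that is the problem per-channel and salient-weight schemes address.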


15. Speculative decoding: how it works and when it helps.

A small draft model generates K candidate tokens. The target model runs ONE forward pass that verifies all K in parallel (since attention can compute K logits at once). Accept the longest prefix that matches what target would have sampled (with a probabilistic check that preserves target's distribution).

Why faster: one target forward pass produces ≥1 token instead of exactly 1. With a per-token acceptance rate of ~70% and 4-5 drafted tokens, you get ~3 tokens per target call, so up to ~3× speedup on decode before subtracting the draft model's own cost.

Caveats: doesn't help prefill; requires a good draft (similar to target); gives no gain, or even a slowdown, if the draft is too slow or acceptance is too low. Variants: Medusa (multiple decoding heads on the target itself), EAGLE (drafts at the feature level, predicting the target's hidden states rather than tokens directly).
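The speedup arithmetic follows from a geometric series. A sketch using the standard analysis, which assumes each drafted token is accepted independently with the same probability; `draft_cost_ratio` (draft forward time divided by target forward time) is an illustrative parameter name.

```python
def expected_tokens_per_target_call(acceptance, draft_len):
    """Expected tokens emitted per target pass, counting the one the target always supplies."""
    if acceptance == 1.0:
        return draft_len + 1
    return (1 - acceptance ** (draft_len + 1)) / (1 - acceptance)

def estimated_speedup(acceptance, draft_len, draft_cost_ratio):
    # Each round costs one target pass plus draft_len draft passes.
    tokens = expected_tokens_per_target_call(acceptance, draft_len)
    return tokens / (draft_len * draft_cost_ratio + 1)

tokens_per_call = expected_tokens_per_target_call(0.7, 4)        # about 2.77
speedup = estimated_speedup(0.7, 4, draft_cost_ratio=0.05)       # about 2.31
slow_draft = estimated_speedup(0.3, 4, draft_cost_ratio=0.25)    # about 0.71, a slowdown
```

The last case shows the failure mode: with low acceptance and an expensive draft the ratio drops below 1, and speculative decoding costs more than plain decoding.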


16. Distributed training: DDP vs FSDP vs ZeRO vs Tensor Parallelism vs Pipeline Parallelism.

  • DDP: each GPU has full model copy; gradients all-reduced after backward. Simple; bound by per-GPU memory.
  • ZeRO (DeepSpeed) / FSDP (PyTorch): shard optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3 / FSDP) across data-parallel ranks. Communicate to gather params just-in-time during forward/backward.
  • Tensor Parallelism (Megatron): shard a single weight matrix across GPUs (column- or row-parallel). Each GPU holds a slice. Requires fast interconnect (NVLink); typically TP ≤ 8 (within-node).
  • Pipeline Parallelism: split model layers across GPUs into stages; mini-batch flows through. Memory savings linear in stages; needs micro-batching to hide bubbles.

Composition for 70B: TP=4 within node, PP=4 across nodes, DP=N replicas with FSDP sharding optimizer state.
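The memory effect of each ZeRO stage can be estimated with the accounting from the ZeRO paper. A sketch assuming mixed-precision Adam (2 bytes of half-precision params, 2 bytes of gradients, and 12 bytes of FP32 optimizer state per parameter) and ignoring activations and buffers:

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Per-GPU model-state bytes for ZeRO stage 0 (plain DDP) through 3."""
    params = 2 * num_params   # half-precision weights
    grads = 2 * num_params    # half-precision gradients
    optim = 12 * num_params   # FP32 master weights + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus     # ZeRO-1: shard optimizer states
    if stage >= 2:
        grads /= num_gpus     # ZeRO-2: also shard gradients
    if stage >= 3:
        params /= num_gpus    # ZeRO-3 / FSDP full shard: also shard parameters
    return params + grads + optim

# 7.5B parameters on 64 GPUs, in GB (1e9 bytes).
per_stage_gb = [zero_bytes_per_gpu(7.5e9, 64, s) / 1e9 for s in range(4)]
```

For this configuration the four stages come out to roughly 120, 31.4, 16.6, and 1.9 GB per GPU, which is why DDP alone cannot hold a model of that size on an 80 GB card.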


17. MoE: routing, load balancing, capacity factor.

Mixture-of-Experts: each layer has E expert FFNs. A router picks top-K (usually 2) experts per token. Only those experts compute → sparse activation, large total params, fast inference per token.

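  • Router: a linear layer producing E logits → top-K selection, with a softmax (before or after selection) giving the mixing weights for the chosen experts.
  • Load balancing: without intervention, the router collapses onto a few experts (routing collapse). An auxiliary loss penalizes imbalance (e.g., an entropy-style term, or the Switch Transformer term proportional to the sum over experts of token fraction × mean router probability).
  • Capacity factor: each expert handles at most (tokens × K / E) × C tokens; overflow tokens are dropped for that expert and pass through the residual connection only. C=1.25 typical.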

Mixtral 8x7B: 47B total params, 13B active per token. Better quality-per-active-FLOP than dense.
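Routing and capacity limits can be sketched in a few functions. This toy version takes the softmax over the selected logits and fills experts in token order; both are assumptions made for illustration, and real implementations batch this on tensors.

```python
import math

def route_token(router_logits, top_k=2):
    """Pick the top_k experts and softmax their logits into mixing weights."""
    chosen = sorted(range(len(router_logits)),
                    key=lambda e: router_logits[e], reverse=True)[:top_k]
    m = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - m) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25, top_k=1):
    return math.ceil(num_tokens * top_k / num_experts * capacity_factor)

def dispatch(token_logits, num_experts, capacity_factor=1.25, top_k=2):
    capacity = expert_capacity(len(token_logits), num_experts, capacity_factor, top_k)
    load = [0] * num_experts
    assignments, dropped = [], []
    for t, logits in enumerate(token_logits):
        for e, gate in route_token(logits, top_k):
            if load[e] < capacity:
                load[e] += 1
                assignments.append((t, e, gate))
            else:
                dropped.append((t, e))  # overflow: this expert is skipped for token t
    return assignments, dropped, load

gates = route_token([0.1, 2.0, -1.0, 1.0])
# Four tokens that all prefer experts 0 and 1: a collapsed router.
skewed = [[5.0, 4.0, 0.0, 0.0]] * 4
assignments, dropped, load = dispatch(skewed, num_experts=4, capacity_factor=1.0)
```

In the skewed example experts 0 and 1 hit capacity after two tokens and the other two tokens are dropped entirely, which is the waste the auxiliary load-balancing loss is meant to prevent.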


18. Eval contamination: detect, prevent.

Risk: benchmark questions appear in the training corpus → inflated scores.

Detection:

  • N-gram overlap: search training data for 13-gram (or longer) substrings of eval questions. The Llama / GPT-3 papers do this.
  • Embedding-similarity scan for near-duplicates.
  • Loss-based: trained models tend to have suspiciously low perplexity on memorized test items vs. fresh paraphrases.

Prevention: filter training corpus against eval suites before training; use held-out / private eval sets; run paraphrased / fresh-test variants periodically; track "dynamic" benchmarks (e.g., LiveBench).
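The n-gram overlap check is simple to prototype. A sketch that assumes lowercase whitespace tokenization and an in-memory set of training n-grams; at corpus scale you would use the model's tokenizer and a Bloom filter or suffix array instead.

```python
def ngrams(text, n):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def build_ngram_index(training_docs, n=13):
    index = set()
    for doc in training_docs:
        index |= ngrams(doc, n)
    return index

def is_contaminated(eval_item, train_index, n=13):
    """True if any n-gram of the eval item also occurs in the training data."""
    return not ngrams(eval_item, n).isdisjoint(train_index)

# Small n only so the toy strings are long enough to overlap.
train_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
index = build_ngram_index(train_docs, n=5)
leaked = is_contaminated("Question: the quick brown fox jumps over what", index, n=5)
clean = is_contaminated("what colour is the sky on a clear day", index, n=5)
```

Exact n-gram matching misses paraphrased leaks, which is why it is paired with the embedding and loss-based checks.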


19. Hallucinations: causes and reduction.

Causes: (1) training-data noise/contradictions; (2) sampling at decode, where low-probability tokens still get picked; (3) context insufficient to answer; (4) RLHF that rewards confident-sounding but wrong answers; (5) compression failure: model can't recall low-frequency facts.

Mitigations:

  • Retrieval grounding (RAG): condition on retrieved evidence; force citations.
  • Self-consistency: sample N answers, take majority — surfaces uncertainty.
  • Chain-of-verification: model drafts an answer, generates verification questions about it, answers them independently, then revises.
  • Calibration training: teach models to say "I don't know" via DPO with refusal preferences.
  • Decoding constraints: structured outputs / JSON mode; constrained-decoding for facts.
  • Eval: faithfulness metrics (RAGAS), TruthfulQA, FActScore.
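Self-consistency reduces to a majority vote plus an agreement score. A minimal sketch, assuming the sampled answers are short strings that can be compared after lowercasing and trimming; free-form answers would need a task-specific normalizer.

```python
from collections import Counter

def self_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    normalized = [a.strip().lower() for a in answers]
    winner, votes = Counter(normalized).most_common(1)[0]
    return winner, votes / len(normalized)

samples = ["Paris", "paris ", "Lyon", "Paris", "Marseille"]
answer, agreement = self_consistency(samples)
```

The agreement fraction doubles as a cheap uncertainty signal: a low value is a reasonable trigger for abstaining or falling back to retrieval.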

20. Prompt injection — defenses.

Threat: untrusted text in the model's context (a tool result, a web page, an email) contains instructions that hijack the model.

Defenses (layered, no silver bullet):

  1. Privilege separation: untrusted data goes in clearly-marked sections; the system prompt instructs the model to never follow instructions inside them.
  2. Tool sandboxing: tools authorize on the user's identity, not the model's claims. Don't let the model exfiltrate data through rendered image links (an attacker-controlled URL carrying secrets in its query string) or fetch arbitrary URLs.
  3. Output filtering: scan model output for suspicious patterns (URLs to data exfil, prompt-leak markers).
  4. Input filtering: classifier on incoming docs for obvious "ignore previous instructions" payloads (defeats only naive attacks).
  5. Human-in-the-loop for destructive actions (file deletion, money movement, sending email).
  6. Defense-in-depth assumption: assume the model will be jailbroken at some rate; design the surrounding system so a jailbreak can't cause unbounded damage.
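Defenses 2 and 5 can be combined in a single gate in front of tool execution. A hypothetical sketch: the tool names, the `ToolCall` type, and the return values are all illustrative, not from any agent framework.

```python
from dataclasses import dataclass, field

# Hypothetical set of tools considered destructive in this sketch.
DESTRUCTIVE_TOOLS = {"delete_file", "send_email", "transfer_money"}

@dataclass
class ToolCall:
    name: str
    args: dict = field(default_factory=dict)

def authorize(call, user_permissions, human_approved=False):
    """Decide from the authenticated user's permissions, never from text the model produced."""
    if call.name not in user_permissions:
        return "deny"
    if call.name in DESTRUCTIVE_TOOLS and not human_approved:
        return "needs_human_approval"
    return "allow"

permissions = {"search_docs", "send_email"}
read_only = authorize(ToolCall("search_docs", {"q": "invoice"}), permissions)
risky = authorize(ToolCall("send_email", {"to": "x@example.com"}), permissions)
approved = authorize(ToolCall("send_email", {"to": "x@example.com"}), permissions, human_approved=True)
forbidden = authorize(ToolCall("delete_file", {"path": "/tmp/a"}), permissions)
```

The gate never inspects the prompt or the model's reasoning, so an injected instruction cannot talk its way past it; the worst case is bounded by what the user was allowed to do anyway.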

Simon Willison's framing: "If you can't tolerate the worst-case behavior of an LLM with full data access, don't give an LLM full data access."