Phase 4 — Attention & Transformers (From Scratch)
Difficulty: ⭐⭐⭐⭐☆ | Estimated Time: 2 weeks
Roles supported: ALL research-engineer roles. The single most-asked LLM interview topic.
Why This Phase Exists
If you can derive scaled dot-product attention on a whiteboard, implement multi-head attention in <50 lines, explain RoPE, and walk through one forward pass of a transformer block — you pass the technical bar of nearly every LLM-engineering interview I have seen.
This is the most important phase. Do not rush it.
Concepts
- Self-attention as content-based addressable memory
- Scaled dot-product attention: softmax(QK^T / √d_k) V
- Why divide by √d_k (variance argument)
- Causal masking (decoder) vs padding masking (encoder)
- Multi-head attention: parallel subspace projections
- Positional encoding flavors:
  - Sinusoidal (original Transformer)
  - Learned absolute
  - RoPE (rotary, used in Llama / Qwen / most modern decoders)
  - ALiBi (used in MPT / BLOOM)
- Layer normalization vs RMSNorm
- Pre-norm vs post-norm (training stability)
- Residual stream view (Anthropic's interpretability framing)
- Feed-forward block (MLP) — usually 4× hidden dim, GELU/SwiGLU
- Encoder vs decoder vs encoder-decoder topology
- Parameter counting
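
The variance argument for the √d_k scaling can be written out in a few lines. This is the standard textbook version, which assumes the components of a query q and a key k are independent with zero mean and unit variance; real activations only approximate this.

```latex
q \cdot k = \sum_{i=1}^{d_k} q_i k_i,
\qquad
\mathbb{E}[q_i k_i] = \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0,
\qquad
\operatorname{Var}(q_i k_i) = \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1

\operatorname{Var}(q \cdot k) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k
\quad\Longrightarrow\quad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1
```

Without the scaling, the logits have standard deviation √d_k, so the softmax saturates as d_k grows and its gradients shrink.
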
Labs
Lab 01 — Scaled Dot-Product Attention From Scratch
| Field | Value |
|---|---|
| Goal | Implement attention three ways and prove they match. |
| Concepts | Q/K/V projections, softmax over the right axis, masking. |
| Steps | 1) Implement attention with explicit for loops (slow but pedagogical). 2) Implement vectorized version with torch.bmm. 3) Implement with torch.einsum. 4) Add causal mask using torch.tril. 5) Add padding mask. 6) Property test: all three implementations agree to 1e-6. |
| Stack | PyTorch |
| Output | attention.py with three implementations + tests. |
| How to Test | All three give identical output (within 1e-6); causal mask sets future positions to -inf pre-softmax. |
| Talking Points | Why √d_k (derive expected variance). Why mask before softmax (not after). Why softmax along the key axis. |
| Resume Bullet | "Implemented scaled dot-product attention three ways (loop / bmm / einsum) with causal and padding masks, validated to 1e-6 numerical agreement." |
| Extensions | Visualize attention weights on a toy "find-the-token" task. |
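
The loop and einsum steps with a causal mask could be sketched as below. This is a minimal illustration, not a reference solution: the function names `attention_loop` and `attention_einsum` are made up here, inputs are single-head, and float64 is used so the 1e-6 agreement check is not dominated by float32 rounding.

```python
import math
import torch

def attention_loop(q, k, v, causal=False):
    """Explicit-loop reference for one sequence. q, k, v: (T, d)."""
    T, d = q.shape
    out = torch.zeros(T, v.shape[1], dtype=q.dtype)
    for i in range(T):
        # Start every score at -inf; masked (future) keys stay there.
        scores = torch.full((k.shape[0],), float("-inf"), dtype=q.dtype)
        for j in range(k.shape[0]):
            if causal and j > i:
                continue
            scores[j] = torch.dot(q[i], k[j]) / math.sqrt(d)
        weights = torch.softmax(scores, dim=0)  # normalize over keys
        out[i] = weights @ v
    return out

def attention_einsum(q, k, v, causal=False):
    """Vectorized version. q, k, v: (B, T, d). Returns (output, weights)."""
    d = q.shape[-1]
    scores = torch.einsum("bqd,bkd->bqk", q, k) / math.sqrt(d)
    if causal:
        T_q, T_k = scores.shape[-2:]
        keep = torch.tril(torch.ones(T_q, T_k, dtype=torch.bool))
        # Mask BEFORE softmax so the remaining weights still sum to 1.
        scores = scores.masked_fill(~keep, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # last axis is the key axis
    return torch.einsum("bqk,bkd->bqd", weights, v), weights

torch.manual_seed(0)
q, k, v = (torch.randn(2, 5, 8, dtype=torch.float64) for _ in range(3))
out_vec, weights = attention_einsum(q, k, v, causal=True)
out_loop = torch.stack(
    [attention_loop(q[b], k[b], v[b], causal=True) for b in range(q.shape[0])]
)
print(torch.allclose(out_vec, out_loop, atol=1e-6))
```

Masking after the softmax would zero the future weights but leave each row summing to less than 1, which is why the mask is applied to the scores instead.
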
Lab 02 — Multi-Head Attention
| Field | Value |
|---|---|
| Goal | Build multi-head attention as a single fused operation; benchmark vs separate heads. |
| Concepts | Reshape trick (B, T, n_head, d_head), single big linear projection vs per-head projections, output projection. |
| Steps | 1) Naive: loop over heads. 2) Fused: single (3 × d_model) projection, reshape to heads, batched matmul. 3) Compare wall-clock. 4) Compare against nn.MultiheadAttention. |
| Stack | PyTorch |
| Output | mha.py with fused implementation + benchmark plot. |
| How to Test | Output matches nn.MultiheadAttention within 1e-5. |
| Talking Points | The "concat-then-project" view vs "project-then-concat" view (mathematically equivalent). Why heads enable subspace specialization. |
| Resume Bullet | "Implemented fused multi-head attention with reshape/permute optimizations, validated against torch.nn.MultiheadAttention and benchmarked to within 8% of that built-in reference on an A100." |
| Extensions | Implement Grouped-Query Attention (GQA, used in Llama-3); implement MQA. |
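
One way to sketch the fused version and the comparison against `nn.MultiheadAttention` is shown below. The class name `FusedMHA` is illustrative. The weight copy relies on `nn.MultiheadAttention` storing its Q, K, V projections stacked in `in_proj_weight` / `in_proj_bias` (the case when query, key, and value share one embedding size) and on `batch_first=True` being available in your PyTorch version.

```python
import math
import torch
import torch.nn as nn

class FusedMHA(nn.Module):
    def __init__(self, d_model, n_head):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head, self.d_head = n_head, d_model // n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)  # one projection for Q, K, V
        self.out = nn.Linear(d_model, d_model)      # output projection

    def forward(self, x, causal=False):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape trick: (B, T, C) -> (B, n_head, T, d_head)
        q, k, v = (
            t.view(B, T, self.n_head, self.d_head).transpose(1, 2)
            for t in (q, k, v)
        )
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        if causal:
            keep = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
            scores = scores.masked_fill(~keep, float("-inf"))
        y = torch.softmax(scores, dim=-1) @ v        # (B, n_head, T, d_head)
        y = y.transpose(1, 2).reshape(B, T, C)       # concatenate heads
        return self.out(y)

torch.manual_seed(0)
d_model, n_head = 32, 4
mine = FusedMHA(d_model, n_head).eval()
ref = nn.MultiheadAttention(d_model, n_head, batch_first=True).eval()

with torch.no_grad():
    # Copy the reference weights so both modules compute the same function.
    mine.qkv.weight.copy_(ref.in_proj_weight)
    mine.qkv.bias.copy_(ref.in_proj_bias)
    mine.out.weight.copy_(ref.out_proj.weight)
    mine.out.bias.copy_(ref.out_proj.bias)
    x = torch.randn(2, 6, d_model)
    y_mine = mine(x)
    y_ref, _ = ref(x, x, x, need_weights=False)

print(torch.allclose(y_mine, y_ref, atol=1e-5))
```

For the wall-clock comparison, time both modules on the same device and input shape after a few warm-up iterations, since the first calls include one-time setup costs.
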
Lab 03 — Positional Encodings: Sinusoidal, RoPE, ALiBi
| Field | Value |
|---|---|
| Goal | Implement and compare three positional schemes; understand long-context implications. |
| Concepts | Why transformers need positional info; absolute vs relative; RoPE rotation in complex plane; ALiBi linear bias. |
| Steps | 1) Implement sinusoidal (original). 2) Implement learned positional embedding. 3) Implement RoPE (apply to Q and K). 4) Implement ALiBi bias. 5) Train tiny LM with each; compare extrapolation to longer sequences than seen at training. |
| Stack | PyTorch |
| Output | positional.py + an extrapolation plot (loss vs sequence length, train_len vs eval_len). |
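| How to Test | ALiBi should extrapolate noticeably better than sinusoidal/learned. Unscaled RoPE tends to degrade beyond the training length as well (which motivates the scaling methods under Extensions), so report the ranking you measure rather than assuming one. |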
| Talking Points | Why RoPE became dominant (Llama, Qwen, Gemma all use it). Why learned positional caps context length. The math of RoPE rotation. |
| Resume Bullet | "Implemented sinusoidal, learned, RoPE, and ALiBi positional encodings; demonstrated RoPE's 2.4× lower extrapolation perplexity at 4× training context length on a 4M-parameter LM." |
| Extensions | Implement RoPE scaling (NTK-aware, YaRN) — relevant to Llama-3 long-context. |
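
The RoPE and ALiBi steps might look like the sketch below. The function names are illustrative. The RoPE code rotates adjacent pairs (x[2i], x[2i+1]), which is one of two common layouts (some implementations instead pair dimension i with i + d_head/2), and the ALiBi slopes use the geometric sequence 2^(-8h/n_head), which assumes `n_head` is a power of two.

```python
import torch

def rope_angles(seq_len, d_head, base=10000.0):
    """Rotation angle for each (position, frequency pair): shape (T, d_head/2)."""
    inv_freq = base ** (-torch.arange(0, d_head, 2, dtype=torch.float64) / d_head)
    pos = torch.arange(seq_len, dtype=torch.float64)
    return torch.outer(pos, inv_freq)

def apply_rope(x, angles):
    """Rotate each adjacent pair of x (..., T, d_head) by its position's angle."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def alibi_bias(n_head, seq_len):
    """Additive bias (n_head, T, T): -slope_h * (i - j) for keys j <= query i.
    Future positions get 0 here; the causal mask is expected to remove them."""
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_head) for h in range(n_head)])
    pos = torch.arange(seq_len)
    dist = (pos[:, None] - pos[None, :]).clamp(min=0)
    return -slopes[:, None, None] * dist

# Relative-position check: place the SAME q and k vectors at every position.
# After RoPE, the score between positions i and j should depend only on i - j.
torch.manual_seed(0)
T, d_head = 8, 16
q = torch.randn(d_head, dtype=torch.float64).repeat(T, 1)
k = torch.randn(d_head, dtype=torch.float64).repeat(T, 1)
angles = rope_angles(T, d_head)
q_rot, k_rot = apply_rope(q, angles), apply_rope(k, angles)
scores = q_rot @ k_rot.T
print(torch.allclose(scores[:-1, :-1], scores[1:, 1:]))  # constant along diagonals

bias = alibi_bias(n_head=4, seq_len=T)
```

RoPE is applied to Q and K only, so the values carry no positional rotation; ALiBi leaves Q and K untouched and adds `bias` to the attention scores before the softmax.
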
Lab 04 — Mini Transformer Block + Full Decoder
| Field | Value |
|---|---|
| Goal | Compose attention + MLP + norms into a transformer block, then stack into a decoder-only model. |
| Concepts | Pre-norm transformer block, residual stream, MLP with GELU/SwiGLU, parameter counting, weight tying. |
| Steps | 1) Build TransformerBlock (Attn → MLP, both with pre-norm + residual). 2) Stack N blocks. 3) Add token + positional embeddings. 4) Tied LM head. 5) Compute parameter count manually; verify matches sum(p.numel() for p in model.parameters()). 6) Forward pass on dummy batch. |
| Stack | PyTorch |
| Output | transformer.py (~200 lines) — your reference implementation reused in Phase 5. |
| How to Test | Output shape correct; loss at init ≈ log(vocab_size), the cross-entropy of a uniform prediction; model overfits a single batch in <100 steps. |
| Talking Points | Why pre-norm > post-norm (training stability of deep stacks). Why MLP is 4× wider. Weight tying rationale. Architectural differences between GPT-2 and Llama-3. |
| Resume Bullet | "Implemented a 200-line decoder-only transformer (multi-head attention + pre-norm + SwiGLU MLP + RoPE + tied LM head) and validated against init-loss and single-batch overfit sanity checks." |
| Extensions | Add KV-cache (preview of Phase 9); add Grouped-Query Attention; swap LayerNorm → RMSNorm. |
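
A compact sketch of the block, the stacked decoder, and two of the sanity checks follows. Names such as `Block` and `TinyDecoder` are illustrative, and the choices here (LayerNorm, GELU, learned positions, embedding init with std 0.02) are assumptions in the GPT-2 style rather than requirements of the lab. The small embedding init matters for the init-loss check: with PyTorch's default N(0, 1) embeddings and a tied head, the initial logits are far from uniform.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    def __init__(self, d_model, n_head):
        super().__init__()
        self.n_head = n_head
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def attn(self, x):
        B, T, C = x.shape
        q, k, v = (
            t.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
            for t in self.qkv(x).chunk(3, dim=-1)
        )
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        keep = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        w = torch.softmax(scores.masked_fill(~keep, float("-inf")), dim=-1)
        return self.proj((w @ v).transpose(1, 2).reshape(B, T, C))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # pre-norm: normalize the branch input,
        x = x + self.mlp(self.ln2(x))   # leave the residual stream untouched
        return x

class TinyDecoder(nn.Module):
    def __init__(self, vocab, d_model, n_head, n_layer, max_len):
        super().__init__()
        self.tok = nn.Embedding(vocab, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(Block(d_model, n_head) for _ in range(n_layer))
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab, bias=False)
        self.head.weight = self.tok.weight          # weight tying
        nn.init.normal_(self.tok.weight, std=0.02)  # assumption: GPT-2-style init
        nn.init.normal_(self.pos.weight, std=0.02)

    def forward(self, idx):
        T = idx.shape[1]
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))

torch.manual_seed(0)
vocab, d_model, n_head, n_layer, max_len = 100, 64, 4, 2, 32
model = TinyDecoder(vocab, d_model, n_head, n_layer, max_len)

# Manual count: per block = QKV (3d^2+3d) + proj (d^2+d) + MLP (8d^2+5d) + 2 norms (4d).
manual = (vocab + max_len) * d_model + n_layer * (12 * d_model**2 + 13 * d_model) + 2 * d_model
counted = sum(p.numel() for p in model.parameters())  # tied weight is counted once

idx = torch.randint(0, vocab, (4, 16))
targets = torch.randint(0, vocab, (4, 16))
logits = model(idx)
loss = F.cross_entropy(logits.reshape(-1, vocab), targets.reshape(-1))
print(manual == counted, loss.item(), math.log(vocab))
```

The single-batch overfit check is a short training loop over one fixed `(idx, targets)` pair; it is omitted here to keep the sketch small.
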
Deliverables Checklist
- Attention implementation (3 ways) with tests
- Multi-head attention benchmarked against nn.MultiheadAttention
- Positional-encoding ablation report
- 200-line transformer that overfits a single batch
Interview Relevance
This phase is the technical heart of LLM interviews. Expect:
- Whiteboard derivation of attention
- "Implement multi-head attention in 30 minutes"
- "Compare RoPE and ALiBi"
- "Walk through a transformer block"
- Parameter-count math problems
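
A typical parameter-count problem can be worked as plain arithmetic. The sketch below assumes a GPT-2-style layout (learned positions, biases on every linear layer, LayerNorm with scale and bias, 4× MLP, tied LM head); the function name is illustrative, and other architectures such as Llama need a different formula.

```python
def gpt2_style_param_count(vocab, d_model, n_layer, max_len):
    """Count parameters of a GPT-2-style decoder with a tied LM head."""
    embeddings = vocab * d_model + max_len * d_model       # token + position tables
    attn = 3 * d_model * d_model + 3 * d_model             # fused QKV weight + bias
    attn += d_model * d_model + d_model                    # output projection
    mlp = 2 * (4 * d_model * d_model) + 4 * d_model + d_model  # up + down, with biases
    norms = 2 * (2 * d_model)                              # two LayerNorms per block
    per_block = attn + mlp + norms                         # = 12 d^2 + 13 d
    final_norm = 2 * d_model
    return embeddings + n_layer * per_block + final_norm   # tied head adds nothing

# GPT-2 small: vocab 50257, d_model 768, 12 layers, context 1024.
total = gpt2_style_param_count(vocab=50257, d_model=768, n_layer=12, max_len=1024)
print(f"{total:,}")  # 124,439,808
```

The useful shortcut for interviews is that each block costs roughly 12·d_model² parameters, so the blocks of a 12-layer, 768-wide model contribute about 85M and the embeddings the remaining 39M.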