Phase 4 — Attention & Transformers (From Scratch)

Difficulty: ⭐⭐⭐⭐☆ | Estimated Time: 2 weeks | Roles supported: ALL research-engineer roles. The single most-asked LLM interview topic.


Why This Phase Exists

If you can derive scaled dot-product attention on a whiteboard, implement multi-head attention in <50 lines, explain RoPE, and walk through one forward pass of a transformer block — you pass the technical bar of nearly every LLM-engineering interview I have seen.

This is the most important phase. Do not rush it.


Concepts

  • Self-attention as content-based addressable memory
  • Scaled dot-product attention: softmax(QK^T / √d_k) V
  • Why divide by √d_k (variance argument)
  • Causal masking (decoder self-attention) vs padding masking (any batch with variable-length sequences, encoder or decoder)
  • Multi-head attention: parallel subspace projections
  • Positional encoding flavors:
    • Sinusoidal (original Transformer)
    • Learned absolute
    • RoPE (rotary, used in Llama / Qwen / most modern decoders)
    • ALiBi (used in MPT / BLOOM)
  • Layer normalization vs RMSNorm
  • Pre-norm vs post-norm (training stability)
  • Residual stream view (Anthropic's interpretability framing)
  • Feed-forward block (MLP): usually 4× hidden dim with GELU; SwiGLU variants typically use about 8/3× so the three-matrix block keeps a comparable parameter count
  • Encoder vs decoder vs encoder-decoder topology
  • Parameter counting
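
The variance argument behind the √d_k factor can be written out as a short derivation. It is a sketch under one assumption that is not stated in the list above: the components of q and k are independent, with zero mean and unit variance, which holds only approximately at initialization.

```latex
\begin{aligned}
q \cdot k &= \sum_{i=1}^{d_k} q_i k_i,
  \qquad \mathbb{E}[q_i] = \mathbb{E}[k_i] = 0,
  \quad \operatorname{Var}(q_i) = \operatorname{Var}(k_i) = 1 \\
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0 \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] - \mathbb{E}[q_i k_i]^2 = 1 \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k \\
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) &= \frac{d_k}{d_k} = 1
\end{aligned}
```

Without the scaling, the standard deviation of the logits grows like √d_k, which pushes softmax toward a near one-hot output where gradients are small.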

Labs

Lab 01 — Scaled Dot-Product Attention From Scratch

| Field | Value |
| --- | --- |
| Goal | Implement attention three ways and prove they match. |
| Concepts | Q/K/V projections, softmax over the right axis, masking. |
| Steps | 1) Implement attention with explicit for loops (slow but pedagogical). 2) Implement vectorized version with torch.bmm. 3) Implement with torch.einsum. 4) Add causal mask using torch.tril. 5) Add padding mask. 6) Property test: all three implementations agree to 1e-6. |
| Stack | PyTorch |
| Output | attention.py with three implementations + tests. |
| How to Test | All three give identical output (within 1e-6); causal mask sets future positions to -inf pre-softmax. |
| Talking Points | Why √d_k (derive expected variance). Why mask before softmax (not after). Why softmax along the key axis. |
| Resume Bullet | "Implemented scaled dot-product attention three ways (loop / bmm / einsum) with causal and padding masks, validated to 1e-6 numerical agreement." |
| Extensions | Visualize attention weights on a toy "find-the-token" task. |
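
The steps for this lab can be sketched roughly as follows. This is a minimal illustration, not a reference solution: the function names are hypothetical, the inputs are assumed to be already-projected Q, K, V of shape (B, T, d), the padding mask from step 5 is left out, and float64 inputs are used so the 1e-6 comparison has plenty of headroom.

```python
import math
import torch


def attention_loop(q, k, v, causal=False):
    """Explicit Python loops: slow, but every step is visible."""
    B, T, d = q.shape
    out = torch.zeros(B, T, v.shape[-1], dtype=q.dtype)
    for b in range(B):
        for i in range(T):
            # With a causal mask, query i may only see keys 0..i.
            limit = i + 1 if causal else T
            scores = torch.stack([q[b, i] @ k[b, j] for j in range(limit)])
            weights = torch.softmax(scores / math.sqrt(d), dim=0)
            out[b, i] = weights @ v[b, :limit]
    return out


def masked_softmax(scores, causal):
    """Softmax over the key axis; future keys get -inf BEFORE normalizing."""
    if causal:
        T = scores.shape[-1]
        keep = torch.tril(torch.ones(T, T, dtype=torch.bool))
        scores = scores.masked_fill(~keep, float("-inf"))
    return torch.softmax(scores, dim=-1)


def attention_bmm(q, k, v, causal=False):
    scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(q.shape[-1])  # (B, T, T)
    return torch.bmm(masked_softmax(scores, causal), v)


def attention_einsum(q, k, v, causal=False):
    scores = torch.einsum("bqd,bkd->bqk", q, k) / math.sqrt(q.shape[-1])
    return torch.einsum("bqk,bkd->bqd", masked_softmax(scores, causal), v)


torch.manual_seed(0)
q, k, v = (torch.randn(2, 5, 8, dtype=torch.float64) for _ in range(3))
for causal in (False, True):
    ref = attention_loop(q, k, v, causal)
    print(causal,
          torch.allclose(ref, attention_bmm(q, k, v, causal), atol=1e-6),
          torch.allclose(ref, attention_einsum(q, k, v, causal), atol=1e-6))
```

Masking before the softmax keeps each row of weights summing to 1 over the visible keys; zeroing weights after the softmax would leave rows that no longer normalize.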

Lab 02 — Multi-Head Attention

| Field | Value |
| --- | --- |
| Goal | Build multi-head attention as a single fused operation; benchmark vs separate heads. |
| Concepts | Reshape trick (B, T, n_head, d_head), single big linear projection vs per-head projections, output projection. |
| Steps | 1) Naive: loop over heads. 2) Fused: single (3 × d_model) projection, reshape to heads, batched matmul. 3) Compare wall-clock. 4) Compare against nn.MultiheadAttention. |
| Stack | PyTorch |
| Output | mha.py with fused implementation + benchmark plot. |
| How to Test | With weights copied from the reference module, output matches nn.MultiheadAttention within 1e-5. |
| Talking Points | The "concat-then-project" view vs "project-then-concat" view (mathematically equivalent). Why heads enable subspace specialization. |
| Resume Bullet | "Implemented fused multi-head attention with reshape/permute optimizations, validated against torch.nn.MultiheadAttention and benchmarked to within 8% of the cuDNN-backed reference on an A100." |
| Extensions | Implement Grouped-Query Attention (GQA, used in Llama-3); implement MQA. |
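
The fused projection and the comparison against nn.MultiheadAttention might look something like the sketch below. FusedMHA and copy_weights_from are hypothetical names chosen for illustration; the sketch assumes self-attention with no mask and no dropout, and it relies on nn.MultiheadAttention storing its Q, K, V projections stacked in that order in in_proj_weight.

```python
import math
import torch
import torch.nn as nn


class FusedMHA(nn.Module):
    """Self-attention with one (d_model -> 3*d_model) projection for Q, K, V."""

    def __init__(self, d_model, n_head):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.d_head = d_model // n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape trick: (B, T, C) -> (B, n_head, T, d_head)
        q, k, v = (t.view(B, T, self.n_head, self.d_head).transpose(1, 2)
                   for t in (q, k, v))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)
        ctx = torch.softmax(scores, dim=-1) @ v            # (B, n_head, T, d_head)
        ctx = ctx.transpose(1, 2).reshape(B, T, C)         # concatenate heads
        return self.out(ctx)


def copy_weights_from(ref, mine):
    """Copy parameters so both modules compute the same function."""
    with torch.no_grad():
        mine.qkv.weight.copy_(ref.in_proj_weight)   # rows stacked as [Q; K; V]
        mine.qkv.bias.copy_(ref.in_proj_bias)
        mine.out.weight.copy_(ref.out_proj.weight)
        mine.out.bias.copy_(ref.out_proj.bias)


torch.manual_seed(0)
ref = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)
mine = FusedMHA(d_model=32, n_head=4)
copy_weights_from(ref, mine)

x = torch.randn(2, 6, 32)
with torch.no_grad():
    expected, _ = ref(x, x, x, need_weights=False)
    actual = mine(x)
print(torch.allclose(actual, expected, atol=1e-5))
```

Without the weight copy the two modules start from different random parameters, so the 1e-5 check only makes sense once the parameters are shared.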

Lab 03 — Positional Encodings: Sinusoidal, RoPE, ALiBi

| Field | Value |
| --- | --- |
| Goal | Implement and compare positional schemes; understand long-context implications. |
| Concepts | Why transformers need positional info; absolute vs relative; RoPE rotation in complex plane; ALiBi linear bias. |
| Steps | 1) Implement sinusoidal (original). 2) Implement learned positional embedding. 3) Implement RoPE (apply to Q and K). 4) Implement ALiBi bias. 5) Train tiny LM with each; compare extrapolation to longer sequences than seen at training. |
| Stack | PyTorch |
| Output | positional.py + an extrapolation plot (loss vs sequence length, train_len vs eval_len). |
| How to Test | ALiBi should extrapolate noticeably better than sinusoidal/learned. Unscaled RoPE typically holds up only slightly past the training length before degrading, which is the motivation for the scaling methods under Extensions. |
| Talking Points | Why RoPE became dominant (Llama, Qwen, Gemma all use it). Why learned positional caps context length. The math of RoPE rotation. |
| Resume Bullet | "Implemented sinusoidal, learned, RoPE, and ALiBi positional encodings; demonstrated RoPE's 2.4× lower extrapolation perplexity at 4× training context length on a 4M-parameter LM." |
| Extensions | Implement RoPE scaling (NTK-aware, YaRN), relevant to Llama-3 long-context. |
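
As a starting point for steps 3 and 4, RoPE and the ALiBi bias could be sketched as below. The function names are hypothetical. The RoPE version pairs adjacent dimensions (2i, 2i+1) and treats each pair as a complex number, which is one of two common layouts; the ALiBi slope formula used here assumes the number of heads is a power of two.

```python
import torch


def rope_frequencies(d_head, max_len, base=10000.0, dtype=torch.float32):
    """Unit complex numbers e^{i * pos * theta_j}, shape (max_len, d_head // 2)."""
    inv_freq = 1.0 / (base ** (torch.arange(0, d_head, 2, dtype=dtype) / d_head))
    angles = torch.outer(torch.arange(max_len, dtype=dtype), inv_freq)
    return torch.polar(torch.ones_like(angles), angles)


def apply_rope(x, freqs, offset=0):
    """Rotate x of shape (..., T, d_head); x and freqs should share a precision."""
    T = x.shape[-2]
    pairs = torch.view_as_complex(x.reshape(*x.shape[:-1], -1, 2))
    rotated = pairs * freqs[offset:offset + T]      # complex multiply = 2D rotation
    return torch.view_as_real(rotated).flatten(-2)


def alibi_bias(n_head, T):
    """Per-head linear penalty on query-key distance, shape (n_head, T, T)."""
    # Geometric slopes; this closed form assumes n_head is a power of two.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_head) for h in range(n_head)])
    pos = torch.arange(T)
    distance = (pos[:, None] - pos[None, :]).clamp(min=0)  # i - j for past keys
    return -slopes[:, None, None] * distance  # add to scores before the causal mask


torch.manual_seed(0)
freqs = rope_frequencies(d_head=8, max_len=32, dtype=torch.float64)
q = torch.randn(8, dtype=torch.float64)
k = torch.randn(8, dtype=torch.float64)


def score(m, n):
    """Dot product of q placed at position m with k placed at position n."""
    return (apply_rope(q[None], freqs, m) * apply_rope(k[None], freqs, n)).sum()


# The score depends only on the offset m - n, not on absolute positions.
print(torch.allclose(score(3, 1), score(10, 8)))
```

The final check is the property worth explaining in an interview: rotating Q and K by position-dependent angles makes their dot product a function of relative position only.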

Lab 04 — Mini Transformer Block + Full Decoder

| Field | Value |
| --- | --- |
| Goal | Compose attention + MLP + norms into a transformer block, then stack into a decoder-only model. |
| Concepts | Pre-norm transformer block, residual stream, MLP with GELU/SwiGLU, parameter counting, weight tying. |
| Steps | 1) Build TransformerBlock (Attn → MLP, both with pre-norm + residual). 2) Stack N blocks. 3) Add token + positional embeddings. 4) Tied LM head. 5) Compute parameter count manually; verify matches sum(p.numel() for p in model.parameters()). 6) Forward pass on dummy batch. |
| Stack | PyTorch |
| Output | transformer.py (~200 lines), your reference implementation reused in Phase 5. |
| How to Test | Output shape correct; loss at init ≈ log(vocab_size), the loss of a uniform prediction; model overfits a single batch in <100 steps. |
| Talking Points | Why pre-norm > post-norm (training stability of deep stacks). Why MLP is 4× wider. Weight tying rationale. Anatomy of GPT-2 vs Llama-3 differences. |
| Resume Bullet | "Implemented a 200-line decoder-only transformer (multi-head attention + pre-norm + SwiGLU MLP + RoPE + tied LM head) and validated against init-loss and single-batch overfit sanity checks." |
| Extensions | Add KV-cache (preview of Phase 9); add Grouped-Query Attention; swap LayerNorm → RMSNorm. |
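
A compact version of the block, the stacked decoder, and the parameter-count and init-loss checks might look like this. It is a sketch under stated assumptions rather than the lab's reference solution: the class names are hypothetical, it uses a GPT-2-style layout (LayerNorm, GELU MLP, learned positions, biases everywhere), and it initializes weights with std 0.02, since the init-loss check does not hold with PyTorch's default unit-variance embedding init once the head is tied.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class Block(nn.Module):
    def __init__(self, d_model, n_head):
        super().__init__()
        self.n_head = n_head
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def attend(self, x):
        B, T, C = x.shape
        q, k, v = (t.view(B, T, self.n_head, -1).transpose(1, 2)
                   for t in self.qkv(x).chunk(3, dim=-1))
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        keep = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~keep, float("-inf"))
        ctx = torch.softmax(scores, dim=-1) @ v
        return self.proj(ctx.transpose(1, 2).reshape(B, T, C))

    def forward(self, x):
        # Pre-norm: normalize each branch's input; the residual stream stays raw.
        x = x + self.attend(self.ln1(x))
        return x + self.mlp(self.ln2(x))


class TinyDecoder(nn.Module):
    def __init__(self, vocab_size, block_size, d_model, n_head, n_layer):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(block_size, d_model)
        self.blocks = nn.ModuleList([Block(d_model, n_head) for _ in range(n_layer)])
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        self.head.weight = self.tok.weight  # weight tying: one shared matrix
        self.apply(self._init)

    @staticmethod
    def _init(m):
        if isinstance(m, (nn.Linear, nn.Embedding)):
            nn.init.normal_(m.weight, std=0.02)  # assumed GPT-2-style init
        if isinstance(m, nn.Linear) and m.bias is not None:
            nn.init.zeros_(m.bias)

    def forward(self, idx):
        T = idx.shape[1]
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))


def expected_params(vocab_size, block_size, d_model, n_layer):
    """Manual count for this layout; the tied head adds nothing."""
    per_block = 12 * d_model ** 2 + 13 * d_model
    return (vocab_size + block_size) * d_model + n_layer * per_block + 2 * d_model


torch.manual_seed(0)
V, T, D = 100, 16, 64
model = TinyDecoder(V, T, D, n_head=4, n_layer=2)
print(sum(p.numel() for p in model.parameters()) == expected_params(V, T, D, 2))

idx = torch.randint(0, V, (4, T))
targets = torch.randint(0, V, (4, T))
loss = F.cross_entropy(model(idx).view(-1, V), targets.view(-1))
print(loss.item(), math.log(V))  # the two values should be close at init
```

Note that model.parameters() yields the shared embedding/head matrix only once, which is why the manual count omits a separate head term.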

Deliverables Checklist

  • Attention implementation (3 ways) with tests
  • Multi-head attention benchmarked against nn.MultiheadAttention
  • Positional-encoding ablation report
  • 200-line transformer that overfits a single batch

Interview Relevance

This phase is the technical heart of LLM interviews. Expect:

  • Whiteboard derivation of attention
  • "Implement multi-head attention in 30 minutes"
  • "Compare RoPE and ALiBi"
  • "Walk through a transformer block"
  • Parameter-count math problems
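
For the parameter-count problems, it helps to have the arithmetic written down once. The helper below is a hypothetical illustration for a GPT-2-style decoder (learned positions, LayerNorm, 4× GELU MLP, biases on every linear layer, tied LM head); other architectures such as Llama change several of these terms.

```python
def decoder_param_count(vocab_size, context_len, d_model, n_layer, tied_head=True):
    """Parameter count for a GPT-2-style decoder-only transformer."""
    embed = vocab_size * d_model + context_len * d_model  # token + position tables
    attn = 4 * d_model ** 2 + 4 * d_model   # Q, K, V, output projections + biases
    mlp = 8 * d_model ** 2 + 5 * d_model    # d -> 4d -> d, with biases 4d and d
    norms = 2 * (2 * d_model)               # two LayerNorms per block (scale + shift)
    final_norm = 2 * d_model
    head = 0 if tied_head else vocab_size * d_model
    return embed + n_layer * (attn + mlp + norms) + final_norm + head


# GPT-2 small configuration: 50257 tokens, 1024 positions, d_model 768, 12 layers.
print(decoder_param_count(50257, 1024, 768, 12))  # 124439808
```

A useful rule of thumb falls out of the per-block term: ignoring biases and norms, each block contributes about 12 × d_model² parameters, split 4 × d_model² for attention and 8 × d_model² for the MLP.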