Phase 4 — Attention & Transformers (From Scratch)

Difficulty: ⭐⭐⭐⭐☆ | Estimated Time: 2 weeks | Roles supported: ALL research-engineer roles. The single most-asked LLM interview topic.

Why This Phase Exists

If you can derive scaled dot-product attention on a whiteboard, implement multi-head attention in <50 lines, explain RoPE, and walk through one forward pass of a transformer block, you pass the technical bar of nearly every LLM-engineering interview I have seen.

This is the most important phase. Do not rush it.

Concepts

Self-attention as content-based addressable memory
Scaled dot-product attention: softmax(QK^T / √d_k) V
Why divide by √d_k (variance argument)
Causal masking (decoder self-attention) vs padding masking (any batch with variable-length sequences)
Multi-head attention: parallel subspace projections
Positional encoding flavors:
- Sinusoidal (original Transformer)
- Learned absolute
- RoPE (rotary, used in Llama / Qwen / most modern decoders)
- ALiBi (used in MPT / BLOOM)
Layer normalization vs RMSNorm
Pre-norm vs post-norm (training stability)
Residual stream view (Anthropic's interpretability framing)
Feed-forward block (MLP): usually 4× hidden dim with GELU, or roughly 8/3× with SwiGLU to keep the parameter count comparable
Encoder vs decoder vs encoder-decoder topology
Parameter counting
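
The variance argument behind the √d_k scaling can be written out in a few lines. This is a sketch under a simplifying assumption that is not stated in the list above: the components of q and k are independent, with zero mean and unit variance.

```latex
\begin{aligned}
q \cdot k &= \sum_{i=1}^{d_k} q_i k_i, \\
\mathbb{E}[q_i k_i] &= \mathbb{E}[q_i]\,\mathbb{E}[k_i] = 0, \\
\operatorname{Var}(q_i k_i) &= \mathbb{E}[q_i^2]\,\mathbb{E}[k_i^2] = 1, \\
\operatorname{Var}(q \cdot k) &= \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k
\quad\Longrightarrow\quad
\operatorname{Var}\!\left(\frac{q \cdot k}{\sqrt{d_k}}\right) = 1 .
\end{aligned}
```

Without the scaling, the logits have standard deviation √d_k, so the softmax saturates as d_k grows and its gradients shrink.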

Labs

Lab 01 — Scaled Dot-Product Attention From Scratch

Field	Value
Goal	Implement attention three ways and prove they match.
Concepts	Q/K/V projections, softmax over the right axis, masking.
Steps	1) Implement attention with explicit `for` loops (slow but pedagogical). 2) Implement vectorized version with `torch.bmm`. 3) Implement with `torch.einsum`. 4) Add causal mask using `torch.tril`. 5) Add padding mask. 6) Property test: all three implementations agree to 1e-6.
Stack	PyTorch
Output	`attention.py` with three implementations + tests.
How to Test	All three give identical output (within 1e-6); causal mask sets future positions to -inf pre-softmax.
Talking Points	Why √d_k (derive expected variance). Why mask before softmax (not after). Why softmax along the key axis.
Resume Bullet	"Implemented scaled dot-product attention three ways (loop / bmm / einsum) with causal and padding masks, validated to 1e-6 numerical agreement."
Extensions	Visualize attention weights on a toy "find-the-token" task.
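
Steps 1 to 4 of this lab can be sketched roughly as follows. This is a minimal sketch, not the lab solution: the function names are illustrative, inputs are assumed to be already-projected `(B, T, d)` tensors on CPU with equal query and key length, and the padding mask from step 5 is left out.

```python
import math
import torch


def causal_keep_mask(T):
    """Boolean (T, T) mask: True where query i may attend to key j (j <= i)."""
    return torch.tril(torch.ones(T, T, dtype=torch.bool))


def attention_loop(q, k, v, causal=False):
    """Reference version with explicit loops over batch, query, and key."""
    B, T, d = q.shape
    out = torch.zeros(B, T, v.shape[-1], dtype=q.dtype)
    for b in range(B):
        for i in range(T):
            # Start every score at -inf so skipped (future) keys get zero weight.
            scores = torch.full((T,), float("-inf"), dtype=q.dtype)
            for j in range(T):
                if causal and j > i:
                    continue
                scores[j] = torch.dot(q[b, i], k[b, j]) / math.sqrt(d)
            weights = torch.softmax(scores, dim=0)  # normalize over keys
            out[b, i] = weights @ v[b]
    return out


def attention_bmm(q, k, v, causal=False):
    """Vectorized version using batched matrix multiplication."""
    d = q.shape[-1]
    scores = torch.bmm(q, k.transpose(1, 2)) / math.sqrt(d)  # (B, T, T)
    if causal:
        keep = causal_keep_mask(scores.shape[-1])
        scores = scores.masked_fill(~keep, float("-inf"))  # mask before softmax
    return torch.bmm(torch.softmax(scores, dim=-1), v)


def attention_einsum(q, k, v, causal=False):
    """Same computation written with einsum; i indexes queries, j indexes keys."""
    d = q.shape[-1]
    scores = torch.einsum("bid,bjd->bij", q, k) / math.sqrt(d)
    if causal:
        keep = causal_keep_mask(scores.shape[-1])
        scores = scores.masked_fill(~keep, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # softmax along the key axis
    return torch.einsum("bij,bjd->bid", weights, v)
```

Comparing the three in `float64` makes the 1e-6 agreement check less sensitive to accumulation order than `float32`.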

Lab 02 — Multi-Head Attention

Field	Value
Goal	Build multi-head attention as a single fused operation; benchmark vs separate heads.
Concepts	Reshape trick `(B, T, n_head, d_head)`, single big linear projection vs per-head projections, output projection.
Steps	1) Naive: loop over heads. 2) Fused: single (3 × d_model) projection, reshape to heads, batched matmul. 3) Compare wall-clock. 4) Compare against `nn.MultiheadAttention`.
Stack	PyTorch
Output	`mha.py` with fused implementation + benchmark plot.
How to Test	Output matches `nn.MultiheadAttention` within 1e-5.
Talking Points	The "concat-then-project" view vs "project-then-concat" view (mathematically equivalent). Why heads enable subspace specialization.
Resume Bullet	"Implemented fused multi-head attention with reshape/permute optimizations, validated against `torch.nn.MultiheadAttention` and benchmarked to within 8% of the PyTorch reference implementation on an A100."
Extensions	Implement Grouped-Query Attention (GQA, used in Llama-3); implement MQA.
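
The fused variant from step 2 might look roughly like the sketch below. The class name `FusedMHA` and the helper `copy_from_reference` are hypothetical names introduced for illustration; the sketch assumes self-attention only, no masking, and no dropout.

```python
import math
import torch
import torch.nn as nn


class FusedMHA(nn.Module):
    """Self-attention with one (d_model -> 3 * d_model) projection for Q, K, V."""

    def __init__(self, d_model, n_head):
        super().__init__()
        assert d_model % n_head == 0, "d_model must be divisible by n_head"
        self.n_head = n_head
        self.d_head = d_model // n_head
        self.qkv = nn.Linear(d_model, 3 * d_model)  # fused Q/K/V projection
        self.out = nn.Linear(d_model, d_model)      # output projection

    def _split_heads(self, t):
        # (B, T, d_model) -> (B, n_head, T, d_head)
        B, T, _ = t.shape
        return t.view(B, T, self.n_head, self.d_head).transpose(1, 2)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = self._split_heads(q), self._split_heads(k), self._split_heads(v)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_head)  # (B, H, T, T)
        weights = torch.softmax(scores, dim=-1)
        ctx = (weights @ v).transpose(1, 2).reshape(B, T, C)  # concatenate heads
        return self.out(ctx)


def copy_from_reference(mha, ref):
    """Hypothetical helper: copy weights from an nn.MultiheadAttention module.

    Assumes `ref` was built with equal query/key/value dims, so it exposes a
    stacked `in_proj_weight` laid out as [W_q; W_k; W_v].
    """
    with torch.no_grad():
        mha.qkv.weight.copy_(ref.in_proj_weight)
        mha.qkv.bias.copy_(ref.in_proj_bias)
        mha.out.weight.copy_(ref.out_proj.weight)
        mha.out.bias.copy_(ref.out_proj.bias)
```

One way to run the "How to Test" check is to build `nn.MultiheadAttention(d_model, n_head, batch_first=True)`, copy its weights with the helper, and compare outputs on the same input with `torch.allclose(..., atol=1e-5)`.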

Lab 03 — Positional Encodings: Sinusoidal, RoPE, ALiBi

Field	Value
Goal	Implement and compare three positional schemes; understand long-context implications.
Concepts	Why transformers need positional info; absolute vs relative; RoPE rotation in complex plane; ALiBi linear bias.
Steps	1) Implement sinusoidal (original). 2) Implement learned positional embedding. 3) Implement RoPE (apply to Q and K). 4) Implement ALiBi bias. 5) Train tiny LM with each; compare extrapolation to longer sequences than seen at training.
Stack	PyTorch
Output	`positional.py` + an extrapolation plot (loss vs sequence length, train_len vs eval_len).
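How to Test	Expect ALiBi to extrapolate best; RoPE typically beats sinusoidal but still degrades well past the training length unless scaled; learned absolute embeddings have no entries beyond the trained length.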
Talking Points	Why RoPE became dominant (Llama, Qwen, Gemma all use it). Why learned positional embeddings cap context length. The math of RoPE rotation.
Resume Bullet	"Implemented sinusoidal, learned, RoPE, and ALiBi positional encodings; demonstrated RoPE's 2.4× lower extrapolation perplexity at 4× training context length on a 4M-parameter LM."
Extensions	Implement RoPE scaling (NTK-aware, YaRN) — relevant to Llama-3 long-context.
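
The math of the RoPE rotation and the ALiBi bias can be checked without any framework. The sketch below is dependency-free Python for clarity (the lab itself uses PyTorch). It assumes the interleaved pairing `(x[0], x[1]), (x[2], x[3]), ...`, a base of 10000, and a power-of-two head count for the ALiBi slopes; all function names are illustrative.

```python
import math


def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair of x by the angle pos * theta_i.

    theta_i = base ** (-2i / d), i.e. one frequency per 2-D pair.
    """
    d = len(x)
    assert d % 2 == 0, "RoPE needs an even head dimension"
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out += [a * c - b * s, a * s + b * c]  # standard 2-D rotation
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def alibi_slopes(n_head):
    """Per-head slopes 2^(-8h / n_head) for h = 1..n_head (power-of-two n_head)."""
    assert n_head > 0 and n_head & (n_head - 1) == 0, "sketch assumes power of two"
    return [2.0 ** (-8.0 * h / n_head) for h in range(1, n_head + 1)]


def alibi_bias_row(slope, i):
    """Bias added to the logits of query i for keys 0..i: -slope * (i - j)."""
    return [-slope * (i - j) for j in range(i + 1)]


# Relative-position property: the score depends only on the position offset.
q = [0.3, -1.2, 0.7, 2.0]
k = [1.1, 0.4, -0.5, 0.9]
score_a = dot(rope(q, 5), rope(k, 3))    # offset 2
score_b = dot(rope(q, 12), rope(k, 10))  # offset 2, shifted by 7
assert abs(score_a - score_b) < 1e-9
```

Because RoPE is applied to Q and K only, the position information enters through the attention logits and V is left untouched. ALiBi likewise leaves Q, K, and V alone and adds the bias directly to the logits before the softmax.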

Lab 04 — Mini Transformer Block + Full Decoder

Field	Value
Goal	Compose attention + MLP + norms into a transformer block, then stack into a decoder-only model.
Concepts	Pre-norm transformer block, residual stream, MLP with GELU/SwiGLU, parameter counting, weight tying.
Steps	1) Build `TransformerBlock` (Attn → MLP, both with pre-norm + residual). 2) Stack N blocks. 3) Add token + positional embeddings. 4) Tied LM head. 5) Compute parameter count manually; verify matches `sum(p.numel() for p in model.parameters())`. 6) Forward pass on dummy batch.
Stack	PyTorch
Output	`transformer.py` (~200 lines) — your reference implementation reused in Phase 5.
How to Test	Output shape is correct; loss at init is close to `log(vocab_size)` (the cross-entropy of a uniform prediction); model overfits a single batch in <100 steps.
Talking Points	Why pre-norm trains more stably than post-norm in deep stacks. Why the MLP is 4× wider. Weight tying rationale. Architectural differences between GPT-2 and Llama-3.
Resume Bullet	"Implemented a 200-line decoder-only transformer (multi-head attention + pre-norm + SwiGLU MLP + RoPE + tied LM head) and validated against init-loss and single-batch overfit sanity checks."
Extensions	Add KV-cache (preview of Phase 9); add Grouped-Query Attention; swap LayerNorm → RMSNorm.
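
Step 1 of this lab (a pre-norm block with two residual branches) could be sketched as below. This is one possible layout, not the reference solution: it assumes LayerNorm, a GELU MLP at 4× width, biases on every linear layer, and causal self-attention without dropout, RoPE, or a KV-cache. The class name `Block` is illustrative.

```python
import math
import torch
import torch.nn as nn


class Block(nn.Module):
    """Pre-norm decoder block: x + Attn(LN(x)), then x + MLP(LN(x))."""

    def __init__(self, d_model, n_head):
        super().__init__()
        assert d_model % n_head == 0
        self.n_head = n_head
        self.ln1 = nn.LayerNorm(d_model)
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.proj = nn.Linear(d_model, d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def attn(self, x):
        B, T, C = x.shape
        d_head = C // self.n_head
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # (B, T, C) -> (B, n_head, T, d_head)
        q = q.view(B, T, self.n_head, d_head).transpose(1, 2)
        k = k.view(B, T, self.n_head, d_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, d_head).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)
        keep = torch.tril(torch.ones(T, T, dtype=torch.bool, device=x.device))
        scores = scores.masked_fill(~keep, float("-inf"))  # causal mask
        ctx = torch.softmax(scores, dim=-1) @ v
        return self.proj(ctx.transpose(1, 2).reshape(B, T, C))

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # residual branch 1: attention
        x = x + self.mlp(self.ln2(x))   # residual branch 2: MLP
        return x
```

Under these assumptions each block holds 12·d_model² + 13·d_model parameters, which gives a quick cross-check for the manual count in step 5.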

Deliverables Checklist

Attention implementation (3 ways) with tests
Multi-head attention benchmarked against nn.MultiheadAttention
Positional-encoding ablation report
200-line transformer that overfits a single batch

Interview Relevance

This phase is the technical heart of LLM interviews. Expect:

Whiteboard derivation of attention
"Implement multi-head attention in 30 minutes"
"Compare RoPE and ALiBi"
"Walk through a transformer block"
Parameter-count math problems
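
For the parameter-count problems, a worked example helps. The sketch below counts parameters for a GPT-2-style decoder under explicit assumptions: learned absolute position embeddings, biases on every linear layer, LayerNorm with weight and bias, a 4× MLP, and a tied LM head. The function names are illustrative. Llama-style models (no biases, RMSNorm, SwiGLU, GQA, RoPE) need a different formula.

```python
import math


def gpt2_style_param_count(vocab_size, n_ctx, d_model, n_layer):
    """Count parameters of a GPT-2-style decoder with a tied LM head."""
    tok_emb = vocab_size * d_model
    pos_emb = n_ctx * d_model
    # Attention: fused QKV projection plus output projection, both with bias.
    attn = (d_model * 3 * d_model + 3 * d_model) + (d_model * d_model + d_model)
    # MLP: d_model -> 4*d_model -> d_model, both with bias.
    mlp = (d_model * 4 * d_model + 4 * d_model) + (4 * d_model * d_model + d_model)
    # Two LayerNorms per block, each with a weight and a bias vector.
    norms = 2 * (2 * d_model)
    per_block = attn + mlp + norms  # = 12 * d_model**2 + 13 * d_model
    final_norm = 2 * d_model
    # The tied LM head reuses tok_emb, so it adds no parameters.
    return tok_emb + pos_emb + n_layer * per_block + final_norm


def expected_init_loss(vocab_size):
    """Cross-entropy of a uniform prediction over the vocabulary."""
    return math.log(vocab_size)


# GPT-2 small configuration: 50257 tokens, 1024 positions, d_model 768, 12 layers.
total = gpt2_style_param_count(50257, 1024, 768, 12)
print(f"{total:,}")  # 124,439,808
print(round(expected_init_loss(50257), 2))  # 10.82
```

The embeddings account for roughly 39M of the 124M parameters here, which is why the "12·d²·L" rule of thumb undercounts small models.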

LLM Inference Engineer