🛸 Hitchhiker's Guide — Phase 4: Attention and Transformers
Read this if: You want to be able to implement a transformer from scratch on a whiteboard, defend every design choice, and answer every variant of "explain attention" you'll get in an interview. This is the most important phase of the curriculum. Spend twice as long here as anywhere else.
0. The 30-second mental model
Attention is a content-based, weighted average. Given a query vector q and a set of key-value pairs {(k_i, v_i)}, compute similarities s_i = q · k_i, normalize them with softmax to get weights α_i, and return Σ α_i v_i. That's it. Everything else — multi-head, causal masking, RoPE, KV cache, FlashAttention — is a refinement of that one operation.
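To make the weighted-average picture concrete, here is a minimal sketch of one query attending over two key-value pairs, in plain Python with no scaling and no batching. The helper names `softmax` and `attend` and the toy vectors are illustrative assumptions, not part of the lab.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Return (output, weights) for one query against a set of key-value pairs."""
    # s_i = q . k_i
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    # alpha_i = softmax(s)_i
    weights = softmax(scores)
    # output = sum_i alpha_i * v_i, computed one output dimension at a time
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return output, weights

# Toy data (hypothetical): the query lines up with the first key.
q = [1.0, 0.0]
keys = [[10.0, 0.0], [0.0, 10.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
out, alpha = attend(q, keys, values)
print(alpha)  # almost all weight on the first key
print(out)    # therefore very close to the first value vector
```

Because the query matches the first key far better than the second, the output is essentially the first value vector: attention acts as a soft lookup.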
A transformer is a stack of "blocks", where each block applies (a) self-attention so every token can pull information from every other token, and (b) a position-wise MLP that processes each token's representation independently. Repeat 12, 32, 80, 96 times. Add a softmax head to predict the next token. Done.
By the end of Phase 4 you should:
- Derive scaled dot-product attention from first principles.
- Know exactly why we divide by √d_k, why we use multi-head, why we use causal masking.
- Implement RoPE (and explain why it's "relative" without an explicit (i-j)).
- Compare LayerNorm vs RMSNorm, GELU vs SwiGLU, post-norm vs pre-norm.
- Reason about KV-cache memory and its scaling.
- Implement a MiniGPT from a blank file in 30 minutes (the lab does ~150 lines).
1. The road to attention
1.1 Why RNNs needed help
In a seq2seq translation model, the encoder RNN summarizes the source sentence into a single fixed vector — and the decoder must squeeze the entire meaning of "The agreement on the European Economic Area was signed in August 1992" through this bottleneck. Disaster on long sentences.
1.2 Bahdanau attention (2015)
Bahdanau, Cho, and Bengio added an "alignment" mechanism: at each decoder step, look at all encoder hidden states and softmax over their similarities to the current decoder state. Now the decoder gets a weighted average focused on the source tokens that matter for the current target token. Translation quality jumped immediately.
This is the seed crystal. Everything after is "attention but more so".
1.3 Attention Is All You Need (Vaswani et al., 2017)
The Google Brain team noticed: if attention is so good, why have the RNN at all? Replace the recurrence with attention layers. Add positional encodings (so the model knows token order without recurrence). Stack. Train.
The result was the Transformer. Every modern foundation model (GPT-4, Claude 4, Gemini 2.5, Llama-3, Mistral, DeepSeek) is a descendant of this paper.
2. Scaled Dot-Product Attention (the unit)
2.1 The math
Inputs: queries Q ∈ ℝ^{T×d_k}, keys K ∈ ℝ^{T×d_k}, values V ∈ ℝ^{T×d_v}. Output:
$$ \text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V $$
Step-by-step:
- S = Q K^⊤ / √d_k: pairwise scores, shape (T, T). Each row S_i says how much token i cares about every other token.
- P = softmax(S, dim=-1): row-wise normalization.
- O = P V: the output is a weighted sum of value vectors.
2.2 Why divide by √d_k?
If Q and K entries have unit variance and zero mean, then the dot product q · k (a sum of d_k independent products) has variance d_k. For d_k = 64, that's stddev 8. Pushing such large values into softmax saturates it: most weight goes to one element, gradients vanish.
Dividing by √d_k keeps the score variance ≈ 1 regardless of d_k. This is purely a numerical-stability trick at initialization, not a "more correct" formulation.
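The variance claim is easy to check empirically. This is a small Monte Carlo sketch, assuming i.i.d. standard-normal entries for q and k; the function name `dot_variance`, the sample count, and the seed are arbitrary illustrative choices.

```python
import math
import random

def dot_variance(d_k, scale, n_samples=4000, seed=0):
    """Estimate Var(scale * q.k) for q, k with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_k)]
        samples.append(scale * sum(a * b for a, b in zip(q, k)))
    mean = sum(samples) / n_samples
    return sum((s - mean) ** 2 for s in samples) / n_samples

d_k = 64
raw_var = dot_variance(d_k, scale=1.0)                      # expected to land near d_k
scaled_var = dot_variance(d_k, scale=1.0 / math.sqrt(d_k))  # expected to land near 1
print(f"raw: {raw_var:.1f}  scaled: {scaled_var:.2f}")
```

The estimates are noisy, but the raw variance should come out close to 64 and the scaled one close to 1, which is the whole argument for the √d_k factor.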
2.3 Causal masking (for decoder-only LMs)
For autoregressive generation, token t must not attend to positions after t. Implement this by setting the strictly upper-triangular entries of S (those above the diagonal) to -∞ before the softmax:
mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))
After softmax those entries become 0. This is what makes the transformer a language model in the autoregressive sense.
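Here is a dependency-free sketch of the same idea end to end: scale, mask, softmax, weighted sum. It assumes Q, K, and V are plain lists of rows; the function name `causal_attention` and the toy matrices are illustrative, and real code would use the batched tensor version.

```python
import math

def causal_attention(Q, K, V):
    """Scaled dot-product attention with a causal mask; Q, K, V are lists of rows."""
    T, d_k = len(Q), len(Q[0])
    outputs, weights = [], []
    for t in range(T):
        # Scores of position t against every position, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(Q[t], K[j])) / math.sqrt(d_k)
                  for j in range(T)]
        # Mask the future: positions j > t get -inf, so exp() maps them to 0.
        scores = [s if j <= t else float("-inf") for j, s in enumerate(scores)]
        m = max(scores)  # finite, because position t itself is never masked
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        row = [e / total for e in exps]
        weights.append(row)
        outputs.append([sum(row[j] * V[j][c] for j in range(T))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy inputs (hypothetical), T = 3 and d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, P = causal_attention(Q, K, V)
print(P[0])  # the first token can only see itself
```

The first row of the weight matrix puts all of its mass on position 0, and every entry above the diagonal is exactly zero.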
2.4 Why is attention O(T²)?
The score matrix is T × T, so compute and memory both grow quadratically in sequence length. For long context (32k, 128k, 1M), this is the bottleneck. FlashAttention (Dao et al., 2022) doesn't reduce the FLOPs but avoids materializing the full matrix in HBM, dramatically improving wall-clock time and memory use. Sparse / linear attention (Reformer, Linformer, Performer, Longformer) trades quality for sub-quadratic compute. Phase 9 covers all of these.
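A quick back-of-the-envelope calculation shows why materializing the matrix hurts. This sketch assumes 2-byte (fp16/bf16) scores and counts a single head of a single sequence; the function name `score_matrix_bytes` is made up for illustration.

```python
def score_matrix_bytes(seq_len, bytes_per_element=2, n_heads=1, batch=1):
    """Bytes needed to hold the full (T, T) attention score matrix."""
    return batch * n_heads * seq_len * seq_len * bytes_per_element

GIB = 1024 ** 3
for T in (4096, 32768, 131072):
    print(T, score_matrix_bytes(T) / GIB, "GiB per head")
```

Under these assumptions a 32k context needs 2 GiB per head just for the scores, and 128k needs 32 GiB, before multiplying by the number of heads and the batch size.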
3. Multi-Head Attention
3.1 The intuition
Different "heads" can specialize in different relationships: one head tracks subject–verb agreement, another co-references pronouns, another keeps positional adjacency. A single attention does one weighted average; h heads do h of them in parallel and concatenate.
3.2 The math (and the parameter count)
Pick n_heads and d_head such that n_heads × d_head = d_model. Project the input three times with shape-d_model × d_model matrices W_Q, W_K, W_V, then reshape the result into (B, n_heads, T, d_head). Run scaled dot-product attention per head, concatenate, project with W_O.
# (B, T, C) -> (B, n_heads, T, d_head)
q = self.W_q(x).view(B, T, n_heads, d_head).transpose(1, 2)
Total parameters in attention: 4 d_model² (Q, K, V, O). The MLP block is 8 d_model² (typically 4× expansion factor up and back). Each transformer block is ~12 d_model² parameters; total ≈ 12 d_model² × n_layers.
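The 12 d_model² rule of thumb can be turned into a small calculator. This is a rough sketch that ignores biases, norm parameters, and positional embeddings; the function name `transformer_params` and the example configuration (32 layers, d_model = 4096, vocab 50k) are chosen for illustration.

```python
def transformer_params(d_model, n_layers, vocab_size, tied=True):
    """Approximate parameter count of a GPT-style model (no biases, no norms)."""
    attn = 4 * d_model * d_model   # W_Q, W_K, W_V, W_O
    mlp = 8 * d_model * d_model    # up (d -> 4d) plus down (4d -> d)
    blocks = n_layers * (attn + mlp)
    embed = vocab_size * d_model
    lm_head = 0 if tied else vocab_size * d_model
    return {"blocks": blocks, "embed": embed, "lm_head": lm_head,
            "total": blocks + embed + lm_head}

counts = transformer_params(d_model=4096, n_layers=32, vocab_size=50_000)
print(counts["blocks"])  # 6442450944, about 6.4B in the blocks
print(counts["total"])   # 6647250944 with a tied embedding
```

At this scale the blocks dominate: the embedding adds only about 0.2B on top of roughly 6.4B.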
3.3 MHA → MQA → GQA
- MHA (vanilla): each head has its own K and V projections. Best quality, biggest KV cache.
- MQA (Shazeer 2019): all heads share one K and V. KV cache shrinks by n_heads×. Slight quality drop on hard tasks.
- GQA (Ainslie 2023): heads grouped; one K/V per group. Tunable middle ground (Llama-3 8B: 32 query heads, 8 KV groups). Now standard.
The motivation for MQA/GQA is inference: at long context, the KV cache dominates GPU memory, so reducing KV size directly increases batch-size headroom and throughput.
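To see the effect in numbers, here is a sketch of the usual KV-cache size estimate. The configuration is an assumption modeled on a Llama-3-8B-like shape (32 layers, d_head = 128, fp16 cache, 8192-token context); `kv_cache_bytes` is a hypothetical helper, not a library function.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_element=2):
    """Size of the KV cache: one K and one V tensor per layer."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_element

GIB = 1024 ** 3
# Assumed shape: 32 layers, d_head = 128, 8192 tokens, 2 bytes per element.
shape = dict(n_layers=32, d_head=128, seq_len=8192)
mha = kv_cache_bytes(n_kv_heads=32, **shape)  # every query head has its own K/V
gqa = kv_cache_bytes(n_kv_heads=8, **shape)   # 8 KV groups
mqa = kv_cache_bytes(n_kv_heads=1, **shape)   # a single shared K/V
print(mha / GIB, gqa / GIB, mqa / GIB)        # 4.0 1.0 0.125
```

Per sequence, going from 32 KV heads to 8 cuts the cache from 4 GiB to 1 GiB under these assumptions, and that freed memory is what buys the larger batch.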
4. Position information
A transformer is permutation-equivariant without positional information — shuffle the input tokens and the output set is the same. We must inject positional signal somehow.
4.1 Sinusoidal positional encoding (Vaswani 2017)
Hand-designed sin/cos features added to the token embeddings. Each dimension oscillates at a different wavelength. Conceptually elegant; rarely used in modern LLMs.
4.2 Learned absolute positional embedding (BERT, GPT-2)
A learned (max_pos, d_model) matrix added to token embeddings. Simple but doesn't extrapolate beyond max_pos.
4.3 ALiBi (Press et al., 2022)
Adds a position-dependent bias to attention scores: s_{ij} ← s_{ij} - m · |i - j| for a per-head slope m. Linear penalty on distance. No vector positional encoding at all. Extrapolates to longer contexts than seen at train time.
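A minimal sketch of the bias construction follows. It assumes the geometric slope schedule from the ALiBi paper for a power-of-two head count (2^(-8/n), 2^(-16/n), ...); the function names `alibi_slopes` and `alibi_bias` are illustrative.

```python
def alibi_slopes(n_heads):
    """Per-head slopes 2^(-8/n), 2^(-16/n), ...; assumes n_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(seq_len, slope):
    """Bias added to the attention scores: -slope * |i - j|."""
    return [[-slope * abs(i - j) for j in range(seq_len)] for i in range(seq_len)]

slopes = alibi_slopes(8)
bias = alibi_bias(4, slopes[0])
print(slopes[:3])  # [0.5, 0.25, 0.125]
print(bias[3])     # the penalty grows linearly with distance from position 3
```

In a causal model only the entries with j ≤ i matter, since the rest are masked anyway. Heads with small slopes see far back; heads with large slopes stay local.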
4.4 RoPE (Su et al., 2021) — the modern winner
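Rotary Positional Embedding rotates Q and K vectors by an angle that depends on position. Pair adjacent dimensions (x_{2p}, x_{2p+1}) into a 2D point and rotate it by the angle pos · θ_p, where θ_p = base^{-2p/d}. Critically, after rotation, the dot product between a query at position i and a key at position j depends on the positions only through (i - j). For a single 2D pair with q = (q_0, q_1) and k = (k_0, k_1) (the full dot product sums this over all d/2 pairs):
$$ q'_i \cdot k'_j = (q \cdot k)\cos((i-j)\theta_p) + (q_0 k_1 - q_1 k_0)\sin((i-j)\theta_p) $$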
So RoPE is relative without an explicit (i-j) term. Used by Llama, Mistral, Qwen, Gemma, and most open models.
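The relative property can be checked numerically. This is a minimal sketch using the adjacent-pair convention described above and base = 10000; the function names and toy vectors are illustrative, and real implementations vectorize this and often use a different pairing layout.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair (x[2p], x[2p+1]) by the angle pos * base^(-2p/d)."""
    d = len(x)
    out = []
    for p in range(d // 2):
        angle = pos * base ** (-2.0 * p / d)
        c, s = math.cos(angle), math.sin(angle)
        x0, x1 = x[2 * p], x[2 * p + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

# Toy query/key vectors (hypothetical), d = 4.
q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]
near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
far = dot(rope(q, 105), rope(k, 102))    # shifted by 100, same offset 3
other = dot(rope(q, 5), rope(k, 4))      # offset 1
print(near, far, other)
```

`near` and `far` agree up to floating-point error even though the absolute positions differ by 100, while `other` is a different number: the score only sees the offset.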
Length extension tricks: position interpolation, NTK-aware scaling, YaRN. These rescale the position index, base, or θ so that a 4k-trained model can run at 32k or beyond, usually with a short fine-tune at the longer length.
4.5 References
- Su et al. (2021), RoFormer.
- Press et al. (2022), Train Short, Test Long: Attention with Linear Biases (ALiBi).
- bloc97's NTK-aware RoPE blog post and YaRN (Peng et al. 2023).
5. The Transformer Block
5.1 The standard recipe (pre-norm, modern)
input x
┌─→ LayerNorm → CausalSelfAttention ─→ + (residual)
│ │
└───────────────────────────────────────┘
│
┌─→ LayerNorm → MLP ───────────────────→ + (residual)
│ │
└────────────────────────────────────────┘
output
That is: x = x + Attn(LN(x)) then x = x + MLP(LN(x)). Repeat N times.
5.2 Pre-norm vs Post-norm
- Post-norm (original 2017): x = LN(x + Sublayer(x)). Gradients flow through the LayerNorm and vanish for deep stacks. Required learning-rate warmup gymnastics.
- Pre-norm: x = x + Sublayer(LN(x)). Gradient has a clean residual highway. Stable past 100+ layers.
Nearly every modern LLM is pre-norm.
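The "residual highway" can be seen in a toy example. This sketch uses a LayerNorm without learned scale or shift and a sublayer that outputs zeros, both assumptions made purely for illustration; all function names are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned scale/shift, for illustration only."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm(x, sublayer):
    # x = LN(x + Sublayer(x)): the norm sits on the residual path itself.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm(x, sublayer):
    # x = x + Sublayer(LN(x)): the residual path is left untouched.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def zero_sublayer(h):
    """Stand-in sublayer that contributes nothing."""
    return [0.0] * len(h)

x = [4.0, -2.0, 10.0]
print(pre_norm(x, zero_sublayer))   # x passes through unchanged
print(post_norm(x, zero_sublayer))  # x is renormalized even though the sublayer did nothing
```

With pre-norm the block reduces to the identity when the sublayer is silent, so stacking N blocks keeps a direct path from input to output. With post-norm every block rescales the stream.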
5.3 LayerNorm vs RMSNorm
LayerNorm: y = γ · (x - μ) / σ + β — subtract mean, divide by std, scale, shift.
RMSNorm: y = γ · x / RMS(x) — drop the mean subtraction, drop the bias. ~10% faster, no quality loss in practice. Used by Llama, Mistral, Qwen.
Why does dropping the mean work? Empirical observation backed by some analysis: the centering operation is largely redundant once activations are well-conditioned at depth.
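Side by side, the two norms differ by only a couple of lines. This is a plain-Python sketch over a single vector, with an epsilon added for stability (an assumption; the formulas above omit it).

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """y = gamma * (x - mean) / std + beta."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """y = gamma * x / RMS(x): no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gamma)]

ones, zeros = [1.0] * 4, [0.0] * 4
x = [1.0, 2.0, 3.0, 4.0]
centered = [-1.5, -0.5, 0.5, 1.5]   # the same vector with its mean removed

ln_out = layer_norm(x, ones, zeros)
rms_out = rms_norm(x, ones)
print(ln_out)   # zero mean, unit variance
print(rms_out)  # unit RMS, mean not removed
```

On an input that already has zero mean, such as `centered`, the two functions return the same vector. That is the intuition behind treating the centering step as redundant.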
5.4 The MLP block
mlp_out = down_proj(activation(up_proj(x)))
For most transformers, up_proj expands by 4× (so a d_model = 4096 model has a 16384-wide hidden layer in the MLP). Activation choices:
- ReLU: original; rarely used now.
- GELU: smooth ReLU; used by GPT-2, BERT.
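- SwiGLU (Shazeer 2020): W_down((W_up x) ⊙ silu(W_gate x)), a gated linear unit with Swish gating. The extra gate matrix means 50% more MLP parameters at the same hidden width, so in practice the hidden width is shrunk to about (8/3) · d_model to keep parameters and FLOPs matched; quality is still better. Used by Llama, Qwen, Mistral.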
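As a sketch of the gated MLP and its parameter cost, the following uses plain lists for weights, with no biases. The function names and the tiny 1×1 example are illustrative assumptions.

```python
import math

def silu(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_mlp(x, W_gate, W_up, W_down):
    """down_proj((W_up x) * silu(W_gate x)), elementwise product in the hidden layer."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)

def mlp_params(d_model, hidden, gated):
    """Two matrices for a plain MLP, three for a gated one (no biases)."""
    return (3 if gated else 2) * d_model * hidden

y = swiglu_mlp([2.0], W_gate=[[1.0]], W_up=[[3.0]], W_down=[[1.0]])

d = 4096
plain = mlp_params(d, 4 * d, gated=False)          # GELU MLP with 4x expansion
same_width = mlp_params(d, 4 * d, gated=True)      # 1.5x the parameters
matched = mlp_params(d, (8 * d) // 3, gated=True)  # hidden width about 8/3 * d
print(plain, same_width, matched)
```

With the hidden width cut to roughly 8/3 · d_model, the gated MLP lands within a rounding error of the plain 4× MLP's parameter count.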
5.5 Weight tying
The token embedding matrix E ∈ ℝ^{V × d} and the LM head matrix W_lm ∈ ℝ^{d × V} are often shared (W_lm = E^⊤). Saves V × d parameters (significant: 50k × 4096 = 200M). Justified theoretically by symmetry and empirically by similar or better perplexity. The MiniGPT lab implements this.
5.6 Initialization
You can't init transformer weights from a uniform [-1, 1]. Standard recipe (GPT-style):
- Token embeddings: N(0, 0.02), i.e. mean 0 and standard deviation 0.02.
- Linear layers: N(0, 0.02).
- Residual-stream output projections (W_O, W_down): N(0, 0.02 / √(2 N)), where N is the number of layers. This counteracts variance growth through the residual stream.
A correctly initialized model should have an initial loss of ≈ log(vocab_size) (uniform-distribution prediction). The lab's sanity_init_loss test checks exactly this.
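Both numbers are easy to reproduce by hand. The sketch below assumes a GPT-2-sized vocabulary of 50257 and a 12-layer model purely as examples; `cross_entropy` and `residual_proj_std` are hypothetical helpers, not the lab's API.

```python
import math

def cross_entropy(logits, target):
    """Negative log-probability of the target under softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[target]

def residual_proj_std(n_layers, base_std=0.02):
    """Scaled-down std for W_O and W_down: base_std / sqrt(2 * n_layers)."""
    return base_std / math.sqrt(2 * n_layers)

vocab_size = 50257  # assumed example value
# A model that knows nothing outputs (near-)equal logits for every token.
uniform_loss = cross_entropy([0.0] * vocab_size, target=123)
print(uniform_loss)           # equals ln(vocab_size), about 10.82
print(residual_proj_std(12))  # about 0.0041 for a 12-layer model
```

If a freshly initialized model reports a loss far from this value, check the init scale and the target alignment before touching anything else.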
6. Putting it together — the GPT-style architecture
input: token IDs (B, T)
│
▼
[Token Embedding] (V, d) → (B, T, d)
+
[Positional encoding (or RoPE applied inside attention)]
│
▼
[Block 1] = pre-norm + causal MHA + residual + pre-norm + MLP + residual
[Block 2]
...
[Block N]
│
▼
[Final LayerNorm]
│
▼
[LM Head] (d, V) — weight-tied to embedding
│
▼
logits (B, T, V)
│
▼
softmax → probabilities → loss (cross-entropy vs next-token target)
That's a complete decoder-only LLM. Llama, GPT-3, Claude, Gemini — same skeleton, different sizes and tweaks (RoPE flavor, GQA group count, SwiGLU, RMSNorm, attention bias removal).
6.1 Encoder vs decoder vs encoder-decoder
- Encoder (BERT): bidirectional attention; trained with masked LM. Used for classification, embeddings.
- Decoder (GPT, Claude, Llama): causal attention; autoregressive. Used for generation.
- Encoder-Decoder (T5, BART, original transformer): encoder reads input bidirectionally, decoder generates output causally with cross-attention to encoder. Used for translation, summarization (legacy).
In 2024+, decoder-only dominates. Why? Empirically, decoder-only with prompt-based learning matches encoder-decoder quality and is simpler to scale.
7. Lab walkthrough (lab-04-mini-transformer)
7.1 Architecture
The lab builds MiniGPT:
- GPTConfig dataclass: vocab_size, n_layer, n_head, d_model, block_size, dropout.
- CausalSelfAttention: fused QKV projection (one matmul producing all three), reshape to heads, scaled dot-product, mask, softmax, weighted sum, output projection.
- MLP: Linear → GELU → Linear with 4× expansion.
- Block: pre-norm + attn + residual + pre-norm + MLP + residual.
- MiniGPT: embedding + position embedding + N blocks + final LN + tied LM head.
7.2 The two sanity tests
sanity_init_loss(): a freshly-initialized model on random tokens should produce a loss ≈ log(vocab_size). If yours is much higher, your init is broken; if much lower, you have a target leak.
sanity_overfit_one_batch(): take 1 batch, train for ~100 steps; loss should go to near zero. If it doesn't, you have a bug — gradient not flowing, wrong target alignment, frozen parameters. This is the single most useful debugging test.
7.3 Things to read in the solution
- The fused QKV projection: qkv = self.c_attn(x) produces (B, T, 3*d_model) in one matmul; split into Q/K/V. Faster than three separate matmuls (better tensor-core utilization).
- Causal mask is registered as a buffer: not a parameter, but it moves with .to(device).
- The view → transpose → matmul → transpose → contiguous → view dance for multi-head: make sure you trace shapes by hand.
- Weight tying: self.lm_head.weight = self.token_emb.weight.
8. References
Required:
- Vaswani et al. (2017), Attention Is All You Need: read it twice.
- Karpathy, Let's build GPT: from scratch, in code, spelled out (the YouTube lecture, ~2 hours). Mandatory.
- Karpathy's nanoGPT: read every line.
- Lilian Weng, The Transformer Family: comprehensive blog overview.
- Jay Alammar, The Illustrated Transformer: best diagrams.
Important:
- Radford et al. (2018), Improving Language Understanding by Generative Pre-Training — GPT-1.
- Radford et al. (2019), Language Models are Unsupervised Multitask Learners — GPT-2.
- Brown et al. (2020), Language Models are Few-Shot Learners — GPT-3.
- Touvron et al. (2023), LLaMA: Open and Efficient Foundation Language Models; Llama-2 and Llama-3 papers.
- Devlin et al. (2018), BERT.
Architecture variants:
- Su et al. (2021), RoFormer (RoPE).
- Shazeer (2019), Fast Transformer Decoding: One Write-Head Is All You Need (MQA).
- Ainslie et al. (2023), GQA: Training Generalized Multi-Query Transformer Models.
- Shazeer (2020), GLU Variants Improve Transformer.
- Zhang & Sennrich (2019), Root Mean Square Layer Normalization (RMSNorm).
Theoretical:
- Elhage et al. (2021), A Mathematical Framework for Transformer Circuits (Anthropic) — circuits-level interpretability of attention.
- Olsson et al. (2022), In-Context Learning and Induction Heads (Anthropic).
- Phuong & Hutter (2022), Formal Algorithms for Transformers — pseudocode for everything.
9. Common interview questions on Phase 4 material
- Implement scaled dot-product attention on a whiteboard.
- Why divide by √d_k?
- Why multi-head and not single-head with bigger d?
- Compare MHA, MQA, GQA. When would you pick each?
- Compare absolute positional, ALiBi, and RoPE.
- Walk me through what happens during one forward pass of a 12-layer GPT.
- Why pre-norm and not post-norm?
- Why RMSNorm and not LayerNorm?
- What's weight tying and why does it help?
- What's the parameter count of a 32-layer, 4096-dim transformer with vocab 50k?
- Why is the time complexity of attention O(T²) and what can you do about it?
- Sketch how you'd add a KV cache to your MiniGPT. (Bridges to Phase 9.)
- Explain SwiGLU vs GELU.
- What's a residual stream? Why is it useful for analysis?
- What fails first as you scale a transformer to 70B and 1024 GPUs? (Bridges to Phase 10.)
10. From solid → exceptional
- Implement MiniGPT from a blank file in 30 minutes without consulting solution.py. Time yourself.
- Add RoPE to your MiniGPT (replace the additive position embedding). Compare loss curves.
- Add MQA, then GQA. Measure throughput at long context.
- Replace GELU with SwiGLU. Compare equal-FLOP runs.
- Implement attention three ways (einsum, manual bmm, F.scaled_dot_product_attention). Benchmark each.
- Read Anthropic's A Mathematical Framework for Transformer Circuits and write a one-page summary of "induction heads".
- Pick a real released model (Llama-3 8B, Mistral 7B, Qwen2 7B). Read its config; identify every architectural choice and explain why it was made.
- Do a line-by-line annotation of nanoGPT's model.py in a markdown file. This is the most valuable single hour you can spend.
11. Recommended cadence
| Day | Activity |
|---|---|
| Mon | Read Attention Is All You Need slowly; sketch every diagram |
| Tue | Watch Karpathy's Let's build GPT lecture (~2 hours) |
| Wed | Read nanoGPT/model.py line by line; annotate |
| Thu | Lab 04 — implement MiniGPT from blank; run sanity tests |
| Fri | Implement RoPE replacement; benchmark vs absolute positional |
| Sat | Read GPT-1, 2, 3 papers (skim 1–2, read 3 in detail) |
| Sun | Practice the 15 interview questions out loud; whiteboard the architecture |