Phase 3 — RNNs & Language Modeling
Difficulty: ⭐⭐⭐☆☆ | Estimated Time: 1.5 weeks | Roles supported: Foundation Model Engineer (historical literacy); all research-engineer roles (a solid answer to the interview prompt "explain attention" requires knowing what came before).
Why This Phase Exists
You will not deploy an RNN to production in 2026. But you will be asked in interviews:
- "Why did transformers replace RNNs?"
- "Explain LSTM gating mathematically."
- "What is teacher forcing?"
- "Where did attention come from?"
Building a char-RNN and a seq2seq model with Bahdanau attention is the cheapest way to internalize these answers — and it makes the leap to transformers in Phase 4 trivial.
Concepts
- Sequence modeling: P(x_t | x_<t)
- Vanilla RNN: hidden-state recurrence h_t = tanh(W_x x_t + W_h h_{t-1})
- Backpropagation through time (BPTT)
- Vanishing/exploding gradients (and the math behind why)
- LSTM: forget / input / output gates, cell state
- GRU: reset / update gates (simpler, often comparable)
- Sequence-to-sequence: encoder-decoder, fixed-context-vector bottleneck
- Bahdanau (additive) attention — the precursor to transformer attention
- Teacher forcing, scheduled sampling
- Perplexity = exp(cross-entropy loss), with the loss averaged per token and measured in nats
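
The perplexity identity in the last bullet can be checked by hand. The following is a minimal sketch in plain Python (no PyTorch), assuming the loss is the mean negative log-likelihood in nats. The function name `perplexity` and the vocabulary size of 50 are illustrative assumptions, not part of the labs.

```python
import math

def perplexity(target_probs):
    """Perplexity from the probabilities a model assigned to the true next tokens."""
    # Per-token cross-entropy (negative log-likelihood) in nats.
    nll = [-math.log(p) for p in target_probs]
    mean_ce = sum(nll) / len(nll)
    return math.exp(mean_ce)

# A model that guesses uniformly over a hypothetical 50-symbol vocabulary
# has perplexity equal to the vocabulary size.
vocab_size = 50
uniform_ppl = perplexity([1.0 / vocab_size] * 10)

# A model that assigns 0.5 and 0.25 to two true tokens.
mixed_ppl = perplexity([0.5, 0.25])

print(f"uniform: {uniform_ppl:.3f}, mixed: {mixed_ppl:.3f}")
```

One way to read the number: a perplexity of k means the model is, on average, as uncertain as if it were choosing uniformly among k symbols.
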
Labs
Lab 01 — Vanilla RNN Char-Language-Model From Scratch
| Field | Value |
|---|---|
| Goal | Train a character-level RNN on Tiny Shakespeare and generate text. |
| Concepts | RNN forward, BPTT, character tokenization, sampling. |
| Steps | 1) Char-level tokenize Shakespeare. 2) Implement RNNCell from scratch (do NOT use nn.RNN). 3) Wrap in a loop with manual hidden-state propagation. 4) Cross-entropy loss. 5) Train ~1k steps. 6) Sample with temperature. |
| Stack | PyTorch (only nn.Linear, nn.Embedding, autograd) |
| Datasets | Tiny Shakespeare (1.1 MB) |
| Output | A model that generates pseudo-Shakespearean text; loss curve; sample output for temperature ∈ {0.5, 0.8, 1.2}. |
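| How to Test | Training loss trends downward (minibatch loss is noisy, so judge a smoothed curve rather than expecting a strictly monotonic decrease); samples become more English-like over training. |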
| Talking Points | Why gradients vanish in vanilla RNNs. Why we clip gradients. Why temperature controls diversity. |
| Resume Bullet | "Implemented a character-level RNN language model from scratch in PyTorch (no nn.RNN), trained on Tiny Shakespeare to perplexity 4.1, with temperature-controlled sampling demo." |
| Extensions | Add gradient clipping; add truncated BPTT for longer sequences. |
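
Step 6 of Lab 01 (sampling with temperature) can be sketched as follows. This is an illustrative version in plain Python so the arithmetic is visible; in the lab the same computation would run on PyTorch tensors. The helper names `softmax_with_temperature` and `sample_index` and the example logits are assumptions made for this sketch.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]  # hypothetical next-character logits
cold = softmax_with_temperature(logits, 0.5)  # sharper distribution
warm = softmax_with_temperature(logits, 1.2)  # flatter distribution

rng = random.Random(0)  # seeded so runs are repeatable
picked = sample_index(logits, 0.8, rng)
print(cold, warm, picked)
```

Temperatures below 1 concentrate probability on the top-scoring character (safer, more repetitive text); temperatures above 1 flatten the distribution (more diverse, more errors).
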
Lab 02 — LSTM & GRU (And Why They Help)
| Field | Value |
|---|---|
| Goal | Implement LSTM and GRU cells from scratch; reproduce gradient-flow advantage. |
| Concepts | LSTM gate equations, cell-state highway, GRU simplification, gradient flow comparison. |
| Steps | 1) Implement LSTMCell and GRUCell from primitives. 2) Train all three (RNN/LSTM/GRU) on Shakespeare. 3) Plot gradient norms over time. |
| Stack | PyTorch |
| Output | Three checkpoints + a gradient-norm plot + a perplexity comparison table. |
| How to Test | LSTM/GRU should beat vanilla RNN on perplexity within the same compute budget. |
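| Talking Points | Walk through LSTM equations on whiteboard. Why the cell state is updated additively (a gated sum of the old state and a new candidate) rather than rewritten through a weight matrix and nonlinearity at every step. When GRU matches LSTM. |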
| Resume Bullet | "Implemented LSTM and GRU cells from scratch and demonstrated 38% perplexity reduction over vanilla RNN with controlled gradient-norm visualization." |
| Extensions | Add bidirectional LSTM; benchmark against nn.LSTM (cuDNN-fused) for wall-clock. |
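
The gradient-flow comparison in Lab 02 can be previewed with scalars before writing any PyTorch. The sketch below implements one LSTM step with the standard gate equations and compares two quantities over 50 steps: the product of forget gates (the derivative of the final cell state with respect to the initial one along the direct cell-state path only, ignoring the indirect paths through the hidden state) and the product of local derivatives in a vanilla tanh RNN. All parameter values, the sine-wave input, and the names `lstm_step`, `lstm_path`, and `rnn_path` are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One scalar LSTM step. params maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("f"))    # forget gate: how much old cell state to keep
    i = sigmoid(pre("i"))    # input gate: how much new candidate to write
    o = sigmoid(pre("o"))    # output gate: how much cell state to expose
    g = math.tanh(pre("g"))  # candidate cell value
    c = f * c_prev + i * g   # additive cell-state update
    h = o * math.tanh(c)     # hidden state
    return h, c, f

# Hypothetical parameters; the forget-gate bias of 3.0 keeps f close to 1.
params = {"f": (0.1, 0.1, 3.0), "i": (0.5, 0.5, 0.0),
          "o": (0.5, 0.5, 0.0), "g": (0.5, 0.5, 0.0)}
steps = 50
inputs = [math.sin(t) for t in range(steps)]

# LSTM: along the direct cell-state path, d c_T / d c_0 is the product of f_t.
h, c, lstm_path = 0.0, 0.0, 1.0
for x in inputs:
    h, c, f = lstm_step(x, h, c, params)
    lstm_path *= f

# Vanilla RNN h_t = tanh(w_x x_t + w_h h_{t-1}):
# each step contributes a factor w_h * (1 - h_t^2).
w_x, w_h = 0.5, 0.5
h_rnn, rnn_path = 0.0, 1.0
for x in inputs:
    h_rnn = math.tanh(w_x * x + w_h * h_rnn)
    rnn_path *= w_h * (1.0 - h_rnn * h_rnn)

print(f"LSTM cell-state path: {lstm_path:.3e}")
print(f"Vanilla RNN path:     {rnn_path:.3e}")
```

With these settings the vanilla RNN factor is at most 0.5 per step, so its product collapses toward zero, while the forget gate stays near 1 and the cell-state path retains a usable signal. A trained model will not necessarily keep its forget gates this open; the lab's gradient-norm plot is the place to check what actually happens.
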
Lab 03 — Seq2Seq + Bahdanau Attention (Toy Translation)
| Field | Value |
|---|---|
| Goal | Build an encoder-decoder with additive attention — the direct precursor to transformer attention. |
| Concepts | Encoder/decoder split, fixed-context bottleneck, additive attention scores, teacher forcing. |
| Steps | 1) Toy parallel corpus (e.g., date-format conversion: "March 14, 2024" → "2024-03-14"). 2) GRU encoder, GRU decoder. 3) First train without attention. 4) Add Bahdanau attention. 5) Compare both — attention should crush the baseline on long inputs. 6) Visualize attention weights as a heatmap. |
| Stack | PyTorch |
| Output | Two trained models + an attention heatmap PNG that clearly shows alignment. |
| How to Test | The attention model beats the no-attention baseline by at least 15 percentage points of accuracy on long inputs. |
| Talking Points | The bottleneck problem. Why attention "looks back". The bridge from this to scaled-dot-product attention in Phase 4. |
| Resume Bullet | "Implemented Bahdanau additive attention in a seq2seq encoder-decoder, achieving 96% sequence accuracy on a date-normalization task vs 71% without attention; produced interpretable attention-alignment visualizations." |
| Extensions | Replace additive with dot-product (Luong) and compare — natural lead-in to Phase 4. |
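
The additive scoring used in Lab 03 can be sketched without a framework. The version below assumes the common formulation score(s, h_j) = v · tanh(W_s s + W_h h_j), followed by a softmax over encoder positions and a weighted sum of encoder states. The function names, the 2-dimensional toy vectors, and the identity weight matrices are illustrative assumptions; in the lab these would be learned `nn.Linear` layers operating on GRU states.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_attention(dec_state, enc_states, W_s, W_h, v):
    """Return (attention weights, context vector) for one decoder step."""
    proj_s = matvec(W_s, dec_state)
    scores = []
    for h in enc_states:
        proj_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))

    # Softmax over encoder positions (max subtracted for stability).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Context vector: attention-weighted sum of encoder states.
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states))
               for d in range(dim)]
    return weights, context

# Toy example with hypothetical values.
identity = [[1.0, 0.0], [0.0, 1.0]]
dec_state = [1.0, 0.0]
enc_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
v = [1.0, 1.0]

weights, context = additive_attention(dec_state, enc_states, identity, identity, v)
print(weights, context)
```

Each row of the attention heatmap in the lab is one such `weights` vector, computed at one decoder step. Swapping the tanh-based score for a dot product between `dec_state` and each encoder state gives the Luong-style variant mentioned under Extensions.
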
Deliverables Checklist
- Char-RNN trained on Shakespeare with temperature sampling
- LSTM vs GRU vs RNN comparison + gradient-norm plot
- Seq2seq with attention + alignment heatmap
Interview Relevance
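- "Why did transformers replace RNNs?" Short answer: training parallelizes across time steps instead of running sequentially, and attention connects any two positions directly, which helps with long-range dependencies.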
- "Walk me through LSTM gates"
- "Where does scaled-dot-product attention come from historically?"