Phase 3 — RNNs & Language Modeling

Difficulty: ⭐⭐⭐☆☆ | Estimated Time: 1.5 weeks
Roles supported: Foundation Model Engineer (historical literacy); all research-engineer roles (a good interview answer to "explain attention" requires knowing what came before).

Why This Phase Exists

You will not deploy an RNN to production in 2026. But you will be asked in interviews:

"Why did transformers replace RNNs?"
"Explain LSTM gating mathematically."
"What is teacher forcing?"
"Where did attention come from?"

Building a char-RNN and a seq2seq model with Bahdanau attention is the cheapest way to internalize these answers, and it makes the leap to transformers in Phase 4 much smaller.

Concepts

Sequence modeling: P(x_t | x_<t)
Vanilla RNN: hidden-state recurrence h_t = tanh(W_x x_t + W_h h_{t-1})
Backpropagation through time (BPTT)
Vanishing/exploding gradients (and the math behind why)
LSTM: forget / input / output gates, cell state
GRU: reset / update gates (fewer parameters than LSTM, often comparable accuracy)
Sequence-to-sequence: encoder-decoder, fixed-context-vector bottleneck
Bahdanau (additive) attention — the precursor to transformer attention
Teacher forcing, scheduled sampling
Perplexity = exp(cross-entropy loss), with the loss averaged per token and measured in nats (natural log)
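
The perplexity identity in the last item can be checked by hand. The sketch below is framework-free and illustrative only: the function names are made up for this example, and the 65-symbol vocabulary is an assumed size for a character-level model. It shows that a model spreading probability uniformly over V symbols has perplexity V, which is why perplexity is often read as an "effective branching factor".

```python
import math

def cross_entropy_nats(probs_of_targets):
    """Mean negative log-likelihood, in nats, of the correct next tokens.

    probs_of_targets[i] is the probability the model assigned to the
    token that actually occurred at position i.
    """
    return -sum(math.log(p) for p in probs_of_targets) / len(probs_of_targets)

def perplexity(probs_of_targets):
    """Perplexity is the exponential of the mean cross-entropy (in nats)."""
    return math.exp(cross_entropy_nats(probs_of_targets))

# Assumed vocabulary size for illustration: a uniform model over 65 symbols
# assigns probability 1/65 to every target, so its perplexity is 65.
vocab_size = 65
uniform_ppl = perplexity([1 / vocab_size] * 10)

# A model that is always certain and always right has perplexity 1.
perfect_ppl = perplexity([1.0, 1.0, 1.0])
```

If a framework reports cross-entropy in bits (log base 2), the matching formula is 2 raised to the loss rather than exp of the loss.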

Labs

Lab 01 — Vanilla RNN Char-Language-Model From Scratch

Field	Value
Goal	Train a character-level RNN on Tiny Shakespeare and generate text.
Concepts	RNN forward, BPTT, character tokenization, sampling.
Steps	1) Char-level tokenize Shakespeare. 2) Implement `RNNCell` from scratch (do NOT use `nn.RNN`). 3) Wrap in a loop with manual hidden-state propagation. 4) Cross-entropy loss. 5) Train ~1k steps. 6) Sample with temperature.
Stack	PyTorch (only `nn.Linear`, `nn.Embedding`, autograd)
Datasets	Tiny Shakespeare (1.1 MB)
Output	A model that generates pseudo-Shakespearean text; loss curve; sample output for temperature ∈ {0.5, 0.8, 1.2}.
How to Test	Smoothed training loss trends downward (per-batch loss is noisy and will not decrease monotonically); samples become more English-like over training.
Talking Points	Why gradients vanish in vanilla RNNs. Why we clip gradients. Why temperature controls diversity.
Resume Bullet	"Implemented a character-level RNN language model from scratch in PyTorch (no `nn.RNN`), trained on Tiny Shakespeare to perplexity 4.1, with temperature-controlled sampling demo."
Extensions	Add gradient clipping; add truncated BPTT for longer sequences.
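
Step 6 (sampling with temperature) can be sketched without any framework. This is a minimal illustration, not the lab's reference solution: `softmax_with_temperature` and `sample_index` are hypothetical helper names, and the three logits are made-up values standing in for the model's output over a tiny vocabulary. Dividing the logits by a temperature below 1 sharpens the distribution; a temperature above 1 flattens it.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(logits, temperature, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Made-up logits for a 3-symbol vocabulary (illustrative values only).
logits = [2.0, 1.0, 0.1]
rng = random.Random(0)  # seeded so repeated runs draw the same samples
for t in (0.5, 0.8, 1.2):
    probs = softmax_with_temperature(logits, t)
    idx = sample_index(logits, t, rng)
    print(f"T={t}: probs={[round(p, 3) for p in probs]} sampled={idx}")
```

In the lab itself the logits would come from the RNN's output layer at each generation step, and the sampled index would be fed back in as the next input character.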

Lab 02 — LSTM & GRU (And Why They Help)

Field	Value
Goal	Implement LSTM and GRU cells from scratch; reproduce their gradient-flow advantage over the vanilla RNN.
Concepts	LSTM gate equations, cell-state highway, GRU simplification, gradient flow comparison.
Steps	1) Implement `LSTMCell` and `GRUCell` from primitives. 2) Train all three (RNN/LSTM/GRU) on Shakespeare. 3) Plot gradient norms over time.
Stack	PyTorch
Output	Three checkpoints + a gradient-norm plot + a perplexity comparison table.
How to Test	LSTM/GRU should beat vanilla RNN on perplexity within the same compute budget.
Talking Points	Walk through the LSTM equations on a whiteboard. Why the cell state is updated additively rather than through a repeated weight-matrix multiplication and squashing nonlinearity. When GRU matches LSTM.
Resume Bullet	"Implemented LSTM and GRU cells from scratch and demonstrated 38% perplexity reduction over vanilla RNN with controlled gradient-norm visualization."
Extensions	Add bidirectional LSTM; benchmark against `nn.LSTM` (cuDNN-fused) for wall-clock.
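
For the whiteboard walk-through, the standard LSTM formulation is written out below as a reference sketch. The weight names (W_{xf}, W_{hf}, and so on) are notation chosen here to match the vanilla-RNN recurrence in the Concepts list; they are not taken from any particular library. σ is the logistic sigmoid and ⊙ is the elementwise product.

```latex
\begin{align}
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) && \text{(forget gate)} \\
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) && \text{(input gate)} \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell-state update)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{align}
% Direct path through the cell state, holding the gates fixed:
\[
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t)
\]
```

Along the direct cell-state path the gradient is scaled only by the forget gate, not by a repeated product of W_h and tanh derivatives as in the vanilla RNN. When f_t stays near 1, that path preserves gradient magnitude across many time steps, which is the "cell-state highway" named in the Concepts row.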

Lab 03 — Seq2Seq + Bahdanau Attention (Toy Translation)

Field	Value
Goal	Build an encoder-decoder with additive attention — the direct precursor to transformer attention.
Concepts	Encoder/decoder split, fixed-context bottleneck, additive attention scores, teacher forcing.
Steps	1) Toy parallel corpus (e.g., date-format conversion: "March 14, 2024" → "2024-03-14"). 2) GRU encoder, GRU decoder. 3) First train without attention. 4) Add Bahdanau attention. 5) Compare both — attention should crush the baseline on long inputs. 6) Visualize attention weights as a heatmap.
Stack	PyTorch
Output	Two trained models + an attention heatmap PNG that clearly shows alignment.
How to Test	Attention model sequence accuracy exceeds the non-attention baseline by ≥ 15 percentage points on long inputs.
Talking Points	The bottleneck problem. Why attention "looks back". The bridge from this to scaled-dot-product attention in Phase 4.
Resume Bullet	"Implemented Bahdanau additive attention in a seq2seq encoder-decoder, achieving 96% sequence accuracy on a date-normalization task vs 71% without attention; produced interpretable attention-alignment visualizations."
Extensions	Replace additive with dot-product (Luong) and compare — natural lead-in to Phase 4.
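
The additive scoring in step 4 can be sketched in plain Python before moving to tensors. This is an illustrative sketch under assumptions: the names `W_s`, `W_h`, and `v` are conventional labels for the learned projections, and the identity matrices and toy vectors below are made-up values, not trained parameters. For one decoder state s and encoder states h_i, it computes score_i = v · tanh(W_s s + W_h h_i), applies softmax over the scores, and returns the weighted sum of encoder states as the context vector.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def additive_attention(decoder_state, encoder_states, W_s, W_h, v):
    """Return (attention weights, context vector) for one decoder step."""
    proj_s = matvec(W_s, decoder_state)
    scores = []
    for h in encoder_states:
        proj_h = matvec(W_h, h)
        hidden = [math.tanh(a + b) for a, b in zip(proj_s, proj_h)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context is the attention-weighted average of the encoder states.
    context = [sum(w * h[d] for w, h in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy inputs (made-up values): three encoder states, one decoder state.
identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
decoder_state = [1.0, 0.0]
weights, context = additive_attention(
    decoder_state, encoder_states, identity, identity, [1.0, 1.0]
)
```

Collecting `weights` for every decoder step gives the rows of the alignment heatmap in step 6. Swapping the tanh-based score for a plain dot product between `decoder_state` and each encoder state gives the dot-product variant mentioned under Extensions.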

Deliverables Checklist

Char-RNN trained on Shakespeare with temperature sampling
LSTM vs GRU vs RNN comparison + gradient-norm plot
Seq2seq with attention + alignment heatmap

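Interview Relevance

"Why did transformers replace RNNs?" Short answer: training parallelizes across time steps, and distant tokens are connected directly rather than through a long recurrent chain, which helps with long-range dependencies.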
"Walk me through LSTM gates"
"Where does scaled-dot-product attention come from historically?"
