Lab 01 — Char-Level RNN (Solution Walkthrough)
Phase: 3 — RNNs & Language Modeling | Difficulty: ⭐⭐⭐☆☆ | Time: 2–4 hours
Concept primer:
../HITCHHIKERS-GUIDE.md§RNNs and §BPTT. This document walks throughsolution.py.
Run
pip install -r requirements.txt
curl -O https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
python solution.py --data input.txt --steps 2000
0. The mission
Train a vanilla RNN (no nn.RNN, no LSTM — by hand) to predict the next character on Tiny Shakespeare (~1 MB). At the end you'll sample text that looks Elizabethan even if it's nonsense. Karpathy's 2015 demo that proved RNNs could do generative modeling.
You are deliberately implementing the worst sequence model — the simplest possible RNN — to feel why Transformers were invented:
- Sequential decoding (no parallel forward over T positions).
- Vanishing gradients past ~50 timesteps.
- Information bottleneck through a single hidden state.
1. The math
A vanilla recurrent cell at each step:
$$ h_t = \tanh(W_{ih} x_t + W_{hh} h_{t-1} + b) $$
For LM, project to logits: $\hat{p}t = \mathrm{softmax}(W{ho} h_t)$. Train with cross-entropy on next-character.
2. VanillaRNNCell
class VanillaRNNCell(nn.Module):
def __init__(self, in_dim, hidden_dim):
super().__init__()
self.W_ih = nn.Linear(in_dim, hidden_dim, bias=False)
self.W_hh = nn.Linear(hidden_dim, hidden_dim, bias=True)
def forward(self, x, h):
return torch.tanh(self.W_ih(x) + self.W_hh(h))
- Two linear layers, one bias. Convention: bias on the recurrent path only (input path's bias is redundant after summing).
tanhnot ReLU. ReLU + recurrent multiplication is unstable: positive activations grow without bound across time-steps.tanh ∈ [-1, 1]keeps state bounded.
3. CharRNN
class CharRNN(nn.Module):
def __init__(self, vocab_size, hidden_dim=256):
super().__init__()
self.embed = nn.Embedding(vocab_size, hidden_dim)
self.cell = VanillaRNNCell(hidden_dim, hidden_dim)
self.head = nn.Linear(hidden_dim, vocab_size)
self.hidden_dim = hidden_dim
def forward(self, x):
B, T = x.shape
h = x.new_zeros(B, self.hidden_dim, dtype=torch.float)
e = self.embed(x)
outs = []
for t in range(T):
h = self.cell(e[:, t], h)
outs.append(h)
out = torch.stack(outs, dim=1)
return self.head(out)
The for loop over time is the heart of the inefficiency. Every forward call serializes T cell evaluations. For T=128, that's 128 sequential CUDA launches → tiny kernels → GPU idle most of the time. Compare a Transformer where the whole sequence is processed in one matmul.
x.new_zeros(...) creates a zero tensor on the same device + dtype family as x — avoids manual .to(device).
4. sample — autoregressive generation
@torch.no_grad()
def sample(self, ctx, n, temperature=1.0):
h = ctx.new_zeros(1, self.hidden_dim, dtype=torch.float)
# Warm up on context
for t in range(ctx.size(1)):
e = self.embed(ctx[:, t])
h = self.cell(e, h)
out = [ctx]
last = ctx[:, -1]
for _ in range(n):
e = self.embed(last)
h = self.cell(e, h)
logits = self.head(h) / max(1e-6, temperature)
probs = F.softmax(logits, dim=-1)
last = torch.multinomial(probs, 1).squeeze(-1)
out.append(last.unsqueeze(1))
return torch.cat(out, dim=1)
Two phases — exactly the same pattern you'll see in Phase 9's KV-cache lab:
- Warm-up / prefill: run the cell across the prompt to populate the hidden state.
- Decode: feed back the previously sampled token, take one cell step, sample.
Temperature: divides logits before softmax.
T = 1: model's natural distribution.T < 1: sharper → more confident → more repetition.T > 1: flatter → more diverse → more nonsense.
5. The training loop
data = torch.tensor([stoi[c] for c in text], dtype=torch.long)
def get_batch():
ix = torch.randint(0, len(data) - args.seq_len - 1, (args.batch,))
x = torch.stack([data[i:i + args.seq_len] for i in ix])
y = torch.stack([data[i + 1:i + 1 + args.seq_len] for i in ix])
return x.to(device), y.to(device)
For Tiny Shakespeare: vocab ~65 chars. Whole dataset fits as a single 1.1M-element tensor.
This is truncated BPTT: gradients only flow within each seq_len-long chunk; we never connect chunks across batch boundaries. For seq_len=128 we backprop through 128 timesteps. Beyond that, vanishing gradients would erase the signal anyway.
torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
Non-optional for RNNs. Without it, exploding gradients (which RNNs do regularly) cause loss spikes and NaN. The clip threshold of 5.0 is empirically standard.
6. Expected output
Tiny Shakespeare, 2000 steps, hidden=256, seq_len=128, on a 4090 (~3 minutes):
step 0 loss=4.1742 ppl=64.97
step 1000 loss=1.7634 ppl=5.83
step 2000 loss=1.5897 ppl=4.91
----- T=0.8 -----
ROMEO: I will not be a man, and the king shall be the world,
And the world is the world's the world to thee...
Sanity numbers:
- Initial loss ≈
log(65)≈ 4.17. ✅ - After training: loss ≈ 1.5–1.7, perplexity 4.5–5.5. Vanilla RNN can't go much lower; LSTMs reach ~1.4, Transformers ~1.2.
- BPC (bits per char) =
loss / log(2)≈ 2.3.
7. Why is this so much worse than a Transformer?
Compared to Phase 4's MiniGPT on the same data:
| Model | Loss | Wall time | Quality |
|---|---|---|---|
| Vanilla RNN, hidden=256 | 1.59 | 3 min | "Shakespeare-shaped" |
| MiniGPT, 6 layers d=128 | 1.32 | 1 min | Looks much more coherent |
Two reasons:
- Vanishing gradient — info from 100 chars ago contributes ~0 to the current hidden state. The Transformer attends directly with no decay.
- Single bottleneck — entire history compressed into one 256-vector. Attention has 128×256 effective state.
The lab is about feeling this gap, not closing it.
8. Common pitfalls
- Forgetting
clip_grad_norm_→ loss explodes around step ~500 withnanoutput. - Don't carry
hacross batches in this lab — that's stateful RNN training, more complex. reshapevsview—viewrequires contiguous memory;reshapedoesn't. The model output fromtorch.stackis contiguous.- Forgetting
@torch.no_grad()onsample— slowdown 2–3× and OOM on long generations.
9. Stretch exercises
- Replace
VanillaRNNCellwithLSTMCell(still by hand). LSTM has 4 gates: input, forget, cell, output. Train for 2000 steps; expect loss → ~1.4 (vs 1.6 for vanilla). - Implement GRU (3 gates). Compare to LSTM.
- Add layer normalization inside the cell — stabilizes longer-context training.
- Statefulness: carry
hacross batches within an epoch; reset at epoch boundary. - Bigger context: train with
seq_len=256or512. Watch loss saturate earlier than the Transformer would. - Sampling tricks: implement top-k and top-p (nucleus) sampling. Compare quality at the same temperature.
- Time it: profile and confirm the for-loop over T dominates wall-time. The exact reason Transformers won.
10. What this lab proves about you
You can implement an autoregressive sequence model from scratch, train it stably, sample from it, and articulate exactly why this architecture lost to attention. Phase-3 milestone.