Lab 01 — Char-Level RNN (Solution Walkthrough)

Phase: 3 — RNNs & Language Modeling | Difficulty: ⭐⭐⭐☆☆ | Time: 2–4 hours

Concept primer: ../HITCHHIKERS-GUIDE.md §RNNs and §BPTT. This document walks through solution.py.

Run

pip install -r requirements.txt
curl -O https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
python solution.py --data input.txt --steps 2000

Train a vanilla RNN (no nn.RNN, no LSTM — by hand) to predict the next character on Tiny Shakespeare (~1 MB). At the end you'll sample text that looks Elizabethan even if it's nonsense. Karpathy's 2015 demo that proved RNNs could do generative modeling.

You are deliberately implementing the worst sequence model — the simplest possible RNN — to feel why Transformers were invented:

Sequential decoding (no parallel forward over T positions).
Vanishing gradients past ~50 timesteps.
Information bottleneck through a single hidden state.

1. The math

A vanilla recurrent cell at each step:

$$ h_t = \tanh(W_{ih} x_t + W_{hh} h_{t-1} + b) $$

For LM, project to logits: $\hat{p}t = \mathrm{softmax}(W{ho} h_t)$. Train with cross-entropy on next-character.

2. `VanillaRNNCell`

class VanillaRNNCell(nn.Module):
    def __init__(self, in_dim, hidden_dim):
        super().__init__()
        self.W_ih = nn.Linear(in_dim, hidden_dim, bias=False)
        self.W_hh = nn.Linear(hidden_dim, hidden_dim, bias=True)

    def forward(self, x, h):
        return torch.tanh(self.W_ih(x) + self.W_hh(h))

Two linear layers, one bias. Convention: bias on the recurrent path only (input path's bias is redundant after summing).
tanh not ReLU. ReLU + recurrent multiplication is unstable: positive activations grow without bound across time-steps. tanh ∈ [-1, 1] keeps state bounded.

3. `CharRNN`

class CharRNN(nn.Module):
    def __init__(self, vocab_size, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.cell = VanillaRNNCell(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, vocab_size)
        self.hidden_dim = hidden_dim

    def forward(self, x):
        B, T = x.shape
        h = x.new_zeros(B, self.hidden_dim, dtype=torch.float)
        e = self.embed(x)
        outs = []
        for t in range(T):
            h = self.cell(e[:, t], h)
            outs.append(h)
        out = torch.stack(outs, dim=1)
        return self.head(out)

The for loop over time is the heart of the inefficiency. Every forward call serializes T cell evaluations. For T=128, that's 128 sequential CUDA launches → tiny kernels → GPU idle most of the time. Compare a Transformer where the whole sequence is processed in one matmul.

x.new_zeros(...) creates a zero tensor on the same device + dtype family as x — avoids manual .to(device).

4. `sample` — autoregressive generation

@torch.no_grad()
def sample(self, ctx, n, temperature=1.0):
    h = ctx.new_zeros(1, self.hidden_dim, dtype=torch.float)
    # Warm up on context
    for t in range(ctx.size(1)):
        e = self.embed(ctx[:, t])
        h = self.cell(e, h)
    out = [ctx]
    last = ctx[:, -1]
    for _ in range(n):
        e = self.embed(last)
        h = self.cell(e, h)
        logits = self.head(h) / max(1e-6, temperature)
        probs = F.softmax(logits, dim=-1)
        last = torch.multinomial(probs, 1).squeeze(-1)
        out.append(last.unsqueeze(1))
    return torch.cat(out, dim=1)

Two phases — exactly the same pattern you'll see in Phase 9's KV-cache lab:

Warm-up / prefill: run the cell across the prompt to populate the hidden state.
Decode: feed back the previously sampled token, take one cell step, sample.

Temperature: divides logits before softmax.

T = 1: model's natural distribution.
T < 1: sharper → more confident → more repetition.
T > 1: flatter → more diverse → more nonsense.

5. The training loop

data = torch.tensor([stoi[c] for c in text], dtype=torch.long)

def get_batch():
    ix = torch.randint(0, len(data) - args.seq_len - 1, (args.batch,))
    x = torch.stack([data[i:i + args.seq_len] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + args.seq_len] for i in ix])
    return x.to(device), y.to(device)

For Tiny Shakespeare: vocab ~65 chars. Whole dataset fits as a single 1.1M-element tensor.

This is truncated BPTT: gradients only flow within each seq_len-long chunk; we never connect chunks across batch boundaries. For seq_len=128 we backprop through 128 timesteps. Beyond that, vanishing gradients would erase the signal anyway.

torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)

Non-optional for RNNs. Without it, exploding gradients (which RNNs do regularly) cause loss spikes and NaN. The clip threshold of 5.0 is empirically standard.

6. Expected output

Tiny Shakespeare, 2000 steps, hidden=256, seq_len=128, on a 4090 (~3 minutes):

step     0  loss=4.1742  ppl=64.97
step  1000  loss=1.7634  ppl=5.83
step  2000  loss=1.5897  ppl=4.91

----- T=0.8 -----
ROMEO: I will not be a man, and the king shall be the world,
And the world is the world's the world to thee...

Sanity numbers:

Initial loss ≈ log(65) ≈ 4.17. ✅
After training: loss ≈ 1.5–1.7, perplexity 4.5–5.5. Vanilla RNN can't go much lower; LSTMs reach ~1.4, Transformers ~1.2.
BPC (bits per char) = loss / log(2) ≈ 2.3.

7. Why is this so much worse than a Transformer?

Compared to Phase 4's MiniGPT on the same data:

Model	Loss	Wall time	Quality
Vanilla RNN, hidden=256	1.59	3 min	"Shakespeare-shaped"
MiniGPT, 6 layers d=128	1.32	1 min	Looks much more coherent

Two reasons:

Vanishing gradient — info from 100 chars ago contributes ~0 to the current hidden state. The Transformer attends directly with no decay.
Single bottleneck — entire history compressed into one 256-vector. Attention has 128×256 effective state.

The lab is about feeling this gap, not closing it.

8. Common pitfalls

Forgetting clip_grad_norm_ → loss explodes around step ~500 with nan output.
Don't carry h across batches in this lab — that's stateful RNN training, more complex.
reshape vs view — view requires contiguous memory; reshape doesn't. The model output from torch.stack is contiguous.
Forgetting @torch.no_grad() on sample — slowdown 2–3× and OOM on long generations.

9. Stretch exercises

Replace VanillaRNNCell with LSTMCell (still by hand). LSTM has 4 gates: input, forget, cell, output. Train for 2000 steps; expect loss → ~1.4 (vs 1.6 for vanilla).
Implement GRU (3 gates). Compare to LSTM.
Add layer normalization inside the cell — stabilizes longer-context training.
Statefulness: carry h across batches within an epoch; reset at epoch boundary.
Bigger context: train with seq_len=256 or 512. Watch loss saturate earlier than the Transformer would.
Sampling tricks: implement top-k and top-p (nucleus) sampling. Compare quality at the same temperature.
Time it: profile and confirm the for-loop over T dominates wall-time. The exact reason Transformers won.

10. What this lab proves about you

You can implement an autoregressive sequence model from scratch, train it stably, sample from it, and articulate exactly why this architecture lost to attention. Phase-3 milestone.

LLM Inference Engineer