🛸 Hitchhiker's Guide — Phase 5: Training Small LLMs

Read this if: You can build a MiniGPT but you've never trained one to convergence on real data, or you don't yet have a feel for "this loss curve looks healthy", "this is the LR I should use for a 124M model", "this is what 50 GPU-hours of pretraining buys you".

0. The 30-second mental model

Pretraining = run AdamW on a MiniGPT-style architecture for billions of next-token prediction steps over a giant deduplicated text corpus, using mixed precision, with a warmup-then-decay learning rate schedule, gradient accumulation to reach a large effective batch, and frequent checkpointing. Watch the loss go down. Sample. Cry tears of joy. That's pretraining.

By the end of Phase 5 you should:

Train nanoGPT on TinyStories and produce coherent toy text.
Understand and tune: batch size, learning rate, warmup, weight decay, gradient clipping, gradient accumulation, mixed precision (bf16/fp16/fp8).
Read and apply scaling laws (Kaplan, Chinchilla, MoE corrections).
Diagnose loss spikes, NaN, slow convergence, and undertraining.
Know the data preparation pipeline: tokenize → shard → memory-map → uint16 .bin.
Be ready to discuss real pretraining at the 1B–70B scale (Phase 10 will go deeper).

1. The pretraining objective

Same as Phase 3: minimize cross-entropy of next-token prediction. For a sequence of token IDs x_0, x_1, …, x_{T-1}, the model produces logits (T, V) and the loss is:

loss = F.cross_entropy(logits[:-1].reshape(-1, V), x[1:].reshape(-1))

Note the shift by 1: position t predicts position t+1. A common bug is forgetting this shift; the model then learns identity (loss → 0 instantly). The lab's sanity_overfit_one_batch catches it.
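
The shift can be sketched in plain Python, without PyTorch, to make the indexing explicit. This is a minimal illustration, not the lab's code: the helper names `log_softmax` and `next_token_loss` are hypothetical, and logits are assumed to be a list of T rows of V floats.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_z for v in row]

def next_token_loss(logits, tokens):
    """Mean cross-entropy where the logits at position t score token t+1.

    logits: T rows of V floats; tokens: T integer ids.
    The last row has no target, so it is dropped (the logits[:-1] / x[1:] shift).
    """
    inputs, targets = logits[:-1], tokens[1:]
    nll = [-log_softmax(row)[tgt] for row, tgt in zip(inputs, targets)]
    return sum(nll) / len(nll)

V = 8
tokens = [3, 1, 4, 1, 5]

# An untrained model is roughly uniform over the vocabulary: loss equals log(V).
uniform = [[0.0] * V for _ in tokens]
loss_uniform = next_token_loss(uniform, tokens)

# Logits that put all their mass on the *next* token drive the loss toward 0.
peaked = [[0.0] * V for _ in tokens]
for t in range(len(tokens) - 1):
    peaked[t][tokens[t + 1]] = 50.0
loss_peaked = next_token_loss(peaked, tokens)
```

The uniform case is also where the "initial loss ≈ log(V)" sanity check comes from.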

2. Optimizers — what's actually happening

2.1 SGD — the conceptual baseline

θ ← θ - η · ∇_θ L. Simple, but for transformers it's terrible without momentum and tuning.

2.2 Momentum / Nesterov

Track a running average of gradients; update with that. Smooths out noisy gradients.

2.3 Adam (Kingma & Ba, 2014)

For each parameter, maintain two moving averages:

m_t = β₁ m_{t-1} + (1 - β₁) g_t — first moment (mean of gradient).
v_t = β₂ v_{t-1} + (1 - β₂) g_t² — second moment (uncentered variance).

Bias-correct (m̂ = m / (1 - β₁ᵗ), etc.), then update:

$$ θ ← θ - η · \hat{m} / (\sqrt{\hat{v}} + ε) $$

Intuition: Adam is per-parameter learning-rate adaptation. Parameters with consistently large gradients get smaller effective updates; sparse-gradient parameters get larger ones.

2.4 AdamW (Loshchilov & Hutter, 2019)

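Vanilla Adam with L2 regularization adds λθ to the gradient, so the decay term gets divided by √v̂ along with everything else, and weights with large gradients are decayed less than intended. AdamW decouples the two: θ ← θ - η (m̂ / (√v̂ + ε) + λ θ). Same intuition, but the decay is applied directly to the weights. Always use AdamW, never Adam with L2, for transformers.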

Hyperparameters (sane defaults for transformers):

β = (0.9, 0.95) (note: β₂ = 0.95, not 0.999 — empirically better for LLMs)
weight_decay = 0.1
eps = 1e-8
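
To make the update concrete, here is a single-parameter sketch of one AdamW step in plain Python. It follows the formulas above; the function name `adamw_step` and the scalar setup are illustrative assumptions, not how a real optimizer is organised. Setting `weight_decay=0.0` recovers plain Adam.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=6e-4, beta1=0.9, beta2=0.95,
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a single scalar parameter; t is the 1-based step."""
    m = beta1 * m + (1 - beta1) * grad            # first moment
    v = beta2 * v + (1 - beta2) * grad * grad     # second moment
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    adam_term = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled decay: lambda * theta sits outside the adaptive ratio.
    theta = theta - lr * (adam_term + weight_decay * theta)
    return theta, m, v

# On the very first step m_hat = g and sqrt(v_hat) = |g|, so the Adam term is ~sign(g).
theta_adam, _, _ = adamw_step(1.0, 0.5, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.0)
theta_adamw, _, _ = adamw_step(1.0, 0.5, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1)
```

With these toy numbers the Adam-only step moves θ from 1.0 to about 0.9, and the decay term shaves off a further η·λ·θ = 0.01.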

2.5 Lion, Sophia, etc.

Recent alternatives. Lion (Chen et al. 2023) uses sign-of-momentum updates; smaller memory footprint. Sophia (Liu et al. 2023) uses Hessian estimates. Neither has displaced AdamW universally yet.

2.6 Memory cost

AdamW stores 2 floats per parameter (m, v). At fp32 that's 8 × params bytes. A 7B model = 56 GB just for optimizer states — more than the weights themselves. This is why we shard them in FSDP (Phase 10).
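
The arithmetic is easy to script. A small sketch, assuming FP32 optimizer state and nothing else (no master weights, gradients, or activations); `adamw_state_bytes` is a hypothetical helper name.

```python
def adamw_state_bytes(n_params, bytes_per_float=4):
    """Bytes for AdamW's two per-parameter buffers (m and v)."""
    return 2 * bytes_per_float * n_params

for name, n in [("124M", 124e6), ("1B", 1e9), ("7B", 7e9)]:
    print(f"{name}: {adamw_state_bytes(n) / 1e9:.1f} GB of optimizer state")
```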

3. Learning rate schedules

The single biggest training-stability lever after batch size.

3.1 Warmup → Cosine decay (the workhorse)

Warmup (first 1–2% of steps): linearly increase from 0 to peak_lr. Without it, early steps with random weights produce huge gradients that destabilize training.
Cosine decay (remaining steps): lr = min_lr + 0.5 (peak_lr - min_lr) (1 + cos(π t/T_max)). Smooth descent to ~10% of peak.
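
The two phases above can be sketched as one function of the step index. This is an illustrative version, not the lab's exact code; the default values and the choice of `(step + 1)` during warmup (so the first step does not run at exactly lr = 0) are assumptions.

```python
import math

def cosine_lr(step, max_steps, peak_lr=6e-4, min_lr=6e-5, warmup_steps=100):
    """Linear warmup to peak_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    # progress plays the role of t / T_max and runs from 0 to 1 over the decay phase
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

schedule = [cosine_lr(s, max_steps=5000) for s in range(5001)]
```

In a training loop the value would typically be written into every entry of `optimizer.param_groups` before each optimizer step.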

3.2 Warmup-Stable-Decay (WSD)

Warmup → constant peak_lr for ~80% of training → fast cosine decay over last 10–20%.
Lets you take any intermediate checkpoint and "finalize" it with a short decay run. No need to commit to a token budget upfront.
Used in MiniCPM, DeepSeek and increasingly elsewhere.
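
A WSD schedule differs from the cosine schedule only in where the decay starts. The sketch below is one possible shape, assuming a cosine-shaped decay over the last `decay_frac` of training as described above; published recipes vary in the decay shape, and all names and defaults here are illustrative.

```python
import math

def wsd_lr(step, max_steps, peak_lr=6e-4, min_lr=6e-5,
           warmup_steps=100, decay_frac=0.2):
    """Warmup, hold at peak_lr, then decay over the last decay_frac of steps."""
    decay_steps = round(max_steps * decay_frac)
    decay_start = max_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr                      # stable phase
    if step >= max_steps:
        return min_lr
    progress = (step - decay_start) / decay_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

"Finalizing" an intermediate checkpoint then amounts to resuming from a stable-phase checkpoint and running only the decay portion.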

3.3 What `peak_lr` to pick?

Empirical rule: peak_lr ≈ 6e-4 × (124M / params)^0.5 for GPT-style models. For nanoGPT (124M): 6e-4. For 1B: ~2e-4. For 7B: ~8e-5. For 70B: ~2.5e-5. Treat these as starting points for a sweep, not as tuned values.
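
The rule is a one-liner; the sketch below just evaluates it at the model sizes mentioned above. The function name is hypothetical and the rule itself is a heuristic.

```python
def rule_of_thumb_peak_lr(n_params, ref_lr=6e-4, ref_params=124e6):
    """peak_lr ~= ref_lr * sqrt(ref_params / n_params); a starting point only."""
    return ref_lr * (ref_params / n_params) ** 0.5

for label, n in [("124M", 124e6), ("1B", 1e9), ("7B", 7e9), ("70B", 70e9)]:
    print(f"{label:>4}: {rule_of_thumb_peak_lr(n):.1e}")
```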

You can also do an LR range test (Smith 2017): train for a few hundred steps with a linearly increasing LR, find the LR where the loss starts diverging, and divide it by 4–10. Lab 02 uses fixed sane defaults rather than tuning.

4. Batch size and gradient accumulation

4.1 Effective batch and tokens-per-step

Modern LLMs train at 0.5M–4M tokens per step (effective batch). You rarely fit that in one micro-batch on one GPU, so:

effective_batch_size = micro_batch × n_gpus × grad_accum_steps

grad_accum_steps accumulates gradients across forward/backward passes before the optimizer step:

opt.zero_grad()
for k in range(grad_accum_steps):
    micro = next_batch()
    loss = model(micro) / grad_accum_steps   # divide so loss is averaged
    loss.backward()                          # accumulates into .grad
opt.step()

This is mathematically equivalent to a single bigger batch (assuming no batch-norm — which transformers don't use).
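
The equivalence can be checked numerically on a toy problem. The sketch below uses a one-weight linear model with a mean-squared-error loss in plain Python, purely as a stand-in for the transformer; the data and helper names are made up for illustration.

```python
def grad_mse(w, batch):
    """d/dw of mean((w*x - y)^2) over a batch of (x, y) pairs."""
    return sum(2 * x * (w * x - y) for x, y in batch) / len(batch)

def tokens_per_step(micro_batch, block_size, n_gpus, grad_accum_steps):
    """Effective batch measured in tokens."""
    return micro_batch * block_size * n_gpus * grad_accum_steps

data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.0)]
w = 0.5
grad_accum_steps = 4
micro_size = len(data) // grad_accum_steps

full_grad = grad_mse(w, data)                 # one big batch of 8

accum = 0.0                                   # plays the role of .grad
for k in range(grad_accum_steps):
    micro = data[k * micro_size:(k + 1) * micro_size]
    accum += grad_mse(w, micro) / grad_accum_steps   # same effect as scaling the loss

# Example sizing: 16 sequences x 1024 tokens x 8 GPUs x 4 accumulation steps
n_tokens = tokens_per_step(16, 1024, 8, 4)
```

The two gradients agree up to floating-point rounding because all micro-batches have the same size; with unequal micro-batches the per-micro-batch means would need to be weighted.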

4.2 The batch-size–LR coupling

When you increase the batch by a factor of k, you can usually increase the LR by k (linear scaling) or √k (sqrt scaling) without instability. Sqrt scaling is the more conservative of the two and the safer default with Adam-family optimizers.

4.3 Critical batch size

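McCandlish et al. (2018) showed each task has a critical batch size beyond which larger batches stop reducing the number of optimizer steps proportionally, so extra parallelism buys little extra speed. For LLMs the critical batch grows as the loss falls (Kaplan et al. 2020 report that it tracks the loss reached rather than model size as such), so you can use larger batches later in training and for larger models that reach lower loss.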

5. Mixed precision

Goal: use lower-precision math to get more throughput per GPU and fit bigger models.

5.1 The four datatypes

Type	Bits	Exponent	Mantissa	Notes
FP32	32	8	23	Reference; "single precision"
FP16	16	5	10	Tiny range; needs loss scaling
BF16	16	8	7	Same range as FP32; loses mantissa precision
FP8 (E4M3)	8	4	3	H100+; needs per-tensor scaling
FP8 (E5M2)	8	5	2	Wider range; lower precision; gradients

BF16 is the default for pretraining in 2024+. Same exponent range as FP32 means you don't need loss scaling. Mantissa precision is enough for most ops if you keep certain reductions in FP32.
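
The range and precision columns follow directly from the bit counts. The sketch below computes them for the IEEE-style layouts in the table; `float_range` is a hypothetical helper. FP8 E4M3 is left out on purpose, because it reuses some special-value encodings and its largest finite value (448) does not follow the generic formula.

```python
def float_range(exp_bits, man_bits):
    """Largest finite and smallest normal value of an IEEE-style binary float."""
    bias = 2 ** (exp_bits - 1) - 1
    max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    return max_finite, min_normal

formats = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7), "FP8 E5M2": (5, 2)}
for name, (e, m) in formats.items():
    hi, lo = float_range(e, m)
    print(f"{name:9s} max={hi:.3e}  min_normal={lo:.3e}  relative step=2^-{m}")
```

BF16 and FP32 share the same smallest normal value, which is the concrete meaning of "same range"; what BF16 gives up is the relative step size (2^-7 instead of 2^-23).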

5.2 The recipe (PyTorch AMP)

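import torch
import torch.nn.functional as F

# BF16 path: same exponent range as FP32, so no GradScaler is needed.
with torch.amp.autocast("cuda", dtype=torch.bfloat16):
    logits = model(x)                                   # (B, T, V)
    loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
loss.backward()                                         # outside the autocast block
opt.step()
opt.zero_grad(set_to_none=True)

# FP16 path: use dtype=torch.float16 above, plus a scaler created once before the loop:
#   scaler = torch.amp.GradScaler("cuda")
#   scaler.scale(loss).backward(); scaler.step(opt); scaler.update()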

For BF16 you don't need GradScaler. For FP16 you do, because FP16's tiny range (~6e-5 minimum normal) underflows easily; the scaler multiplies the loss by a large number to keep gradients in range, then unscales before the optimizer step.
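
The underflow problem and the effect of scaling can be reproduced with the standard library alone, since `struct`'s `"e"` format is IEEE half precision. This is a toy illustration of the idea, not how GradScaler is implemented; the gradient value and the fixed scale of 2**16 are assumptions for the example (real scalers adjust the scale dynamically).

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest FP16 value via struct's 'e' format."""
    return struct.unpack("e", struct.pack("e", x))[0]

grad = 1e-8                      # a small but meaningful gradient
scale = 65536.0                  # 2**16, illustrative fixed loss scale

unscaled = to_fp16(grad)                 # flushes to zero: below FP16's smallest subnormal
scaled = to_fp16(grad * scale)           # representable once the loss has been scaled
recovered = scaled / scale               # unscale in higher precision before the optimizer step
```

Scaling the loss scales every gradient by the same factor (the chain rule is linear), which is why a single multiplier on the loss is enough.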

5.3 FP8 on H100

Hopper Tensor Cores natively run FP8 matmuls at up to 2× the peak BF16 rate. FP8 training relies on per-tensor delayed scaling (or per-block scaling for finer granularity) to keep values in range. The usual library is NVIDIA's transformer_engine. Phase 10 covers it more deeply.

6. Scaling laws — the most important paper of the era

6.1 Kaplan et al. (2020) — Scaling Laws for Neural Language Models

Loss as a function of compute, parameters, and data follows a clean power law:

$$ L(N) \approx (N_c / N)^{α_N} $$

(Same for D and C.) The bombshell was: at fixed compute C ≈ 6 N D, the optimal allocation favored bigger models. GPT-3 was sized accordingly: 175B params, ~300B tokens.

6.2 Chinchilla (Hoffmann et al., 2022) — Training Compute-Optimal Large Language Models

DeepMind redid the analysis carefully and found that N and D should grow in equal proportion as the compute budget increases, which works out to ~20 tokens per parameter at the optimum. Implication: GPT-3 (175B params, ~300B tokens, under 2 tokens per parameter) was massively undertrained. The 70B Chinchilla model trained on 1.4T tokens beat the 280B Gopher trained on 300B tokens.

This single finding reshaped the field. Recent models deliberately go far past it: Llama-3 8B was trained on ~15T tokens, roughly 1,900 tokens per parameter. That is far beyond Chinchilla-optimal for training compute, but a smaller, longer-trained model is cheaper to serve.

6.3 The compute equation

For a dense transformer:

$$ C ≈ 6 N D \text{ FLOPs} $$

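where N = non-embedding parameters, D = training tokens. The 6 is 2 + 4: the forward pass costs about 2 FLOPs per parameter per token (one multiply and one add), and the backward pass costs about twice the forward, because it computes gradients with respect to both activations and weights. The optimizer step is negligible by comparison. Useful for back-of-envelope cost estimates.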
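
A back-of-envelope calculator built on C ≈ 6ND, as a sketch. The 20 tokens-per-parameter ratio is the Chinchilla rule of thumb from above; the 312 TFLOP/s figure is the A100's dense BF16 peak, and the 40% utilization (MFU) is an assumed value you would replace with your own measurement. All function names are hypothetical.

```python
def train_flops(n_params, n_tokens):
    """C ~= 6 * N * D for a dense transformer."""
    return 6.0 * n_params * n_tokens

def chinchilla_split(compute_flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = tokens_per_param * N for N and D."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

def gpu_hours(compute_flops, peak_flops_per_s, mfu):
    """GPU-hours at a given peak throughput and model FLOPs utilization."""
    return compute_flops / (peak_flops_per_s * mfu) / 3600.0

c = train_flops(70e9, 1.4e12)          # Chinchilla: 70B params, 1.4T tokens
n_opt, d_opt = chinchilla_split(c)     # recovers roughly 70B params and 1.4T tokens

# A 124M model at 20 tokens/param on one A100 (assumed 312e12 peak, 40% MFU)
small = train_flops(124e6, 20 * 124e6)
hours = gpu_hours(small, peak_flops_per_s=312e12, mfu=0.4)
print(f"Chinchilla run: {c:.2e} FLOPs; 124M run: {hours:.1f} GPU-hours")
```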

6.4 References

Kaplan et al. (2020), Scaling Laws for Neural Language Models.
Hoffmann et al. (2022), Training Compute-Optimal Large Language Models (Chinchilla).
Henighan et al. (2020), Scaling Laws for Autoregressive Generative Modeling (multimodal).
Hoffmann's Chinchilla follow-ups. Replications: Pearce et al. (2024).

7. Data preparation for pretraining

Phase 10 covers this in depth. Quick preview:

Source: CommonCrawl (web), GitHub (code), arXiv (science), books, Wikipedia.
Filter: language ID, quality classifier, Gopher rules, perplexity filter.
Dedup: URL → exact → MinHash near-dup.
PII scrub: regex + Presidio.
Tokenize: with your tokenizer; output uint16 (vocab size ≤ 65,536, so every ID fits in 0–65535) or uint32 .bin shards.
Mix and shuffle: weighted source mixing, deterministic shuffle.

For Lab 02 (nanoGPT on TinyStories), only the Tokenize step (the fifth item above) applies: the dataset is small and pre-cleaned.

8. The lab walkthrough (lab-02-nano-gpt)

8.1 What you'll build

A working prepare → train → sample CLI that:

Prepare: downloads TinyStories (Eldan & Li, 2023; ~500MB of short stories generated by GPT-3.5 and GPT-4, restricted to the ~1,500-word vocabulary a 3-to-4-year-old would understand), tokenizes with GPT-2's tokenizer, dumps to train.bin / val.bin (uint16 memory-mapped arrays).
Train: imports MiniGPT and GPTConfig from Phase 4; trains for max_iters (default 5000) with bf16 AMP, gradient accumulation, cosine schedule.
Sample: loads checkpoint, runs autoregressive generation with top-k + temperature.
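
The sampling step can be sketched without a model. Below, `next_logits` is a hypothetical hook that stands in for a forward pass returning the logits of the last position; the rest is a plain-Python version of top-k filtering with temperature, not the lab's implementation.

```python
import math
import random

def sample_top_k(logits, temperature=1.0, top_k=50, rng=random):
    """Sample one token id from the top_k highest logits after temperature scaling."""
    k = min(top_k, len(logits))
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]     # unnormalized softmax over the top k
    return rng.choices(top, weights=weights, k=1)[0]

def generate(next_logits, prompt, n_new, **kw):
    """Autoregressive loop: append one sampled token at a time."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(sample_top_k(next_logits(tokens), **kw))
    return tokens

# Toy "model" that always prefers token 1; with top_k=1 sampling is greedy.
out = generate(lambda toks: [0.0, 1.0, 0.0], prompt=[0], n_new=3, top_k=1)
```

Lower temperatures sharpen the distribution toward the highest logit; `top_k=1` is greedy decoding regardless of temperature.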

8.2 What "healthy" looks like

Initial loss ≈ log(50257) ≈ 10.8.
After 100 steps: ~6 (model has learned unigram distribution).
After 1000 steps: ~3 (basic word patterns).
After 5000 steps on TinyStories with a 6-layer 384-dim model: ~1.5–2.0 (coherent simple stories).

8.3 Why memory-mapped uint16 .bin?

A 5GB tokenized corpus loaded into RAM costs 5GB. Opened with np.memmap, it costs almost nothing up front: the OS pages in only the regions you actually read, which makes random access for batch sampling cheap. uint16 (2 bytes/token) halves disk usage compared with uint32.
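
A minimal sketch of the write-then-memmap round trip with NumPy, including the +1 shift when cutting (x, y) windows. The function names, the temporary path, and the fake token stream are assumptions for illustration; the lab's own prepare and get_batch code may differ in detail.

```python
import os
import tempfile
import numpy as np

def write_bin(path, token_ids):
    """Dump token ids as a flat uint16 array (every id must be below 65536)."""
    arr = np.asarray(token_ids, dtype=np.int64)
    assert arr.min() >= 0 and arr.max() < 2 ** 16, "ids do not fit in uint16"
    arr.astype(np.uint16).tofile(path)

def get_batch(path, batch_size, block_size, rng):
    """Sample batch_size random windows and split each into inputs x and targets y."""
    data = np.memmap(path, dtype=np.uint16, mode="r")     # no bulk read into RAM
    starts = rng.integers(0, len(data) - block_size, size=batch_size)
    x = np.stack([data[s:s + block_size] for s in starts]).astype(np.int64)
    y = np.stack([data[s + 1:s + 1 + block_size] for s in starts]).astype(np.int64)
    return x, y

path = os.path.join(tempfile.mkdtemp(), "train.bin")
write_bin(path, np.arange(10_000))            # stand-in for real token ids
x, y = get_batch(path, batch_size=4, block_size=8, rng=np.random.default_rng(0))
```

The cast to int64 happens only on the sampled windows, so the on-disk format stays at 2 bytes per token while the batch has the integer type an embedding lookup usually expects.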

8.4 Things to read carefully

get_batch() — random offsets within the .bin, slice block_size + 1 tokens, split into (x, y) with the +1 shift.
The training loop's grad_accum arithmetic.
The cosine schedule with warmup function.
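torch.amp.autocast placement: wrap only the forward pass and the loss. Call backward() outside the context (each backward op reuses the dtype autocast picked for its forward op), and let the optimizer step update the FP32 parameters.
The @torch.no_grad() eval block: no autograd graph is built, so activations are freed right away and evaluation uses far less memory.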

8.5 Cost expectation

On a single A100 40GB, the default config (~10M params, 5k steps, batch 64 × 256 tokens) trains in ~15–30 minutes. On a consumer GPU (4090): ~30–60 minutes. The resulting model generates believable toddler stories.

9. Diagnosing training problems

Symptom	Likely cause	Fix
Loss stuck near `log(V)`	Model isn't training; `requires_grad` off, or LR=0	Check optimizer.param_groups
Loss explodes to NaN at step 1	Bad init; LR too high	Init check; lower LR; add warmup
Loss dropping then suddenly NaN	Single bad batch; FP16 overflow (inf turns into NaN)	Gradient clipping; switch to BF16
Loss looks fine but generation is gibberish	Tokenizer mismatch; off-by-one in shift	Check decode of x[0] looks like text; verify `y = x[1:]`
Loss decreasing slowly	LR too low; batch too small	Raise LR; raise effective batch
Loss plateaus early	Undertrained or undersized	More tokens; bigger model
Eval loss diverges from train	Overfitting (rare in single-epoch pretraining, common when a small dataset is repeated); train/val distribution mismatch	More data; check how the val split was made; dropout (though it is rarely used in pretraining)
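
Gradient clipping, the fix suggested for sudden NaNs, rescales the whole gradient when its global norm exceeds a threshold. The plain-Python sketch below shows the idea on a flat list of gradient values; it is conceptually what `torch.nn.utils.clip_grad_norm_` does across all parameters, but the function here is a hypothetical stand-in, and the choice to raise on a non-finite norm is an assumption about how you might want to handle it.

```python
import math

def clip_grad_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale gradient values so their global L2 norm is at most max_norm.

    Returns the clipped gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(total_norm):
        raise FloatingPointError("non-finite gradient norm: skip the step and inspect the batch")
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

clipped, norm_before = clip_grad_norm([3.0, 4.0], max_norm=1.0)   # norm 5 -> about 1
small, _ = clip_grad_norm([0.3, 0.4], max_norm=1.0)               # norm 0.5 -> unchanged
```

Logging the pre-clip norm every step is a cheap way to spot a loss spike building up before it turns into a NaN.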

10. References

Core:

Karpathy's nanoGPT repo and video lecture.
Kaplan et al. (2020) and Hoffmann et al. (2022) — scaling laws.
Loshchilov & Hutter (2019), Decoupled Weight Decay Regularization (AdamW).
Smith (2017), Cyclical Learning Rates for Training Neural Networks — LR range test.
Eldan & Li (2023), TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Production-scale recipes (read once you finish the lab):

OPT (Zhang et al. 2022) — has a release log of every restart and bug for a 175B model. Eye-opening.
Llama-3 tech report.
DeepSeek-V2 and DeepSeek-V3 tech reports.
Qwen-2 tech report.
Pythia (Biderman et al. 2023) — releases all checkpoints; great for studying training dynamics.

11. Common interview questions on Phase 5 material

Why AdamW and not Adam?
Why do we need LR warmup?
What's the Chinchilla finding in one sentence? Why did it overturn Kaplan?
How do you decide effective batch size?
Walk me through gradient accumulation.
Why BF16 over FP16 for pretraining?
What does the AdamW optimizer cost in memory per parameter?
Loss is NaN at step 200. How do you debug?
You have $50k of compute. What size model and how many tokens?
What's WSD and why is it interesting?
Sketch the training loop on a whiteboard.
How would you know if your model is undertrained?

12. From solid → exceptional

Train nanoGPT on TinyStories. Then train on all of Wikipedia (~30GB tokenized). Document loss curves and final perplexity.
Implement gradient checkpointing by hand (re-compute forward activations during backward instead of storing them). Measure the memory ↔ throughput tradeoff.
Implement torch.compile wrapping; benchmark step time before/after.
Add bf16 mixed precision with FP32 reductions explicitly (not via autocast); confirm equivalent loss.
Read the OPT log book end-to-end; pick three failures and write what you would have done differently.
Implement a scaling-law ablation: train models at sizes 6M, 12M, 25M, 50M for matched compute budgets; fit the power law; predict the loss at 100M; train and verify.
Write a one-page cost model: $/M-tokens-trained for various model sizes on H100 spot.

13. Recommended cadence

Day	Activity
Mon	Watch Karpathy's Let's reproduce GPT-2 (124M) video
Tue	Read Kaplan and Chinchilla papers
Wed	Lab 02 — get nanoGPT training; sample
Thu	Tune LR + batch; run 3 ablations; record loss curves
Fri	Add gradient checkpointing; benchmark
Sat	Read OPT log book; read Pythia paper
Sun	Mock-interview the 12 questions; whiteboard the training loop
