🛸 Hitchhiker's Guide — Phase 5: Training Small LLMs
Read this if: You can build a MiniGPT but you've never trained one to convergence on real data, or you don't yet have a feel for "this loss curve looks healthy", "this is the LR I should use for a 124M model", "this is what 50 GPU-hours of pretraining buys you".
0. The 30-second mental model
Pretraining = run AdamW on a MiniGPT-style architecture for billions of next-token prediction steps over a giant deduplicated text corpus, using mixed precision, with a warmup-then-decay learning rate schedule, gradient accumulation to reach a large effective batch, and frequent checkpointing. Watch the loss go down. Sample. Cry tears of joy. That's pretraining.
By the end of Phase 5 you should:
- Train nanoGPT on TinyStories and produce coherent toy text.
- Understand and tune: batch size, learning rate, warmup, weight decay, gradient clipping, gradient accumulation, mixed precision (bf16/fp16/fp8).
- Read and apply scaling laws (Kaplan, Chinchilla, MoE corrections).
- Diagnose loss spikes, NaN, slow convergence, and undertraining.
- Know the data preparation pipeline: tokenize → shard → memory-map → uint16 .bin.
- Be ready to discuss real pretraining at the 1B–70B scale (Phase 10 will go deeper).
1. The pretraining objective
Same as Phase 3: minimize cross-entropy of next-token prediction. For a sequence of token IDs x_0, x_1, …, x_{T-1}, the model produces logits (T, V) and the loss is:
loss = F.cross_entropy(logits[:-1].reshape(-1, V), x[1:].reshape(-1))
Note the shift by 1: position t predicts position t+1. A common bug is forgetting this shift; the model then learns identity (loss → 0 instantly). The lab's sanity_overfit_one_batch catches it.
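A minimal sketch of the shift in plain Python, with no framework involved. The helper name make_pair and the toy token IDs are made up for illustration; they are not part of the lab.

```python
def make_pair(tokens):
    """Split a token sequence into inputs and next-token targets."""
    x = tokens[:-1]   # positions 0 .. T-2 are fed to the model
    y = tokens[1:]    # the target at position t is the token at t+1
    return x, y

tokens = [5, 9, 2, 7]
x, y = make_pair(tokens)
print(list(zip(x, y)))  # [(5, 9), (9, 2), (2, 7)]

# The buggy variant uses the inputs as their own targets, so every
# (input, target) pair is identical and the model only has to copy.
x_bad, y_bad = tokens, tokens
print(all(a == b for a, b in zip(x_bad, y_bad)))  # True
```

If decoded x and y look identical when you print a batch, the shift is missing.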
2. Optimizers — what's actually happening
2.1 SGD — the conceptual baseline
θ ← θ - η · ∇_θ L. Simple, but for transformers it's terrible without momentum and tuning.
2.2 Momentum / Nesterov
Track a running average of gradients; update with that. Smooths out noisy gradients.
2.3 Adam (Kingma & Ba, 2014)
For each parameter, maintain two moving averages:
- m_t = β₁ m_{t-1} + (1 - β₁) g_t, the first moment (mean of the gradient).
- v_t = β₂ v_{t-1} + (1 - β₂) g_t², the second moment (uncentered variance).
Bias-correct (m̂ = m / (1 - β₁ᵗ), etc.), then update:
$$ θ ← θ - η · \hat{m} / (\sqrt{\hat{v}} + ε) $$
Intuition: Adam is per-parameter learning-rate adaptation. Parameters with consistently large gradients get smaller effective updates; sparse-gradient parameters get larger ones.
2.4 AdamW (Loshchilov & Hutter, 2019)
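Vanilla Adam with L2 regularization adds λθ to the gradient, so the decay is divided by √v̂ along with everything else and parameters with large gradients are barely regularized. AdamW decouples the two: θ ← θ - η (m̂ / (√v̂ + ε) + λ θ). Same intuition, but the decay is applied directly to the weights. Always use AdamW, never Adam, for transformers.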
Hyperparameters (sane defaults for transformers):
- β = (0.9, 0.95) (note: β₂ = 0.95, not 0.999; empirically better for LLMs)
- weight_decay = 0.1
- eps = 1e-8
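To make the update rule concrete, here is a sketch of one AdamW step for a single scalar parameter, written directly from the formulas in 2.3 and 2.4. The function name adamw_step is hypothetical, and real implementations (for example torch.optim.AdamW) operate on tensors and may order the decay and the Adam term slightly differently.

```python
import math

def adamw_step(theta, grad, m, v, t, lr=6e-4, betas=(0.9, 0.95),
               eps=1e-8, weight_decay=0.1):
    """One AdamW update for a scalar parameter.

    t is the 1-based step count. Returns the new (theta, m, v).
    Setting weight_decay=0.0 reduces this to plain Adam.
    """
    b1, b2 = betas
    m = b1 * m + (1 - b1) * grad           # first moment (running mean)
    v = b2 * v + (1 - b2) * grad * grad    # second moment (uncentered)
    m_hat = m / (1 - b1 ** t)              # bias correction
    v_hat = v / (1 - b2 ** t)
    adam_term = m_hat / (math.sqrt(v_hat) + eps)
    # Decoupled decay: lambda * theta sits outside the adaptive ratio.
    theta = theta - lr * (adam_term + weight_decay * theta)
    return theta, m, v

theta, m, v = adamw_step(1.0, grad=0.5, m=0.0, v=0.0, t=1,
                         lr=0.1, weight_decay=0.0)
print(theta)  # close to 0.9: the first step has size ~lr whatever |grad| is
```

The first-step behaviour is a useful sanity check: after bias correction, m̂/√v̂ is ±1, so the parameter moves by roughly lr.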
2.5 Lion, Sophia, etc.
Recent alternatives. Lion (Chen et al. 2023) uses sign-of-momentum updates; smaller memory footprint. Sophia (Liu et al. 2023) uses Hessian estimates. Neither has displaced AdamW universally yet.
2.6 Memory cost
AdamW stores 2 floats per parameter (m, v). At fp32 that's 8 × params bytes. A 7B model = 56 GB just for optimizer states — more than the weights themselves. This is why we shard them in FSDP (Phase 10).
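The arithmetic above as a small helper, useful for checking other model sizes. The function names are hypothetical, and 1 GB is taken as 1e9 bytes.

```python
def adamw_state_gb(n_params, bytes_per_value=4, states_per_param=2):
    """Memory for AdamW's m and v buffers, in GB (1 GB = 1e9 bytes)."""
    return n_params * states_per_param * bytes_per_value / 1e9

def weights_gb(n_params, bytes_per_value=4):
    """Memory for the weights themselves at the given precision."""
    return n_params * bytes_per_value / 1e9

for name, n in [("124M", 124e6), ("1B", 1e9), ("7B", 7e9)]:
    print(f"{name}: optimizer {adamw_state_gb(n):.1f} GB, "
          f"fp32 weights {weights_gb(n):.1f} GB")
```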
3. Learning rate schedules
The single biggest training-stability lever after batch size.
3.1 Warmup → Cosine decay (the workhorse)
- Warmup (first 1–2% of steps): linearly increase from 0 to peak_lr. Without it, early steps with random weights produce huge gradients that destabilize training.
- Cosine decay (remaining steps): lr = min_lr + 0.5 (peak_lr - min_lr) (1 + cos(π t/T_max)), where t counts steps since the end of warmup and T_max is the number of decay steps. Smooth descent to ~10% of peak.
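The two phases above can be sketched as one function of the step index. The name lr_at and the default values are assumptions for illustration, not the lab's actual code; the warmup uses (step + 1) so that the very first update does not run at a learning rate of exactly zero.

```python
import math

def lr_at(step, max_steps, peak_lr=6e-4, min_lr=6e-5, warmup_steps=100):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps   # linear ramp
    if step >= max_steps:
        return min_lr
    # progress goes from 0 (end of warmup) to 1 (end of training)
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 99, 100, 2550, 5000):
    print(s, f"{lr_at(s, max_steps=5000):.2e}")
```

In a training loop you would typically write the returned value into each entry of optimizer.param_groups before calling the optimizer step.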
3.2 Warmup-Stable-Decay (WSD)
- Warmup → constant peak_lr for ~80% of training → fast cosine decay over last 10–20%.
- Lets you take any intermediate checkpoint and "finalize" it with a short decay run. No need to commit to a token budget upfront.
- Used in MiniCPM, DeepSeek and increasingly elsewhere.
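A sketch of a WSD schedule under the proportions described above. The function name wsd_lr, the 1% warmup, and the 20% decay fraction are illustrative choices; published recipes differ in the exact shape of the final decay.

```python
import math

def wsd_lr(step, max_steps, peak_lr=6e-4, min_lr=6e-5,
           warmup_frac=0.01, decay_frac=0.2):
    """Warmup, then constant peak_lr, then a short cosine decay."""
    warmup_steps = max(1, int(warmup_frac * max_steps))
    decay_steps = max(1, int(decay_frac * max_steps))
    decay_start = max_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr                      # stable phase
    progress = min(1.0, (step - decay_start) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

for s in (0, 100, 5000, 8000, 9000, 10000):
    print(s, f"{wsd_lr(s, max_steps=10000):.2e}")
```

Because the stable phase does not depend on max_steps, a checkpoint saved during it can be resumed with a fresh decay of any length.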
3.3 What peak_lr to pick?
Empirical rule: peak_lr ≈ 6e-4 × (124M / params)^0.5 for GPT-style. For nanoGPT (124M): 6e-4. For 1B: ~2e-4. For 7B: ~1e-4. For 70B: ~3e-5.
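The rule of thumb as a one-line function, so the quoted values can be reproduced. The name peak_lr_for is hypothetical, and the result is a starting point for tuning rather than a guarantee.

```python
def peak_lr_for(n_params, ref_lr=6e-4, ref_params=124e6):
    """Square-root scaling of peak LR relative to a 124M reference model."""
    return ref_lr * (ref_params / n_params) ** 0.5

for name, n in [("124M", 124e6), ("1B", 1e9), ("7B", 7e9), ("70B", 70e9)]:
    print(name, f"{peak_lr_for(n):.1e}")
```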
You can also do a lr range test (Smith 2017): train for a few hundred steps with linearly-increasing lr; pick the lr where loss starts diverging, divide by 4–10. Lab 02 uses fixed sane defaults rather than tuning.
4. Batch size and gradient accumulation
4.1 Effective batch and tokens-per-step
Modern LLMs train at 0.5M–4M tokens per step (effective batch). You rarely fit that in one micro-batch on one GPU, so:
effective_batch_size = micro_batch × n_gpus × grad_accum_steps
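In practice you usually start from a target number of tokens per step and solve for the accumulation count. A small sketch, assuming micro_batch is measured in sequences of block_size tokens each; the function name and the example numbers are illustrative.

```python
import math

def grad_accum_steps(target_tokens, micro_batch, block_size, n_gpus=1):
    """Accumulation steps needed to reach target_tokens per optimizer step."""
    tokens_per_micro_step = micro_batch * block_size * n_gpus
    return math.ceil(target_tokens / tokens_per_micro_step)

# ~0.5M tokens per step on 8 GPUs with 16 sequences of 1024 tokens each
print(grad_accum_steps(524_288, micro_batch=16, block_size=1024, n_gpus=8))  # 4
```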
grad_accum_steps accumulates gradients across forward/backward passes before the optimizer step:
    opt.zero_grad()
    for k in range(grad_accum_steps):
        micro = next_batch()
        loss = model(micro) / grad_accum_steps  # divide so the summed gradient is the batch average
        loss.backward()                         # accumulates into .grad
    opt.step()                                  # one optimizer update per effective batch
This is mathematically equivalent to a single bigger batch (assuming no batch-norm — which transformers don't use).
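The equivalence can be checked numerically on a toy one-parameter model. This is a sketch with made-up data and a hand-written gradient, not the lab's training loop; it assumes equally sized micro-batches.

```python
def mse_grad(w, xs, ys):
    """d/dw of mean((w*x - y)^2) over the given examples."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

w = 0.5
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 15.9]

# Gradient from one big batch of 8 examples.
full = mse_grad(w, xs, ys)

# Same gradient from 4 micro-batches of 2, each scaled by 1/accum_steps.
accum_steps, micro = 4, 2
accumulated = 0.0
for k in range(accum_steps):
    lo, hi = k * micro, (k + 1) * micro
    accumulated += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accum_steps

print(abs(full - accumulated) < 1e-9)  # True
```

Dropping the division by accum_steps would make the accumulated gradient four times too large, which behaves like an accidental 4× learning rate.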
4.2 The batch-size–LR coupling
When you increase the batch by k, you can usually increase the LR by k (linear scaling) or √k (sqrt scaling) without instability. Sqrt scaling is the more conservative of the two and the safer default for transformers trained with Adam-style optimizers.
4.3 Critical batch size
McCandlish et al. (2018) showed each task has a critical batch size beyond which larger batches give diminishing returns: each step costs more compute but barely reduces the number of steps needed. For LLMs the critical batch grows as the loss falls, so bigger, longer runs can use larger batches as you scale up.
5. Mixed precision
Goal: use lower-precision math to get more throughput per GPU and fit bigger models.
5.1 The four datatypes
| Type | Bits | Exponent | Mantissa | Notes |
|---|---|---|---|---|
| FP32 | 32 | 8 | 23 | Reference; "single precision" |
| FP16 | 16 | 5 | 10 | Tiny range; needs loss scaling |
| BF16 | 16 | 8 | 7 | Same range as FP32; loses mantissa precision |
| FP8 (E4M3) | 8 | 4 | 3 | H100+; needs per-tensor scaling |
| FP8 (E5M2) | 8 | 5 | 2 | Wider range; lower precision; gradients |
BF16 is the default for pretraining in 2024+. Same exponent range as FP32 means you don't need loss scaling. Mantissa precision is enough for most ops if you keep certain reductions in FP32.
5.2 The recipe (PyTorch AMP)
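    import torch
    import torch.nn.functional as F

    # BF16 path: autocast only, no GradScaler. Assumes model, opt, x, y already exist.
    with torch.amp.autocast("cuda", dtype=torch.bfloat16):
        logits = model(x)                                  # assumed shape (B, T, V)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    loss.backward()
    opt.step()
    opt.zero_grad()
    # FP16 path: use dtype=torch.float16 plus scaler = torch.cuda.amp.GradScaler(), then
    # scaler.scale(loss).backward(); scaler.step(opt); scaler.update()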
For BF16 you don't need GradScaler. For FP16 you do, because FP16's tiny range (~6e-5 minimum normal) underflows easily; the scaler multiplies the loss by a large number to keep gradients in range, then unscales before the optimizer step.
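The range difference can be seen without a GPU. The sketch below uses the standard library's half-precision struct format for FP16 and approximates BF16 by truncating a float32 to its top 16 bits (real hardware rounds to nearest, so treat the BF16 values as approximate). The loss scale of 65536 is an illustrative value.

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE half precision."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x):
    """Approximate bfloat16 by keeping the top 16 bits of a float32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

grad = 1e-8                        # a small gradient value
print(to_fp16(grad))               # 0.0: underflows in FP16
print(to_bf16(grad) > 0)           # True: BF16 keeps FP32's exponent range

scale = 65536.0                    # illustrative loss scale
scaled = to_fp16(grad * scale)     # now inside FP16's representable range
print(scaled / scale)              # roughly 1e-8 again after unscaling

# The price BF16 pays is mantissa precision.
print(to_fp16(1 + 2**-10), to_bf16(1 + 2**-10))  # 1.0009765625 1.0

try:
    to_fp16(70000.0)               # FP16 tops out around 65504
except (OverflowError, struct.error):
    print("70000 does not fit in FP16")
```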
5.3 FP8 on H100
Hopper TensorCores natively run FP8 matmul at 2× the rate of BF16. Used with per-tensor delayed scaling (or per-block scaling for finer granularity). Library: NVIDIA's transformer_engine. Phase 10 covers it more deeply.
6. Scaling laws: the most important papers of the era
6.1 Kaplan et al. (2020) — Scaling Laws for Neural Language Models
Loss as a function of compute, parameters, and data follows a clean power law:
$$ L(N) \approx (N_c / N)^{α_N} $$
(Same for D and C.) The bombshell was: at fixed compute C ≈ 6 N D, the optimal allocation favored bigger models. GPT-3 was sized accordingly: 175B params, ~300B tokens.
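Because the law is a straight line in log-log space, fitting it is an ordinary least-squares problem. The sketch below fits α_N and N_c to synthetic points; the "true" constants are placeholders chosen for illustration, not measurements, and fit_power_law is a hypothetical helper name.

```python
import math

def fit_power_law(ns, losses):
    """Fit L(N) = (N_c / N) ** alpha by least squares on log-log data.

    log L = alpha * log N_c - alpha * log N, so slope = -alpha and
    intercept = alpha * log N_c. Returns (alpha, n_c).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    alpha = -slope
    return alpha, math.exp(intercept / alpha)

# Synthetic "measurements" generated from placeholder constants.
alpha_true, n_c_true = 0.076, 8.8e13
sizes = [6e6, 12e6, 25e6, 50e6]
losses = [(n_c_true / n) ** alpha_true for n in sizes]

alpha, n_c = fit_power_law(sizes, losses)
predicted = (n_c / 100e6) ** alpha      # extrapolate to a 100M model
print(f"alpha={alpha:.3f}  N_c={n_c:.2e}  L(100M)={predicted:.3f}")
```

With real runs the points will not sit exactly on the line, so the interesting number is how far the extrapolated loss lands from the model you then actually train.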
6.2 Chinchilla (Hoffmann et al., 2022) — Training Compute-Optimal Large Language Models
DeepMind redid the analysis carefully and found N and D should scale in equal proportion as compute grows, which works out to ~20 tokens per parameter at the optimum. Implication: GPT-3 was massively undertrained. The 70B Chinchilla model trained on 1.4T tokens beat the 280B Gopher trained on 300B tokens.
This single finding reshaped the field. Llama models train at 200×+ tokens per param (Llama-3 8B trained on 15T tokens, roughly 1,900 tokens per parameter: far beyond Chinchilla-optimal, but it yields better inference economics).
6.3 The compute equation
For a dense transformer:
$$ C ≈ 6 N D \text{ FLOPs} $$
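where N = non-embedding parameters, D = training tokens. The 6 comes from 2 FLOPs per multiply-add × 3 passes over the weights per token (1 forward + 2 backward, since the backward pass computes gradients with respect to both activations and weights). Useful for back-of-envelope cost estimates.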
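A back-of-envelope calculator built on C ≈ 6 N D and the ~20 tokens-per-parameter rule from 6.2. The function names are hypothetical, and the peak throughput and utilization passed to gpu_hours are assumptions you must supply for your own hardware.

```python
def train_flops(n_params, n_tokens):
    """Approximate training compute for a dense transformer."""
    return 6 * n_params * n_tokens

def chinchilla_split(flops, tokens_per_param=20):
    """Solve C = 6 * N * D with D = tokens_per_param * N for (N, D)."""
    n = (flops / (6 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

def gpu_hours(flops, peak_flops_per_s, mfu):
    """Wall-clock GPU-hours at a given peak rate and utilization (0-1)."""
    return flops / (peak_flops_per_s * mfu) / 3600

c = train_flops(70e9, 1.4e12)          # Chinchilla: 70B params, 1.4T tokens
print(f"{c:.2e} FLOPs")                # 5.88e+23 FLOPs
n, d = chinchilla_split(c)
print(f"N={n:.2e}  D={d:.2e}")         # recovers ~7e10 params, ~1.4e12 tokens

# Assumed numbers: 312e12 FLOP/s peak, 40% utilization.
print(f"{gpu_hours(train_flops(124e6, 2.48e9), 312e12, 0.4):.1f} GPU-hours")
```

Running the split in reverse is the quick answer to "I have C FLOPs, what should I train?": pick the ratio, solve for N, and read off D.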
6.4 References
- Kaplan et al. (2020), Scaling Laws for Neural Language Models.
- Hoffmann et al. (2022), Training Compute-Optimal Large Language Models (Chinchilla).
- Henighan et al. (2020), Scaling Laws for Autoregressive Generative Modeling (multimodal).
- Hoffmann's Chinchilla follow-ups. Replications: Pearce et al. (2024).
7. Data preparation for pretraining
Phase 10 covers this in depth. Quick preview:
- Source: CommonCrawl (web), GitHub (code), arXiv (science), books, Wikipedia.
- Filter: language ID, quality classifier, Gopher rules, perplexity filter.
- Dedup: URL → exact → MinHash near-dup.
- PII scrub: regex + Presidio.
- Tokenize: with your tokenizer; output uint16 (token IDs ≤ 65535, i.e. vocab size ≤ 65536) or uint32 .bin shards.
- Mix and shuffle: weighted source mixing, deterministic shuffle.
For Lab 02 (nanoGPT on TinyStories): only step 5 (tokenize) applies. The dataset is small and pre-cleaned.
8. The lab walkthrough (lab-02-nano-gpt)
8.1 What you'll build
A working prepare → train → sample CLI that:
- Prepare: downloads TinyStories (Eldan & Li, 2023; ~500MB of GPT-3.5-generated 4-year-old-level stories with vocabulary ~1500 words), tokenizes with GPT-2's tokenizer, dumps to train.bin / val.bin (uint16 memory-mapped arrays).
- Train: imports MiniGPT and GPTConfig from Phase 4; trains for max_iters (default 5000) with bf16 AMP, gradient accumulation, cosine schedule.
- Sample: loads checkpoint, runs autoregressive generation with top-k + temperature.
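For the sampling step, top-k with temperature reduces to a few lines once you have the logits for the next position. This is a plain-Python sketch over a list of floats; the function name sample_top_k is hypothetical, and the lab's version will operate on tensors.

```python
import math
import random

def sample_top_k(logits, k=3, temperature=1.0, rng=random):
    """Sample one token id from the k highest logits after temperature scaling."""
    # Indices of the k largest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                                # subtract max for stability
    weights = [math.exp(s - m) for s in scaled]    # unnormalized softmax
    return rng.choices(top, weights=weights, k=1)[0]

logits = [0.1, 2.0, -1.0, 0.5]
print(sample_top_k(logits, k=1))                   # 1 (k=1 is greedy decoding)
rng = random.Random(0)
print({sample_top_k(logits, k=2, rng=rng) for _ in range(200)} <= {1, 3})  # True
```

Lower temperatures sharpen the distribution toward the top logit; higher ones flatten it across the k candidates.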
8.2 What "healthy" looks like
- Initial loss ≈ log(50257) ≈ 10.8.
- After 100 steps: ~6 (model has learned unigram distribution).
- After 1000 steps: ~3 (basic word patterns).
- After 5000 steps on TinyStories with a 6-layer 384-dim model: ~1.5–2.0 (coherent simple stories).
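It can help to read these milestones as perplexities, since exp(loss) is the effective number of equally likely next tokens. A short sketch using the loss values quoted above; the perplexity helper is a hypothetical name.

```python
import math

def perplexity(loss):
    """Convert a mean cross-entropy loss (in nats) to perplexity."""
    return math.exp(loss)

vocab_size = 50257
print(round(math.log(vocab_size), 2))   # 10.82: loss of a uniform guess

for label, loss in [("init", 10.8), ("100 steps", 6.0),
                    ("1000 steps", 3.0), ("5000 steps", 2.0)]:
    print(f"{label:>10}: loss {loss:4.1f} -> perplexity {perplexity(loss):,.1f}")
```

An initial loss far from log(vocab_size) usually points to an initialization or data problem before any training has happened.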
8.3 Why memory-mapped uint16 .bin?
A 5GB tokenized corpus loaded into RAM = 5GB. As np.memmap, it costs ~0 — only the active page is in memory. Cheap random access for batch sampling. uint16 (2 bytes/token) halves disk vs uint32.
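The idea can be demonstrated with the standard library alone. The lab uses np.memmap; this sketch swaps in the stdlib mmap and array modules to stay dependency-free, and the helper names, file name, and toy token IDs are all made up for illustration.

```python
import array
import mmap
import os
import random
import tempfile

def write_tokens(path, token_ids):
    """Dump token ids as raw native-endian uint16 (2 bytes per token)."""
    with open(path, "wb") as f:
        array.array("H", token_ids).tofile(f)

def get_batch(path, batch_size, block_size, rng):
    """Sample (x, y) windows from a memory-mapped uint16 token file."""
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        data = memoryview(mm).cast("H")       # no copy; pages load on access
        try:
            xs, ys = [], []
            for _ in range(batch_size):
                i = rng.randrange(len(data) - block_size)
                chunk = data[i:i + block_size + 1].tolist()
                xs.append(chunk[:-1])         # inputs
                ys.append(chunk[1:])          # targets, shifted by one
            return xs, ys
        finally:
            data.release()                    # must happen before mmap closes

path = os.path.join(tempfile.mkdtemp(), "train.bin")
write_tokens(path, list(range(1000)))         # toy corpus: ids 0..999
print(os.path.getsize(path))                  # 2000 bytes for 1000 tokens

xs, ys = get_batch(path, batch_size=4, block_size=8, rng=random.Random(0))
print(xs[0][1:] == ys[0][:-1])                # True: y is x shifted by one
```

Only the pages touched by the sampled windows are read from disk, which is why the resident memory stays small however large the file is.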
8.4 Things to read carefully
- get_batch(): random offsets within the .bin, slice block_size + 1 tokens, split into (x, y) with the +1 shift.
- The training loop's grad_accum arithmetic.
- The cosine schedule with warmup function.
- torch.amp.autocast placement (only the forward pass and loss sit inside the context; backward and the optimizer step are called outside it).
- The @torch.no_grad() eval block, which saves memory by not storing activations for backward.
8.5 Cost expectation
On a single A100 40GB, the default config (~10M params, 5k steps, batch 64 × 256 tokens) trains in ~15–30 minutes. On a consumer GPU (4090): ~30–60 minutes. Generates believable toddler stories.
9. Diagnosing training problems
| Symptom | Likely cause | Fix |
|---|---|---|
| Loss stuck near log(V) | Model isn't training; requires_grad off, or LR=0 | Check optimizer.param_groups |
| Loss explodes to NaN at step 1 | Bad init; LR too high | Init check; lower LR; add warmup |
| Loss dropping then suddenly NaN | Single bad batch; FP16 overflow | Gradient clipping; switch to BF16 |
| Loss looks fine but generation is gibberish | Tokenizer mismatch; off-by-one in shift | Check decode of x[0] looks like text; verify y = x[1:] |
| Loss decreasing slowly | LR too low; batch too small | Raise LR; raise effective batch |
| Loss plateaus early | Undertrained or undersized | More tokens; bigger model |
| Eval loss diverges from train | Overfitting (rare in pretraining); data leak | More data; higher dropout (but transformers don't typically use dropout in pretraining) |
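Gradient clipping, which appears as a fix in the table, is short enough to write out. The sketch below clips a flat list of gradient values by their global L2 norm and refuses to proceed on NaN or inf; the function name and the max_norm default of 1.0 are illustrative, and in PyTorch you would normally call torch.nn.utils.clip_grad_norm_ instead.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale gradients so their combined L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm). Raises on NaN/inf so the
    caller can skip the step and inspect the offending batch.
    """
    total = math.sqrt(sum(g * g for g in grads))
    if not math.isfinite(total):
        raise ValueError("non-finite gradient norm")
    scale = min(1.0, max_norm / (total + 1e-6))
    return [g * scale for g in grads], total

clipped, norm = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(norm)                                             # 5.0
print(round(math.sqrt(sum(g * g for g in clipped)), 3)) # 1.0
```

Logging the pre-clip norm every step is cheap and tends to show a loss spike coming several steps before the loss itself moves.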
10. References
Core:
- Karpathy's nanoGPT repo and video lecture.
- Kaplan et al. (2020) and Hoffmann et al. (2022), scaling laws.
- Loshchilov & Hutter (2019), Decoupled Weight Decay Regularization (AdamW).
- Smith (2017), Cyclical Learning Rates for Training Neural Networks — LR range test.
- Eldan & Li (2023), TinyStories: How Small Can Language Models Be and Still Speak Coherent English?
Production-scale recipes (read once you finish the lab):
- OPT (Zhang et al. 2022) — has a release log of every restart and bug for a 175B model. Eye-opening.
- Llama-3 tech report.
- DeepSeek-V2 and DeepSeek-V3 tech reports.
- Qwen-2 tech report.
- Pythia (Biderman et al. 2023) — releases all checkpoints; great for studying training dynamics.
11. Common interview questions on Phase 5 material
- Why AdamW and not Adam?
- Why do we need LR warmup?
- What's the Chinchilla finding in one sentence? Why did it overturn Kaplan?
- How do you decide effective batch size?
- Walk me through gradient accumulation.
- Why BF16 over FP16 for pretraining?
- What does the AdamW optimizer cost in memory per parameter?
- Loss is NaN at step 200. How do you debug?
- You have $50k of compute. What size model and how many tokens?
- What's WSD and why is it interesting?
- Sketch the training loop on a whiteboard.
- How would you know if your model is undertrained?
12. From solid → exceptional
- Train nanoGPT on TinyStories. Then train on all of Wikipedia (~30GB tokenized). Document loss curves and final perplexity.
- Implement gradient checkpointing by hand (re-compute forward activations during backward instead of storing them). Measure the memory ↔ throughput tradeoff.
- Wrap the model in torch.compile; benchmark step time before/after.
- Add bf16 mixed precision with FP32 reductions explicitly (not via autocast); confirm equivalent loss.
- Read the OPT log book end-to-end; pick three failures and write what you would have done differently.
- Implement a scaling-law ablation: train models at sizes 6M, 12M, 25M, 50M for matched compute budgets; fit the power law; predict the loss at 100M; train and verify.
- Write a one-page cost model: $/M-tokens-trained for various model sizes on H100 spot.
13. Recommended cadence
| Day | Activity |
|---|---|
| Mon | Watch Karpathy's Let's reproduce GPT-2 (124M) video |
| Tue | Read Kaplan and Chinchilla papers |
| Wed | Lab 02 — get nanoGPT training; sample |
| Thu | Tune LR + batch; run 3 ablations; record loss curves |
| Fri | Add gradient checkpointing; benchmark |
| Sat | Read OPT log book; read Pythia paper |
| Sun | Mock-interview the 12 questions; whiteboard the training loop |