Phase 5 — Training Small LLMs

Difficulty: ⭐⭐⭐⭐☆ | Estimated Time: 2.5 weeks | Roles supported: Research Engineer Pretraining, Foundation Model Engineer.

Why This Phase Exists

Pretraining job descriptions at Anthropic, OpenAI, and DeepMind ask for some variation of "experience training transformer models end-to-end". Reading about training is not the same as having stared at a loss curve at 3 AM, debugged a NaN, and worked out for yourself why your gradients exploded. This phase produces that experience cheaply.

By the end you will have trained a real (small) language model from scratch with a tokenizer you wrote, on data you cleaned, with a training loop you understand line-by-line.

Concepts

Byte-Pair Encoding (BPE) algorithm + GPT-2 / Llama tokenizer details
Tokenizer training: word frequencies → merges → vocab
nanoGPT-style architecture (Andrej Karpathy)
Dataset packing & sequence packing
Optimizers: AdamW, Lion, Sophia (overview)
Learning-rate schedules: warmup + cosine decay
Mixed precision: BF16 vs FP16, loss scaling
Gradient accumulation (simulating larger batch sizes)
Gradient clipping
Checkpointing strategy (save best, save last, save every N)
Sampling: greedy, multinomial, temperature, top-k, top-p (nucleus), beam, contrastive
Chinchilla scaling laws (intuition)
W&B / TensorBoard logging hygiene
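
For the Chinchilla item above, the intuition can be made concrete with two widely quoted rules of thumb: roughly 20 training tokens per parameter, and roughly 6 FLOPs per parameter per token for a forward plus backward pass. Both constants are approximations assumed here for illustration, and `chinchilla_estimate` is a hypothetical helper name, not part of any library.

```python
def chinchilla_estimate(n_params: float, tokens_per_param: float = 20.0) -> dict:
    """Rule-of-thumb compute-optimal budget for a dense transformer.

    Assumptions (approximations, not exact laws):
      - compute-optimal token count is about 20 tokens per parameter
      - training cost is about 6 * params * tokens FLOPs
    """
    tokens = tokens_per_param * n_params
    flops = 6.0 * n_params * tokens
    return {"params": n_params, "tokens": tokens, "flops": flops}


if __name__ == "__main__":
    # Model sizes mentioned in this phase: 10M, 40M, and 124M (GPT-2 small).
    for n in (10e6, 40e6, 124e6):
        est = chinchilla_estimate(n)
        print(
            f"{n / 1e6:.0f}M params: "
            f"{est['tokens'] / 1e6:.0f}M tokens, {est['flops']:.2e} FLOPs"
        )
```

Under these assumptions a 10M-parameter model wants on the order of 200M training tokens, which is a useful sanity check when choosing how much of a dataset to train on.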

Labs

Lab 01 — BPE Tokenizer From Scratch (Matching GPT-2)

Field	Value
Goal	Build a byte-level BPE tokenizer whose token IDs match `tiktoken`'s GPT-2 encoding on held-out text (target: at least 95% of tokens, per How to Test).
Concepts	BPE training algorithm, byte-level pre-tokenization, merges file format, special tokens.
Steps	1) Implement byte-level pre-tokenization with GPT-2's regex. 2) Build word-frequency counter. 3) Implement merge-ranking loop. 4) Save vocab + merges. 5) Implement encode using the merges. 6) Round-trip test. 7) Compare token sequences against `tiktoken`.
Stack	Python stdlib, `regex`, `tiktoken` (only for validation)
Datasets	TinyStories sample (10 MB) for training the tokenizer
Output	`bpe.py` with `train()` / `encode()` / `decode()` and a vocab + merges file.
How to Test	On a held-out string, your encoder must produce the same token IDs as `tiktoken` GPT-2 on at least 95% of tokens (after vocab alignment).
Talking Points	Why BPE beats word-level (OOV) and char-level (long sequences). Why byte-level. Common BPE pitfalls (whitespace handling).
Resume Bullet	"Implemented byte-level BPE tokenizer from scratch matching `tiktoken` GPT-2 encoding on 95%+ of tokens across a held-out test corpus, including merge-ranking and vocab serialization."
Extensions	Train your own vocab from scratch on a domain corpus; compare to SentencePiece / Unigram.
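
Steps 2 to 6 of Lab 01 can be sketched in pure Python as below. This is a minimal sketch under stated assumptions, not the GPT-2 implementation: the pre-tokenizer is a simplified stdlib `re` pattern rather than GPT-2's regex, tokens are kept as `bytes` objects rather than integer IDs, and ties between equally frequent pairs are broken by comparing the pairs themselves. All function names are hypothetical.

```python
import re
from collections import Counter


def pretokenize(text):
    # Simplified stand-in for GPT-2's regex (assumption): keep a leading
    # space attached to each word, and keep leftover whitespace runs.
    return re.findall(r" ?\S+|\s+", text)


def to_bytes_tuple(piece):
    # Byte-level: every word starts as a tuple of single-byte tokens.
    return tuple(bytes([b]) for b in piece.encode("utf-8"))


def merge_word(word, pair, new_token):
    # Replace each non-overlapping occurrence of `pair`, left to right.
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(new_token)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def train(text, num_merges):
    word_freqs = Counter(to_bytes_tuple(p) for p in pretokenize(text))
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in word_freqs.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        # Most frequent pair wins; tie-break on the pair itself (assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for word, freq in word_freqs.items():
            merged[merge_word(word, best, best[0] + best[1])] += freq
        word_freqs = merged
    return merges


def encode(text, merges):
    ranks = {pair: i for i, pair in enumerate(merges)}
    tokens = []
    for piece in pretokenize(text):
        word = to_bytes_tuple(piece)
        while len(word) > 1:
            # Apply the earliest-learned (lowest-rank) merge present.
            ranked = [(ranks[p], p) for p in zip(word, word[1:]) if p in ranks]
            if not ranked:
                break
            _, pair = min(ranked)
            word = merge_word(word, pair, pair[0] + pair[1])
        tokens.extend(word)
    return tokens


def decode(tokens):
    return b"".join(tokens).decode("utf-8")
```

Because merging only concatenates adjacent byte strings, `decode(encode(text, merges))` returns the original text for any merge list, which is what the round-trip test in step 6 checks. Matching `tiktoken` additionally requires GPT-2's exact pre-tokenization pattern and its merge table mapped to token IDs.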

Lab 02 — nanoGPT From Scratch on TinyStories

Field	Value
Goal	Train a 10–40M parameter decoder-only model from scratch on TinyStories.
Concepts	Architecture wiring, dataset packing, training loop with logging, eval-on-val, sampling for qualitative inspection.
Steps	1) Use Phase 4 transformer + Phase 5 Lab 1 tokenizer. 2) Stream-pack TinyStories into fixed-length sequences. 3) Configure d_model=256, n_layer=6, n_head=8 (~10M params). 4) AdamW, lr=3e-4, warmup 500, cosine to 3e-5. 5) Mixed precision BF16. 6) Log to W&B. 7) Save best checkpoint. 8) Generate stories with temperature/top-p sampling.
Stack	PyTorch 2.x, W&B, your tokenizer from Lab 1
Datasets	TinyStories (~2 GB) — train on a 200 MB subset
Output	A trained checkpoint (~50 MB), W&B run with loss curves, generated samples that read like coherent toddler stories.
How to Test	Train loss < 2.0, val perplexity < 8 on TinyStories val; generated stories are grammatical.
Talking Points	Why TinyStories is the ideal "real" pretraining smoke test. Loss curve diagnostics (saturated, diverging, oscillating). Why warmup matters for AdamW + transformers.
Resume Bullet	"Pre-trained a 28M-parameter decoder-only transformer from scratch on a 200 MB TinyStories slice using a custom BPE tokenizer, mixed-precision BF16, cosine LR schedule, and gradient accumulation; achieved val perplexity 6.9 in 4.2 GPU-hours on a single A100."
Extensions	Scale to 124M (GPT-2 small) on Lambda Labs spot for ~$10; add Chinchilla-optimal compute estimate.
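
Two pieces of Lab 02 are easy to get subtly wrong: stream-packing (step 2) and the warmup plus cosine schedule (step 4). The sketch below uses the lab's values (`lr=3e-4`, 500 warmup steps, decay to `3e-5`); `max_steps`, `eos_id`, and the function names are assumptions introduced for illustration, and the packer simply drops the final partial block.

```python
import math


def lr_at(step, max_steps, peak_lr=3e-4, min_lr=3e-5, warmup_steps=500):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    Assumes max_steps > warmup_steps; steps are counted from 0.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes 1 -> 0
    return min_lr + cosine * (peak_lr - min_lr)


def pack_sequences(docs, block_size, eos_id):
    """Yield (inputs, targets) pairs of length block_size.

    `docs` is any iterable of token-ID lists, so it can be a stream.
    Documents are concatenated with eos_id between them; targets are
    the inputs shifted left by one token.
    """
    buffer = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        while len(buffer) >= block_size + 1:
            chunk = buffer[: block_size + 1]
            yield chunk[:-1], chunk[1:]
            # Keep the last token: it is the first input of the next block.
            buffer = buffer[block_size:]


if __name__ == "__main__":
    for s in (0, 499, 500, 5000, 10000):
        print(s, f"{lr_at(s, max_steps=10000):.2e}")
    print(list(pack_sequences([[1, 2, 3], [4, 5]], block_size=3, eos_id=0)))
```

Packing this way means no padding tokens are ever fed to the model, at the cost of blocks that sometimes span a document boundary.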

Lab 03 — Training Loop Mechanics (Mixed Precision, Grad Accumulation, Checkpointing)

Field	Value
Goal	Add the four production-grade features that turn a toy loop into a real one.
Concepts	`torch.amp.autocast` + `GradScaler` (for FP16) vs native BF16; gradient accumulation math; gradient clipping; checkpoint atomicity.
Steps	1) Wrap forward in `torch.autocast(device_type="cuda", dtype=torch.bfloat16)`. 2) Implement grad accumulation over N micro-steps. 3) `nn.utils.clip_grad_norm_(model.parameters(), 1.0)`. 4) Atomic checkpoint save (`save → fsync → rename`). 5) Resumable training (load optimizer + RNG + step).
Stack	PyTorch
Output	A reusable `trainer.py` used by Phase 6 too.
How to Test	After resuming from a checkpoint, the loss matches an uninterrupted run to within 1e-4 at every subsequent step.
Talking Points	Why BF16 doesn't need GradScaler (wider dynamic range). Why we save optimizer state. Effective batch size = micro-batch × accum × world_size.
Resume Bullet	"Authored a production-grade PyTorch training loop with BF16 mixed precision, gradient accumulation, atomic checkpointing, and deterministic resume; verified that resumed runs replay the loss of an uninterrupted run to within 1e-4."
Extensions	Add gradient checkpointing (activation recomputation) — relevant to Phase 10.
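
Two ideas from Lab 03 can be checked without a GPU. The sketch below is stdlib-only and illustrative: it shows the `save → fsync → rename` pattern using `pickle` (in the real trainer, `torch.save` on the model and optimizer state dicts would take its place), and it verifies the gradient accumulation math on a one-parameter least-squares model. Function names and the checkpoint layout are assumptions.

```python
import os
import pickle
import random
import tempfile


def save_checkpoint_atomic(state, path):
    """Write to a temp file in the same directory, fsync, then rename.

    A crash mid-write leaves the previous checkpoint at `path` intact,
    because os.replace swaps the file in a single step.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # force bytes to disk before the rename
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)


def grad_mse(w, xs, ys):
    # d/dw of mean((w*x - y)^2) over the batch.
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, micro_batch):
    """Sum micro-batch gradients, each scaled by 1 / number of micro-steps.

    Assumes len(xs) is a multiple of micro_batch (equal-size micro-batches).
    """
    n_micro = len(xs) // micro_batch
    total = 0.0
    for i in range(n_micro):
        sl = slice(i * micro_batch, (i + 1) * micro_batch)
        total += grad_mse(w, xs[sl], ys[sl]) / n_micro
    return total


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.1, 5.9, 8.2]
    print(grad_mse(0.5, xs, ys), accumulated_grad(0.5, xs, ys, micro_batch=2))
    with tempfile.TemporaryDirectory() as d:
        ckpt = os.path.join(d, "ckpt.pkl")
        save_checkpoint_atomic({"step": 3, "rng": random.getstate()}, ckpt)
        print(load_checkpoint(ckpt)["step"])
```

The division by the number of micro-steps is the part that is most often forgotten: without it, the accumulated gradient is N times larger than the full-batch gradient and the effective learning rate changes. Saving the RNG state alongside the step counter is what lets a resumed run draw the same random numbers as an uninterrupted one.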

Lab 04 — Sampling Strategies & Generation

Field	Value
Goal	Implement and compare seven decoding strategies; understand quality/diversity tradeoffs.
Concepts	Greedy, multinomial, temperature, top-k, top-p (nucleus), beam search, contrastive search, repetition penalty.
Steps	1) Implement each as a stateless function operating on logits. 2) Generate 50 samples per strategy from your nanoGPT. 3) Compute distinct-n metrics. 4) Plot quality (manual rating) vs diversity.
Stack	PyTorch
Output	`sampling.py` + a comparison report.
How to Test	Greedy is deterministic; high temperature increases entropy of next-token distribution; top-p with p=1.0 reduces to multinomial.
Talking Points	Why temperature alone is insufficient (rare tokens still leak). Why top-p > top-k for variable-entropy distributions. When beam search hurts (open-ended generation).
Resume Bullet	"Implemented seven LLM decoding strategies (greedy, multinomial, temperature, top-k, top-p, beam, contrastive) with quantitative diversity-vs-coherence comparison on a 28M-param model."
Extensions	Implement speculative decoding (preview of Phase 9); implement constrained decoding with grammar (Outlines / lm-format-enforcer).
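
The "stateless function operating on logits" idea from step 1, and the checks listed under How to Test, can be sketched in pure Python on plain lists of floats. This is an illustrative sketch with hypothetical function names; a PyTorch version would apply the same logic to tensors. It covers greedy, temperature, top-k, top-p, and the distinct-n metric, and leaves out beam and contrastive search.

```python
import math
import random


def softmax(logits, temperature=1.0):
    # temperature must be > 0; masked entries (-inf) get probability 0.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def greedy(logits):
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_filter(logits, k):
    # Keep the k largest logits, mask the rest.
    keep = set(sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k])
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


def top_p_filter(logits, p):
    # Keep the smallest set of tokens whose cumulative probability >= p.
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


def sample(logits, temperature=1.0, rng=random):
    # Multinomial sampling from the (optionally filtered) distribution.
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def distinct_n(samples, n):
    # Fraction of n-grams across all samples that are unique.
    ngrams = [tuple(s[i:i + n]) for s in samples for i in range(len(s) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, -1.0]
    rng = random.Random(0)
    print("greedy:", greedy(logits))
    print("top-p 0.7:", top_p_filter(logits, 0.7))
    print("sampled:", [sample(top_k_filter(logits, 2), 0.8, rng) for _ in range(10)])
```

Keeping filters and sampling separate means strategies compose: `sample(top_p_filter(top_k_filter(logits, 50), 0.9), temperature=0.8)` is just function application, and each piece can be unit-tested on its own.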

Deliverables Checklist

BPE tokenizer matching tiktoken on test data
nanoGPT trained on TinyStories with W&B logs and generated samples
Resumable training loop with grad accumulation + clipping
Sampling library + comparison report

Interview Relevance

"Walk me through your training loop"
"How would you debug a NaN loss?"
"Why BF16 over FP16?"
"Explain top-p sampling"
"How would you scale this to 1B parameters?" (sets up Phase 10)
