Phase 5 — Training Small LLMs

Difficulty: ⭐⭐⭐⭐☆ | Estimated Time: 2.5 weeks | Roles supported: Research Engineer Pretraining, Foundation Model Engineer.


Why This Phase Exists

Pretraining job descriptions at Anthropic, OpenAI, and DeepMind all ask for some variation of "experience training transformer models end-to-end". Reading about it is not the same as having stared at a loss curve at 3 AM, debugged a NaN, and worked out for yourself why your gradients exploded. This phase produces that experience cheaply.

By the end you will have trained a real (small) language model from scratch with a tokenizer you wrote, on data you cleaned, with a training loop you understand line-by-line.


Concepts

  • Byte-Pair Encoding (BPE) algorithm + GPT-2 / Llama tokenizer details
  • Tokenizer training: word frequencies → merges → vocab
  • nanoGPT-style architecture (Andrej Karpathy)
  • Dataset packing & sequence packing
  • Optimizers: AdamW, Lion, Sophia (overview)
  • Learning-rate schedules: warmup + cosine decay
  • Mixed precision: BF16 vs FP16, loss scaling
  • Gradient accumulation (simulating larger batch sizes)
  • Gradient clipping
  • Checkpointing strategy (save best, save last, save every N)
  • Sampling: greedy, multinomial, temperature, top-k, top-p (nucleus), beam, contrastive
  • Chinchilla scaling laws (intuition)
  • W&B / TensorBoard logging hygiene
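
As a rough illustration of the warmup + cosine decay item above, the schedule can be written as a pure function of the step number. This is a minimal sketch, not a reference implementation: the function name `lr_at` and the `total_steps` value are hypothetical, and the peak rate, floor, and warmup length simply mirror the Lab 02 settings (3e-4, 3e-5, 500), with the warmup assumed to be counted in optimizer steps.

```python
import math


def lr_at(step, peak_lr=3e-4, min_lr=3e-5, warmup_steps=500, total_steps=10_000):
    """Linear warmup to peak_lr, then cosine decay down to min_lr.

    total_steps is an assumed training length, not a value from the labs.
    """
    if step < warmup_steps:
        # Ramp linearly so the last warmup step lands exactly on peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # progress goes from 0.0 (end of warmup) to 1.0 (end of training).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine
```

In a PyTorch loop, the returned value would typically be written into the `lr` entry of each group in `optimizer.param_groups` before the optimizer step.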

Labs

Lab 01 — BPE Tokenizer From Scratch (Matching GPT-2)

Field | Value
Goal | Build a BPE tokenizer whose output matches tiktoken's GPT-2 encoding token-for-token.
Concepts | BPE training algorithm, byte-level pre-tokenization, merges file format, special tokens.
Steps | 1) Implement byte-level pre-tokenization with GPT-2's regex. 2) Build word-frequency counter. 3) Implement merge-ranking loop. 4) Save vocab + merges. 5) Implement encode using the merges. 6) Round-trip test. 7) Compare token sequences against tiktoken.
Stack | Python stdlib, regex, tiktoken (only for validation)
Datasets | TinyStories sample (10 MB) for training the tokenizer
Output | bpe.py with train() / encode() / decode() and a vocab + merges file.
How to Test | On a held-out string, your encoder must produce the same token IDs as tiktoken's GPT-2 encoding on at least 95% of tokens (after vocab alignment).
Talking Points | Why BPE beats word-level (OOV) and char-level (long sequences). Why byte-level. Common BPE pitfalls (whitespace handling).
Resume Bullet | "Implemented byte-level BPE tokenizer from scratch matching tiktoken GPT-2 encoding on 95%+ of tokens across a held-out test corpus, including merge-ranking and vocab serialization."
Extensions | Train your own vocab from scratch on a domain corpus; compare to SentencePiece / Unigram.
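
The train / encode / decode steps above might look roughly like the sketch below. It makes two simplifying assumptions that the lab itself does not: pre-tokenization uses a plain whitespace pattern from the stdlib `re` module instead of GPT-2's regex, and base token IDs are raw byte values 0 to 255. Because of that, its IDs will not line up with tiktoken's without loading GPT-2's own merges and vocab mapping. All function names are illustrative.

```python
import re
from collections import Counter

# Assumption: simplified stand-in for GPT-2's pre-tokenization regex.
PRETOKENIZE = re.compile(r" ?[^\s]+|\s+")


def _merge(seq, pair, new_id):
    """Replace every adjacent occurrence of pair in seq with new_id."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return tuple(out)


def train(text, num_merges):
    """Learn merges as {(left_id, right_id): new_id}; lower new_id = earlier rank."""
    # Word-frequency counter over byte sequences.
    words = Counter(tuple(w.encode("utf-8")) for w in PRETOKENIZE.findall(text))
    merges = {}
    for _ in range(num_merges):
        pair_counts = Counter()
        for seq, freq in words.items():
            for pair in zip(seq, seq[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        new_id = 256 + len(merges)
        merges[best] = new_id
        merged_words = Counter()
        for seq, freq in words.items():
            merged_words[_merge(seq, best, new_id)] += freq
        words = merged_words
    return merges


def encode(text, merges):
    ids = []
    for w in PRETOKENIZE.findall(text):
        seq = tuple(w.encode("utf-8"))
        while len(seq) > 1:
            candidates = [p for p in zip(seq, seq[1:]) if p in merges]
            if not candidates:
                break
            # Apply the earliest-learned merge first, as in training.
            best = min(candidates, key=merges.get)
            seq = _merge(seq, best, merges[best])
        ids.extend(seq)
    return ids


def decode(ids, merges):
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():  # insertion order: parts exist
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

A round-trip check such as `decode(encode(s, merges), merges) == s` is a cheap first test before comparing against tiktoken.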

Lab 02 — nanoGPT From Scratch on TinyStories

Field | Value
Goal | Train a 10–40M parameter decoder-only model from scratch on TinyStories.
Concepts | Architecture wiring, dataset packing, training loop with logging, eval-on-val, sampling for qualitative inspection.
Steps | 1) Use Phase 4 transformer + Phase 5 Lab 1 tokenizer. 2) Stream-pack TinyStories into fixed-length sequences. 3) Configure d_model=256, n_layer=6, n_head=8 (~10M params). 4) AdamW, lr=3e-4, warmup 500, cosine to 3e-5. 5) Mixed precision BF16. 6) Log to W&B. 7) Save best checkpoint. 8) Generate stories with temperature/top-p sampling.
Stack | PyTorch 2.x, W&B, your tokenizer from Lab 1
Datasets | TinyStories (~2 GB); train on a 200 MB subset
Output | A trained checkpoint (~50 MB), W&B run with loss curves, generated samples that read like coherent toddler stories.
How to Test | Train loss < 2.0, val perplexity < 8 on TinyStories val; generated stories are grammatical.
Talking Points | Why TinyStories is the ideal "real" pretraining smoke test. Loss curve diagnostics (saturated, diverging, oscillating). Why warmup matters for AdamW + transformers.
Resume Bullet | "Pre-trained a 28M-parameter decoder-only transformer from scratch on a 200 MB TinyStories slice using a custom BPE tokenizer, mixed-precision BF16, cosine LR schedule, and gradient accumulation; achieved val perplexity 6.9 in 4.2 GPU-hours on a single A100."
Extensions | Scale to 124M (GPT-2 small) on Lambda Labs spot for ~$10; add Chinchilla-optimal compute estimate.
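
One possible shape for the stream-packing step is sketched below. It assumes documents arrive as lists of token IDs, that an end-of-text token is appended after each document, and that a trailing partial block is simply dropped. The function name `pack_stream` and its parameters are hypothetical, not part of the lab's required interface.

```python
def pack_stream(docs, seq_len, eos_id):
    """Yield (inputs, targets) pairs of length seq_len from a stream of documents.

    docs is any iterable of token-ID lists, so the corpus never has to fit
    in memory at once. Targets are the inputs shifted left by one token.
    """
    buffer = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)  # assumed document separator
        # seq_len + 1 tokens are needed to form seq_len (input, target) positions.
        while len(buffer) >= seq_len + 1:
            block = buffer[: seq_len + 1]
            yield block[:-1], block[1:]
            # Keep the last token: it is the first input of the next block,
            # so every token is used as a target exactly once.
            buffer = buffer[seq_len:]
    # Leftover tokens that do not fill a block are dropped in this sketch.
```

Packing this way avoids padding entirely, at the cost of letting a sequence span document boundaries; the separator token is what lets the model learn where one story ends and the next begins.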

Lab 03 — Training Loop Mechanics (Mixed Precision, Grad Accumulation, Checkpointing)

Field | Value
Goal | Add the four production-grade features that turn a toy loop into a real one.
Concepts | torch.amp.autocast + GradScaler (for FP16) vs native BF16; gradient accumulation math; gradient clipping; checkpoint atomicity.
Steps | 1) Wrap forward in torch.autocast(device_type="cuda", dtype=torch.bfloat16). 2) Implement grad accumulation over N micro-steps. 3) nn.utils.clip_grad_norm_(model.parameters(), 1.0). 4) Atomic checkpoint save (write to a temp file → fsync → rename). 5) Resumable training (load optimizer + RNG + step).
Stack | PyTorch
Output | A reusable trainer.py used by Phase 6 too.
How to Test | After a resume, the loss stays within 1e-4 of an uninterrupted run.
Talking Points | Why BF16 doesn't need GradScaler (it has the same exponent range as FP32, so small gradients rarely underflow). Why we save optimizer state. Effective batch size = micro-batch × accum × world_size.
Resume Bullet | "Authored a production-grade PyTorch training loop with BF16 mixed precision, gradient accumulation, atomic checkpointing, and reproducible resume; verified loss replay within 1e-4 of an uninterrupted run."
Extensions | Add gradient checkpointing (activation recomputation), relevant to Phase 10.
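
The atomic-save step (write, fsync, rename) can be sketched with the standard library alone. This is a hedged illustration: it serializes with `pickle` so it runs without PyTorch, whereas a real trainer would presumably call `torch.save` at the marked line; `atomic_save` and `load_checkpoint` are made-up names.

```python
import os
import pickle
import tempfile


def atomic_save(state, path):
    """Write state to path so that a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)  # assumption: torch.save(state, f) in a real trainer
            f.flush()
            os.fsync(f.fileno())  # push bytes to disk before the rename
        os.replace(tmp_path, path)  # swap the finished file in over the old one
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

If the process dies mid-write, the previous checkpoint at `path` is still intact and only a stray temp file is lost. For resumable training the saved state would hold the model weights, optimizer state, RNG state, and step counter together. On POSIX systems a fully durable save would also fsync the containing directory; that is omitted here.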

Lab 04 — Sampling Strategies & Generation

Field | Value
Goal | Implement and compare seven decoding strategies; understand quality/diversity tradeoffs.
Concepts | Greedy, multinomial, temperature, top-k, top-p (nucleus), beam search, contrastive search, repetition penalty.
Steps | 1) Implement each as a stateless function operating on logits. 2) Generate 50 samples per strategy from your nanoGPT. 3) Compute distinct-n metrics. 4) Plot quality (manual rating) vs diversity.
Stack | PyTorch
Output | sampling.py + a comparison report.
How to Test | Greedy is deterministic; high temperature increases entropy of next-token distribution; top-p with p=1.0 reduces to multinomial.
Talking Points | Why temperature alone is insufficient (rare tokens still leak). Why top-p tends to beat top-k for variable-entropy distributions. When beam search hurts (open-ended generation).
Resume Bullet | "Implemented seven LLM decoding strategies (greedy, multinomial, temperature, top-k, top-p, beam, contrastive) with quantitative diversity-vs-coherence comparison on a 28M-param model."
Extensions | Implement speculative decoding (preview of Phase 9); implement constrained decoding with grammar (Outlines / lm-format-enforcer).
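
To make the "stateless function operating on logits" idea concrete, here is a small sketch of greedy, temperature, and top-p sampling. It deliberately uses plain Python lists rather than PyTorch tensors so it runs without dependencies; the function names are illustrative, and a real sampling.py would presumably operate on batched tensors instead.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; higher temperature flattens the result."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Keep the smallest set of most-likely tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass  # renormalize over the kept tokens
    return filtered


def greedy(logits):
    """Always pick the highest-logit token, so the output is deterministic."""
    return max(range(len(logits)), key=logits.__getitem__)


def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Draw one token ID; top_p=1.0 keeps every token (plain multinomial)."""
    probs = top_p_filter(softmax(logits, temperature), top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def entropy(probs):
    return -sum(q * math.log(q) for q in probs if q > 0)
```

The three checks in How to Test map directly onto these helpers: `greedy` returns the same ID on every call, `entropy(softmax(logits, 2.0))` exceeds `entropy(softmax(logits, 0.5))` for non-uniform logits, and `top_p_filter(probs, 1.0)` leaves the distribution unchanged.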

Deliverables Checklist

  • BPE tokenizer matching tiktoken on test data
  • nanoGPT trained on TinyStories with W&B logs and generated samples
  • Resumable training loop with grad accumulation + clipping
  • Sampling library + comparison report

Interview Relevance

  • "Walk me through your training loop"
  • "How would you debug a NaN loss?"
  • "Why BF16 over FP16?"
  • "Explain top-p sampling"
  • "How would you scale this to 1B parameters?" (sets up Phase 10)