Pretraining job descriptions at Anthropic, OpenAI, and DeepMind all ask for some variation of "experience training transformer models end-to-end". Reading about it is not the same as having stared at a loss curve at 3 AM, debugged a NaN, and worked out for yourself why your gradients exploded. This phase produces that experience cheaply.
By the end you will have trained a real (small) language model from scratch with a tokenizer you wrote, on data you cleaned, with a training loop you understand line-by-line.
BPE training algorithm, byte-level pre-tokenization, merges file format, special tokens.
Steps
1) Implement byte-level pre-tokenization with GPT-2's regex. 2) Build word-frequency counter. 3) Implement merge-ranking loop. 4) Save vocab + merges. 5) Implement encode using the merges. 6) Round-trip test. 7) Compare token sequences against tiktoken.
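The steps above can be sketched in a few dozen lines. This is a minimal illustration, not a reference implementation: it splits on whitespace instead of using GPT-2's regex, keeps merges in memory rather than writing a merges file, and breaks frequency ties by the larger pair, all of which are simplifying assumptions. The names train_bpe, merge_pair, encode and decode are illustrative.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return tuple(out)


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges; returns {(left_id, right_id): new_id}."""
    # Assumption: whitespace split stands in for GPT-2's regex pre-tokenizer.
    word_counts = Counter(tuple(w.encode("utf-8")) for w in text.split())
    merges = {}
    for _ in range(num_merges):
        pair_counts = Counter()
        for word, count in word_counts.items():
            for pair in zip(word, word[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break
        # Most frequent pair wins; ties go to the larger pair (arbitrary but deterministic).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        new_id = 256 + len(merges)  # ids 0..255 are the raw bytes
        merges[best] = new_id
        merged = Counter()
        for word, count in word_counts.items():
            merged[merge_pair(word, best, new_id)] += count
        word_counts = merged
    return merges


def encode(text, merges):
    """Apply learned merges in rank order (lowest new_id first)."""
    ids = tuple(text.encode("utf-8"))
    while len(ids) >= 2:
        pairs = set(zip(ids, ids[1:]))
        best = min(pairs, key=lambda p: merges.get(p, float("inf")))
        if best not in merges:
            break
        ids = merge_pair(ids, best, merges[best])
    return list(ids)


def decode(ids, merges):
    """Rebuild the byte string for each id, then decode as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")


merges = train_bpe("low lower lowest low low", num_merges=5)
ids = encode("lowest low", merges)
print(ids, repr(decode(ids, merges)))
```

Because every merge only concatenates bytes, decode(encode(s)) returns s for any valid UTF-8 string, which is what the round-trip test in step 6 checks.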
Stack
Python stdlib, regex, tiktoken (only for validation)
Datasets
TinyStories sample (10 MB) for training the tokenizer
Output
bpe.py with train() / encode() / decode() and a vocab + merges file.
How to Test
On a held-out string, your encoder must produce the same token IDs as tiktoken GPT-2 on at least 95% of tokens (after vocab alignment).
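The 95% figure needs a definition of "same token IDs" when the two sequences differ in length. One possible definition, offered here as an assumption rather than the intended metric, aligns the two ID sequences with difflib and counts how many reference tokens fall inside matching blocks. token_match_rate is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def token_match_rate(reference_ids, candidate_ids):
    """Fraction of reference tokens that also appear, in order, in the candidate."""
    if not reference_ids:
        return 1.0 if not candidate_ids else 0.0
    matcher = SequenceMatcher(a=reference_ids, b=candidate_ids, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference_ids)


# One substituted token out of four leaves three aligned matches.
print(token_match_rate([1, 2, 3, 4], [1, 2, 9, 4]))
```

In practice reference_ids would come from tiktoken's GPT-2 encoding and candidate_ids from your own encoder, after mapping both to a shared ID space.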
Talking Points
Why BPE beats word-level (OOV) and char-level (long sequences). Why byte-level. Common BPE pitfalls (whitespace handling).
Resume Bullet
"Implemented byte-level BPE tokenizer from scratch matching tiktoken GPT-2 encoding on 95%+ of tokens across a held-out test corpus, including merge-ranking and vocab serialization."
Extensions
Train your own vocab from scratch on a domain corpus; compare to SentencePiece / Unigram.
Output
A trained checkpoint (~50 MB), W&B run with loss curves, generated samples that read like coherent toddler stories.
How to Test
Train loss < 2.0, val perplexity < 8 on TinyStories val; generated stories are grammatical.
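The two thresholds are linked: perplexity is the exponential of the mean per-token cross-entropy. The sketch below assumes the loss is measured in nats (natural log), which is what a standard cross-entropy loss reports. The function names are illustrative.

```python
import math


def perplexity(mean_loss_nats):
    """Perplexity from mean per-token cross-entropy in nats."""
    return math.exp(mean_loss_nats)


def loss_for_perplexity(ppl):
    """Inverse mapping: the mean loss that corresponds to a given perplexity."""
    return math.log(ppl)


print(f"perplexity 8 corresponds to loss {loss_for_perplexity(8):.3f}")
print(f"loss 2.0 corresponds to perplexity {perplexity(2.0):.2f}")
```

A validation perplexity below 8 therefore means a validation loss below about 2.08. Per-token perplexity depends on the tokenizer, so numbers are only comparable between runs that share a vocabulary.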
Talking Points
Why TinyStories is the ideal "real" pretraining smoke test. Loss curve diagnostics (saturated, diverging, oscillating). Why warmup matters for AdamW + transformers.
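To make the warmup point concrete, here is a minimal sketch of a linear-warmup plus cosine-decay schedule. The peak and floor learning rates and the step counts are placeholder assumptions, not values taken from this project.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=200, total_steps=5000):
    """Learning rate at `step`: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp up from near zero so early, noisy gradient statistics
        # do not produce oversized AdamW updates.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 100, 200, 2600, 5000):
    print(step, f"{lr_at(step):.2e}")
```

In a training loop you would call this once per optimizer step and write the result into each parameter group's learning rate.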
Resume Bullet
"Pre-trained a 28M-parameter decoder-only transformer from scratch on a 200 MB TinyStories slice using a custom BPE tokenizer, mixed-precision BF16, cosine LR schedule, and gradient accumulation; achieved val perplexity 6.9 in 4.2 GPU-hours on a single A100."
Extensions
Scale to 124M (GPT-2 small) on Lambda Labs spot for ~$10; add Chinchilla-optimal compute estimate.
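A compute estimate for the 124M run can be sketched with two widely used approximations: training FLOPs C ≈ 6·N·D for N parameters and D tokens, and the Chinchilla rule of thumb of roughly 20 tokens per parameter. The 40% hardware utilisation below is an assumption, and 312 TFLOP/s is the A100's published dense BF16 peak. Treat the output as an order-of-magnitude guide.

```python
def chinchilla_estimate(n_params, tokens_per_param=20.0):
    """Return (compute-optimal tokens, training FLOPs) under C = 6 * N * D."""
    tokens = tokens_per_param * n_params
    flops = 6.0 * n_params * tokens
    return tokens, flops


def gpu_hours(flops, peak_flops_per_s, utilisation):
    """Wall-clock GPU-hours at a given peak throughput and utilisation."""
    return flops / (peak_flops_per_s * utilisation) / 3600.0


tokens, flops = chinchilla_estimate(124e6)
hours = gpu_hours(flops, peak_flops_per_s=312e12, utilisation=0.4)  # utilisation is assumed
print(f"tokens: {tokens:.3g}, FLOPs: {flops:.3g}, A100-hours: {hours:.1f}")
```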
Steps
1) Wrap forward in autocast(dtype=torch.bfloat16). 2) Implement grad accumulation over N micro-steps. 3) nn.utils.clip_grad_norm_(model.parameters(), 1.0). 4) Atomic checkpoint save (write to a temp file → fsync → rename). 5) Resumable training (load model + optimizer + RNG + step).
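The atomic-save step is the one most often done wrong, so here is a minimal stdlib sketch of the write, fsync, rename pattern. It operates on raw bytes to stay framework-independent. atomic_write_bytes is a hypothetical helper, and the sketch assumes the temp file sits in the same directory as the target so the rename stays on one filesystem.

```python
import os
import tempfile


def atomic_write_bytes(path, payload):
    """Write `payload` to `path` so readers never observe a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the data to disk before the rename
        os.replace(tmp_path, path)  # atomically swap the new file into place
    except BaseException:
        # Clean up the temp file; the previous checkpoint stays untouched.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "ckpt.bin")
    atomic_write_bytes(target, b"step-100")
    atomic_write_bytes(target, b"step-200")  # overwrites without a torn state
    print(open(target, "rb").read())
```

With PyTorch, one way to produce the payload is to call torch.save on the checkpoint dict with an io.BytesIO buffer and pass the buffer's contents to this function. If the process dies mid-write, the last good checkpoint is still there.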
Stack
PyTorch
Output
A reusable trainer.py used by Phase 6 too.
How to Test
A run resumed from a checkpoint matches the loss of an uninterrupted run to within 1e-4 at every subsequent step.
Talking Points
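Why BF16 doesn't need GradScaler (it keeps FP32's 8-bit exponent, so its dynamic range is far wider than FP16's and gradients rarely underflow). Why we save optimizer state (AdamW's moment estimates are part of the training state). Effective batch size = micro-batch × accum × world_size.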
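The accumulation and clipping logic can be checked without a GPU. The sketch below is framework-free and uses a one-parameter linear model with MSE loss, both illustrative assumptions. It shows that dividing each micro-batch gradient by the number of accumulation steps reproduces the full-batch gradient when micro-batches are equal in size. The clipping helper mirrors the usual global-norm rule, with a small epsilon in the denominator. All function names are illustrative.

```python
import math


def mse_grad(w, xs, ys):
    """d/dw of mean((w * x - y) ** 2) over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w, xs, ys, accum_steps):
    """Sum micro-batch gradients, each scaled by 1 / accum_steps."""
    micro = len(xs) // accum_steps
    grad = 0.0
    for k in range(accum_steps):
        chunk = slice(k * micro, (k + 1) * micro)
        grad += mse_grad(w, xs[chunk], ys[chunk]) / accum_steps
    return grad


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients down together if their joint L2 norm exceeds max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads]


def effective_batch_size(micro_batch, accum_steps, world_size):
    return micro_batch * accum_steps * world_size


xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
print(mse_grad(0.5, xs, ys), accumulated_grad(0.5, xs, ys, accum_steps=2))
print(effective_batch_size(micro_batch=8, accum_steps=4, world_size=2))
```

Forgetting the division by accum_steps is a common bug: the loss curve looks plausible, but the effective learning rate is multiplied by the number of micro-steps.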
Resume Bullet
"Authored a production-grade PyTorch training loop with BF16 mixed precision, gradient accumulation, atomic checkpointing, and reproducible resume; verified that resumed runs replay the uninterrupted loss to within 1e-4."
Extensions
Add gradient checkpointing (activation recomputation) — relevant to Phase 10.
Steps
1) Implement each as a stateless function operating on logits. 2) Generate 50 samples per strategy from your nanoGPT. 3) Compute distinct-n metrics. 4) Plot quality (manual rating) vs diversity.
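For step 3, distinct-n is usually computed as the number of unique n-grams divided by the total number of n-grams across all samples. A minimal sketch, assuming each sample is already a list of tokens (words or token IDs):

```python
def distinct_n(samples, n):
    """Unique n-grams divided by total n-grams, pooled over all samples."""
    unique, total = set(), 0
    for tokens in samples:
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0


samples = [
    "the cat sat on the mat".split(),
    "the cat sat on the rug".split(),
]
print(distinct_n(samples, 1), distinct_n(samples, 2))
```

Values near 1.0 mean almost no repetition; values near 0 mean the strategy keeps producing the same phrases. Compare strategies at equal sample length, since longer samples tend to score lower.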
Stack
PyTorch
Output
sampling.py + a comparison report.
How to Test
Greedy is deterministic; high temperature increases entropy of next-token distribution; top-p with p=1.0 reduces to multinomial.
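These three properties can be checked on a toy logit vector before touching the model. The sketch below uses plain Python lists in place of tensors, and its filters operate on probabilities rather than raw logits; both are simplifications. The function names are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with the usual max-subtraction for stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def _renormalise(probs, keep):
    keep = set(keep)
    mass = sum(probs[i] for i in keep)
    return [probs[i] / mass if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep the k most probable tokens and renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return _renormalise(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest high-probability prefix whose mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return _renormalise(probs, keep)


def greedy(probs):
    return max(range(len(probs)), key=probs.__getitem__)


logits = [2.0, 1.0, 0.5, -1.0]
probs = softmax(logits)
print(entropy(softmax(logits, 0.5)), entropy(probs), entropy(softmax(logits, 2.0)))
print(top_p_filter(probs, 1.0) == probs or "equal up to float rounding")
```

Sampling from the filtered distribution is then a single weighted draw, for example with random.choices.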
Talking Points
Why temperature alone is insufficient (rare tokens still leak). Why top-p is preferable to top-k for variable-entropy distributions (the kept set grows and shrinks with the model's confidence). When beam search hurts (open-ended generation).
Resume Bullet
"Implemented seven LLM decoding strategies (greedy, multinomial, temperature, top-k, top-p, beam, contrastive) with quantitative diversity-vs-coherence comparison on a 28M-param model."
Extensions
Implement speculative decoding (preview of Phase 9); implement constrained decoding with grammar (Outlines / lm-format-enforcer).