Capstone 01 — Mini-GPT Pretraining (100M params on 1B tokens)
Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 2–4 weeks
Demonstrates end-to-end ownership of a real pretraining run: data prep → distributed training → eval → publishable artifact.
Goals
- Pretrain a ~100M-parameter decoder-only transformer on ~1B tokens of FineWeb-Edu (Chinchilla-optimal: tokens ≈ 20 × params).
- Run on multiple GPUs with FSDP (or DDP if 1× GPU fits).
- Ship a publishable artifact: a model card, a loss curve, a benchmark table, and a blog-style writeup.
- Track everything in Weights & Biases.
The point isn't to beat GPT-2 — it's to demonstrate you can run the entire pipeline competently and articulate every choice.
Architecture
┌─────────────────────────────────────────────────────────────┐
│ Phase 10 lab → produces train_*.bin shards (uint16) │
└────────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ FSDP Trainer (PyTorch 2.x) │
│ - Mini-GPT (12 layers, d=768, 12 heads, ~110M params) │
│ - Mixed BF16 + grad checkpointing │
│ - Cosine LR with warmup, AdamW (β=0.9, 0.95), wd=0.1 │
│ - Grad accumulation → effective batch = 0.5M tokens │
│ - Eval every 1k steps on val + lm-eval-harness sample │
│ - Checkpoint every 5k steps (best + last) │
└────────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────────┐
│ Eval suite (each checkpoint): │
│ - val loss / perplexity │
│ - HellaSwag, ARC-Easy, PIQA (likelihood-based) │
│ - 5 free-form generations from fixed prompts (qualitative)│
└─────────────────────────────────────────────────────────────┘
Suggested Stack
| Component | Choice | Why |
|---|---|---|
| Framework | PyTorch 2.x | Standard for research |
| Distributed | FSDP (full-shard) | Memory-efficient; fits 100M+ on small GPUs |
| Data | FineWeb-Edu sample-10BT | High-quality web; HuggingFace HuggingFaceFW/fineweb-edu |
| Tokenizer | tiktoken gpt2 | 50257 vocab, fits uint16 shards |
| Logging | Weights & Biases | Industry standard; free for personal |
| Eval | lm-evaluation-harness | Reproducible, leaderboard-comparable |
| Compute | 4× A100 (cloud) or 2× 4090 (local) | ~24 GPU-hours for 1B tokens |
Deliverables Checklist
-
data/— preprocessed shards (or pointer to S3 bucket) -
model.py— your mini-GPT implementation (built on Phase 4 lab) -
train.py— FSDP training loop with all hyperparameters in a config -
configs/100m.yaml— exact hyperparameters -
eval/— eval harness wrapper that runs HellaSwag/ARC/PIQA per checkpoint -
MODEL_CARD.md— architecture, data, hyperparameters, intended use, limitations -
BENCHMARK.md— table of (checkpoint, val_loss, perplexity, HellaSwag, ARC, PIQA) -
LOSS_CURVE.png— exported from W&B -
SAMPLES.md— 5 fixed prompts + outputs at each major checkpoint (shows learning trajectory) -
WRITEUP.md— blog-style, ~2k words: motivation, choices, surprises, what you'd do differently - HuggingFace upload (optional but high signal): publish the final checkpoint with the model card
Resume Bullet Pattern
Pretrained a 110M-parameter decoder-only transformer on 1B tokens of FineWeb-Edu using PyTorch FSDP across 4× A100 GPUs. Achieved Chinchilla-optimal final val loss of 3.2 with reproducible eval suite (HellaSwag 0.34, ARC-E 0.45). Published model + writeup + W&B run. [link]
Interview Talking Points
- Chinchilla compute-optimality: why tokens ≈ 20× params and what happens when you violate it (over- vs under-trained).
- FSDP vs DDP vs ZeRO-3: parameter sharding strategies, communication volume, trade-offs.
- Mixed precision: BF16 vs FP16: dynamic range, GradScaler, why BF16 won on Ampere+.
- Learning rate schedule: why cosine, why warmup, how you tuned
lr_max. - Activation checkpointing: when it pays off (memory-bound) vs not (compute-bound).
- Eval quirks: likelihood scoring, length normalization, comparability across models.
- What you'd change with 10× compute: bigger model, longer context, RoPE, SwiGLU, FlashAttention-2.
Getting Started
- Run Phase-10 lab-02 end-to-end on a 10 GB CommonCrawl WET sample. Verify your shards load.
- Switch to FineWeb-Edu sample-10BT for the real run (already filtered/deduped).
- Implement FSDP wrapper:
FullyShardedDataParallel(model, auto_wrap_policy=transformer_auto_wrap_policy(...)). - Run a 100-step smoke test on a single GPU at full config; verify loss decreases.
- Scale to multi-GPU:
torchrun --nproc_per_node=4 train.py. Verify per-GPU memory and throughput. - Tune
lr_maxwith a learning-rate range test (small model, sweep across 1e-5 → 1e-2). - Launch the full run. Monitor W&B. Don't touch it for 24 hours.
- Run eval suite at each saved checkpoint. Build BENCHMARK.md.
- Write up what surprised you. Most interviews ask precisely this.