Capstone 01 — Mini-GPT Pretraining (100M params on 1B tokens)

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 2–4 weeks

Demonstrates end-to-end ownership of a real pretraining run: data prep → distributed training → eval → publishable artifact.


Goals

  1. Pretrain a ~100M-parameter decoder-only transformer on ~1B tokens of FineWeb-Edu (Chinchilla-optimal: tokens ≈ 20 × params).
  2. Run on multiple GPUs with FSDP (or DDP if 1× GPU fits).
  3. Ship a publishable artifact: a model card, a loss curve, a benchmark table, and a blog-style writeup.
  4. Track everything in Weights & Biases.

The point isn't to beat GPT-2 — it's to demonstrate you can run the entire pipeline competently and articulate every choice.


Architecture

   ┌─────────────────────────────────────────────────────────────┐
   │  Phase 10 lab → produces train_*.bin shards (uint16)       │
   └────────────────────┬────────────────────────────────────────┘
                        ▼
   ┌─────────────────────────────────────────────────────────────┐
   │ FSDP Trainer (PyTorch 2.x)                                 │
   │  - Mini-GPT (12 layers, d=768, 12 heads, ~110M params)     │
   │  - Mixed BF16 + grad checkpointing                         │
   │  - Cosine LR with warmup, AdamW (β=0.9, 0.95), wd=0.1      │
   │  - Grad accumulation → effective batch = 0.5M tokens       │
   │  - Eval every 1k steps on val + lm-eval-harness sample     │
   │  - Checkpoint every 5k steps (best + last)                 │
   └────────────────────┬────────────────────────────────────────┘
                        ▼
   ┌─────────────────────────────────────────────────────────────┐
   │ Eval suite (each checkpoint):                              │
   │  - val loss / perplexity                                   │
   │  - HellaSwag, ARC-Easy, PIQA (likelihood-based)            │
   │  - 5 free-form generations from fixed prompts (qualitative)│
   └─────────────────────────────────────────────────────────────┘

Suggested Stack

ComponentChoiceWhy
FrameworkPyTorch 2.xStandard for research
DistributedFSDP (full-shard)Memory-efficient; fits 100M+ on small GPUs
DataFineWeb-Edu sample-10BTHigh-quality web; HuggingFace HuggingFaceFW/fineweb-edu
Tokenizertiktoken gpt250257 vocab, fits uint16 shards
LoggingWeights & BiasesIndustry standard; free for personal
Evallm-evaluation-harnessReproducible, leaderboard-comparable
Compute4× A100 (cloud) or 2× 4090 (local)~24 GPU-hours for 1B tokens

Deliverables Checklist

  • data/ — preprocessed shards (or pointer to S3 bucket)
  • model.py — your mini-GPT implementation (built on Phase 4 lab)
  • train.py — FSDP training loop with all hyperparameters in a config
  • configs/100m.yaml — exact hyperparameters
  • eval/ — eval harness wrapper that runs HellaSwag/ARC/PIQA per checkpoint
  • MODEL_CARD.md — architecture, data, hyperparameters, intended use, limitations
  • BENCHMARK.md — table of (checkpoint, val_loss, perplexity, HellaSwag, ARC, PIQA)
  • LOSS_CURVE.png — exported from W&B
  • SAMPLES.md — 5 fixed prompts + outputs at each major checkpoint (shows learning trajectory)
  • WRITEUP.md — blog-style, ~2k words: motivation, choices, surprises, what you'd do differently
  • HuggingFace upload (optional but high signal): publish the final checkpoint with the model card

Resume Bullet Pattern

Pretrained a 110M-parameter decoder-only transformer on 1B tokens of FineWeb-Edu using PyTorch FSDP across 4× A100 GPUs. Achieved Chinchilla-optimal final val loss of 3.2 with reproducible eval suite (HellaSwag 0.34, ARC-E 0.45). Published model + writeup + W&B run. [link]


Interview Talking Points

  • Chinchilla compute-optimality: why tokens ≈ 20× params and what happens when you violate it (over- vs under-trained).
  • FSDP vs DDP vs ZeRO-3: parameter sharding strategies, communication volume, trade-offs.
  • Mixed precision: BF16 vs FP16: dynamic range, GradScaler, why BF16 won on Ampere+.
  • Learning rate schedule: why cosine, why warmup, how you tuned lr_max.
  • Activation checkpointing: when it pays off (memory-bound) vs not (compute-bound).
  • Eval quirks: likelihood scoring, length normalization, comparability across models.
  • What you'd change with 10× compute: bigger model, longer context, RoPE, SwiGLU, FlashAttention-2.

Getting Started

  1. Run Phase-10 lab-02 end-to-end on a 10 GB CommonCrawl WET sample. Verify your shards load.
  2. Switch to FineWeb-Edu sample-10BT for the real run (already filtered/deduped).
  3. Implement FSDP wrapper: FullyShardedDataParallel(model, auto_wrap_policy=transformer_auto_wrap_policy(...)).
  4. Run a 100-step smoke test on a single GPU at full config; verify loss decreases.
  5. Scale to multi-GPU: torchrun --nproc_per_node=4 train.py. Verify per-GPU memory and throughput.
  6. Tune lr_max with a learning-rate range test (small model, sweep across 1e-5 → 1e-2).
  7. Launch the full run. Monitor W&B. Don't touch it for 24 hours.
  8. Run eval suite at each saved checkpoint. Build BENCHMARK.md.
  9. Write up what surprised you. Most interviews ask precisely this.