Capstone 01 — Mini-GPT Pretraining (100M params on 1B tokens)

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐⭐ | Time: 2–4 weeks

Demonstrates end-to-end ownership of a real pretraining run: data prep → distributed training → eval → publishable artifact.

Goals

Pretrain a ~100M-parameter decoder-only transformer on ~1B tokens of FineWeb-Edu (Chinchilla-optimal: tokens ≈ 20 × params).
Run on multiple GPUs with FSDP (or DDP if 1× GPU fits).
Ship a publishable artifact: a model card, a loss curve, a benchmark table, and a blog-style writeup.
Track everything in Weights & Biases.

The point isn't to beat GPT-2 — it's to demonstrate you can run the entire pipeline competently and articulate every choice.

Architecture

   ┌─────────────────────────────────────────────────────────────┐
   │  Phase 10 lab → produces train_*.bin shards (uint16)       │
   └────────────────────┬────────────────────────────────────────┘
                        ▼
   ┌─────────────────────────────────────────────────────────────┐
   │ FSDP Trainer (PyTorch 2.x)                                 │
   │  - Mini-GPT (12 layers, d=768, 12 heads, ~110M params)     │
   │  - Mixed BF16 + grad checkpointing                         │
   │  - Cosine LR with warmup, AdamW (β=0.9, 0.95), wd=0.1      │
   │  - Grad accumulation → effective batch = 0.5M tokens       │
   │  - Eval every 1k steps on val + lm-eval-harness sample     │
   │  - Checkpoint every 5k steps (best + last)                 │
   └────────────────────┬────────────────────────────────────────┘
                        ▼
   ┌─────────────────────────────────────────────────────────────┐
   │ Eval suite (each checkpoint):                              │
   │  - val loss / perplexity                                   │
   │  - HellaSwag, ARC-Easy, PIQA (likelihood-based)            │
   │  - 5 free-form generations from fixed prompts (qualitative)│
   └─────────────────────────────────────────────────────────────┘

Suggested Stack

Component	Choice	Why
Framework	PyTorch 2.x	Standard for research
Distributed	FSDP (full-shard)	Memory-efficient; fits 100M+ on small GPUs
Data	FineWeb-Edu sample-10BT	High-quality web; HuggingFace `HuggingFaceFW/fineweb-edu`
Tokenizer	tiktoken `gpt2`	50257 vocab, fits uint16 shards
Logging	Weights & Biases	Industry standard; free for personal
Eval	`lm-evaluation-harness`	Reproducible, leaderboard-comparable
Compute	4× A100 (cloud) or 2× 4090 (local)	~24 GPU-hours for 1B tokens

Deliverables Checklist

data/ — preprocessed shards (or pointer to S3 bucket)
model.py — your mini-GPT implementation (built on Phase 4 lab)
train.py — FSDP training loop with all hyperparameters in a config
configs/100m.yaml — exact hyperparameters
eval/ — eval harness wrapper that runs HellaSwag/ARC/PIQA per checkpoint
MODEL_CARD.md — architecture, data, hyperparameters, intended use, limitations
BENCHMARK.md — table of (checkpoint, val_loss, perplexity, HellaSwag, ARC, PIQA)
LOSS_CURVE.png — exported from W&B
SAMPLES.md — 5 fixed prompts + outputs at each major checkpoint (shows learning trajectory)
WRITEUP.md — blog-style, ~2k words: motivation, choices, surprises, what you'd do differently
HuggingFace upload (optional but high signal): publish the final checkpoint with the model card

Resume Bullet Pattern

Pretrained a 110M-parameter decoder-only transformer on 1B tokens of FineWeb-Edu using PyTorch FSDP across 4× A100 GPUs. Achieved Chinchilla-optimal final val loss of 3.2 with reproducible eval suite (HellaSwag 0.34, ARC-E 0.45). Published model + writeup + W&B run. [link]

Interview Talking Points

Chinchilla compute-optimality: why tokens ≈ 20× params and what happens when you violate it (over- vs under-trained).
FSDP vs DDP vs ZeRO-3: parameter sharding strategies, communication volume, trade-offs.
Mixed precision: BF16 vs FP16: dynamic range, GradScaler, why BF16 won on Ampere+.
Learning rate schedule: why cosine, why warmup, how you tuned lr_max.
Activation checkpointing: when it pays off (memory-bound) vs not (compute-bound).
Eval quirks: likelihood scoring, length normalization, comparability across models.
What you'd change with 10× compute: bigger model, longer context, RoPE, SwiGLU, FlashAttention-2.

Getting Started

Run Phase-10 lab-02 end-to-end on a 10 GB CommonCrawl WET sample. Verify your shards load.
Switch to FineWeb-Edu sample-10BT for the real run (already filtered/deduped).
Implement FSDP wrapper: FullyShardedDataParallel(model, auto_wrap_policy=transformer_auto_wrap_policy(...)).
Run a 100-step smoke test on a single GPU at full config; verify loss decreases.
Scale to multi-GPU: torchrun --nproc_per_node=4 train.py. Verify per-GPU memory and throughput.
Tune lr_max with a learning-rate range test (small model, sweep across 1e-5 → 1e-2).
Launch the full run. Monitor W&B. Don't touch it for 24 hours.
Run eval suite at each saved checkpoint. Build BENCHMARK.md.
Write up what surprised you. Most interviews ask precisely this.

LLM Inference Engineer