Phase 10 — Distributed Training & Pretraining Data

Difficulty: ⭐⭐⭐⭐⭐ | Estimated Time: 2 weeks Roles supported: Pretraining Data Engineer, ML Infrastructure Engineer, Research Engineer Pretraining.

Why This Phase Exists

Anthropic's Pretraining Research Engineer role asks for "experience with distributed training" and "data pipeline engineering at scale". You will not have access to a thousand-GPU cluster — but you can demonstrate the principles with a 2-GPU FSDP run (rentable for a few dollars) and a real CommonCrawl-style data pipeline on 10–50 GB.

That is enough to answer the interview questions credibly and show production-quality artifacts.

Concepts

Distributed Training

Data Parallelism (DP / DDP) — replicate model, shard data
Fully Sharded Data Parallel (FSDP) — shard parameters, gradients, optimizer state
ZeRO-1 / ZeRO-2 / ZeRO-3 mapping to FSDP
Tensor Parallelism (Megatron-style) — overview
Pipeline Parallelism — overview
3D parallelism composition
NCCL collectives: all-reduce, all-gather, reduce-scatter
Gradient checkpointing / activation recomputation
Mixed precision strategies in distributed setting
Communication-computation overlap

Pretraining Data

Source mixing: CommonCrawl, Wikipedia, books, code, papers
Quality filtering: language ID, perplexity-based, FastText classifier, heuristics (length, symbol ratio, gibberish detection)
Deduplication: exact, MinHash-LSH (near-dup), suffix array (SimHash overview)
Sequence packing for tokenization
Sharding strategy & shuffling
Contamination check against eval sets
Tokenization at scale (parallel)
Data ordering (curriculum) — overview

Labs

Lab 01 — DDP & FSDP Hands-On

Field	Value
Goal	Run a real multi-GPU training experiment with DDP and FSDP; understand what is sharded.
Concepts	Distributed initialization, NCCL backend, gradient synchronization, FSDP wrap policy, mixed precision in distributed setting.
Steps	1) Take your Phase 5 nanoGPT trainer. 2) Wrap with `torch.nn.parallel.DistributedDataParallel`. 3) Launch via `torchrun --nproc_per_node=2`. 4) Verify gradients sync (compare with single-GPU). 5) Switch to `torch.distributed.fsdp.FullyShardedDataParallel` with `ShardingStrategy.FULL_SHARD`. 6) Measure peak memory per rank.
Stack	PyTorch FSDP, NCCL; rent 2× T4 / A10 / A100 on Lambda / RunPod
Output	Two-rank training run with W&B logs + a memory-comparison table (DDP vs FSDP).
How to Test	Loss curves of DDP vs single-GPU should match within numerical noise; FSDP per-rank memory should be roughly half DDP for large models.
Talking Points	What FSDP shards (params + grads + opt state) and when to use ZeRO-3. NCCL all-reduce vs reduce-scatter+all-gather (FSDP's pattern). Communication overlap with backward.
Resume Bullet	"Migrated a from-scratch nanoGPT trainer from single-GPU to 2× A100 FSDP (`FULL_SHARD`); verified loss-curve equivalence and demonstrated 47% per-rank memory reduction enabling 2.1× larger effective model."
Extensions	Add gradient checkpointing; profile with `torch.profiler` + Nsight; try DeepSpeed ZeRO-3 for comparison.

Lab 02 — Pretraining Data Pipeline (Dedup + Filter + Tokenize)

Field	Value
Goal	Build a real pretraining data pipeline processing 10+ GB of raw web text into clean, deduped, tokenized shards.
Concepts	Source ingestion (WET files), language ID, quality filtering, MinHash-LSH near-dup, tokenization at scale, sharding.
Steps	1) Download a few CommonCrawl WET shards (~10 GB). 2) Parse with `warcio`. 3) Language-filter with `fasttext` lid. 4) Quality filter with heuristics (length, symbol ratio, repetition). 5) MinHash-LSH dedup with `datasketch`. 6) Tokenize with your Phase 5 BPE in parallel. 7) Write to `.bin` shards. 8) Produce a pipeline report (input bytes → output tokens, drop rate per stage).
Stack	`warcio`, `fasttext`, `datasketch` (MinHash), `polars` or `dask`, `multiprocessing`
Datasets	CommonCrawl WET shards — pick a few from the latest crawl
Output	A Snakemake or Prefect DAG, training-ready binary shards, a pipeline report.
How to Test	Token counts match expected; spot-check 100 random documents for quality; dedup actually removes duplicates (insert known dups, verify removal).
Talking Points	Why MinHash-LSH (sublinear near-dup detection). Why FastText lid. Why heuristic filters > learned filters at this scale (cheap + good enough). Source-mixing strategy (Pile, RedPajama recipes).
Resume Bullet	"Built a CommonCrawl pretraining data pipeline (warcio → FastText lid → quality heuristics → MinHash-LSH dedup → BPE tokenization) processing 12 GB of WET into 3.8 GB of training-ready tokens with reproducible Snakemake DAG and per-stage drop-rate report."
Extensions	Add a perplexity-based quality filter using your Phase 5 model; add a contamination check against MMLU/HellaSwag test sets.

Lab 03 — Checkpointing & Resumability

Field	Value
Goal	Build production-grade checkpointing for distributed training.
Concepts	Sharded vs full checkpoints, async checkpointing, atomic writes, RNG state, dataloader state.
Steps	1) Use FSDP `state_dict_type` to save sharded checkpoints. 2) Save optimizer + RNG + dataloader step. 3) Verify resume produces identical loss to uninterrupted run. 4) Add periodic + best + final checkpoint logic.
Stack	PyTorch FSDP, your Phase 5/10 trainer
Output	A `checkpoint.py` module + a resume-determinism test report.
How to Test	Resumed loss within 1e-4 of original.
Talking Points	Why sharded checkpoints (storage IO scales). Async checkpointing (overlap save with training).
Resume Bullet	"Implemented FSDP sharded checkpointing with RNG + dataloader state preservation; verified bit-reproducible resume on a multi-rank training job."
Extensions	Add cloud-storage upload (S3 / GCS) with multipart + retries.

Lab 04 — Observability & Monitoring for LLM Systems

Field	Value
Goal	Add structured observability to your Phase 9 inference server.
Concepts	OpenTelemetry traces, token-level metrics, request lifecycle, drift detection.
Steps	1) Instrument FastAPI with OpenTelemetry. 2) Emit per-request: TTFT, TPOT, total tokens, queue time, GPU utilization. 3) Export to Prometheus. 4) Build Grafana dashboard. 5) Add a daily eval-in-prod job (run a small canary eval set against the deployed model and alert on regression).
Stack	OpenTelemetry, Prometheus, Grafana, your Phase 9 server
Output	A live dashboard + alerting rules + a canary-eval cron.
How to Test	Trigger a regression (swap in a worse model) and verify alert fires.
Talking Points	What to monitor for LLMs that classical APM misses. The drift problem and how to catch it.
Resume Bullet	"Instrumented an LLM inference service with OpenTelemetry traces and Prometheus metrics (TTFT, TPOT, queue depth, KV-cache utilization); built Grafana dashboard and a daily canary-eval regression alert."
Extensions	Add Langfuse for prompt-level tracing; add cost dashboarding ($/req).

Deliverables Checklist

2-GPU FSDP run with W&B logs + memory comparison
CommonCrawl pipeline producing deduped, filtered, tokenized shards
Sharded resumable checkpointing
Inference observability stack with canary eval

Interview Relevance

"Walk me through ZeRO-3 / FSDP"
"How would you build a pretraining data pipeline?"
"What are the bottlenecks in distributed training?"
"How would you monitor an LLM in production?"

LLM Inference Engineer