Phase 10 — Distributed Training & Pretraining Data

Difficulty: ⭐⭐⭐⭐⭐ | Estimated Time: 2 weeks
Roles supported: Pretraining Data Engineer, ML Infrastructure Engineer, Research Engineer (Pretraining).


Why This Phase Exists

Anthropic's Pretraining Research Engineer role asks for "experience with distributed training" and "data pipeline engineering at scale". You will not have access to a thousand-GPU cluster — but you can demonstrate the principles with a 2-GPU FSDP run (rentable for a few dollars) and a real CommonCrawl-style data pipeline on 10–50 GB.

That is enough to answer the interview questions credibly and show production-quality artifacts.


Concepts

Distributed Training

  • Data Parallelism (DP / DDP) — replicate model, shard data
  • Fully Sharded Data Parallel (FSDP) — shard parameters, gradients, optimizer state
  • ZeRO-1 / ZeRO-2 / ZeRO-3 mapping to FSDP
  • Tensor Parallelism (Megatron-style) — overview
  • Pipeline Parallelism — overview
  • 3D parallelism composition
  • NCCL collectives: all-reduce, all-gather, reduce-scatter
  • Gradient checkpointing / activation recomputation
  • Mixed precision strategies in a distributed setting
  • Communication-computation overlap
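
The ZeRO stages listed above are easiest to compare as arithmetic. The sketch below is a rough estimate, not a profiler: it assumes mixed-precision Adam with the commonly quoted 2 + 2 + 12 bytes per parameter (16-bit weights, 16-bit gradients, and fp32 master weights plus two Adam moments), and it ignores activations, buffers, and fragmentation. The function name is hypothetical. In PyTorch FSDP terms, NO_SHARD behaves roughly like stage 0 (plain DDP), SHARD_GRAD_OP like ZeRO-2, and FULL_SHARD like ZeRO-3.

```python
# Assumed byte costs per parameter for mixed-precision Adam (an estimate only).
PARAM_BYTES = 2   # 16-bit weights
GRAD_BYTES = 2    # 16-bit gradients
OPT_BYTES = 12    # fp32 master weights + Adam first and second moments


def model_state_bytes_per_rank(n_params: int, world_size: int, stage: int) -> float:
    """Approximate per-rank model-state memory for ZeRO stage 0, 1, 2, or 3.

    Stage 0 replicates everything (DDP). Stage 1 shards optimizer state,
    stage 2 also shards gradients, stage 3 also shards parameters.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if world_size < 1:
        raise ValueError("world_size must be >= 1")
    params = PARAM_BYTES * n_params / (world_size if stage >= 3 else 1)
    grads = GRAD_BYTES * n_params / (world_size if stage >= 2 else 1)
    opt = OPT_BYTES * n_params / (world_size if stage >= 1 else 1)
    return params + grads + opt


if __name__ == "__main__":
    n = 1_000_000_000  # a hypothetical 1B-parameter model
    for stage in range(4):
        gb = model_state_bytes_per_rank(n, world_size=2, stage=stage) / 1e9
        print(f"stage {stage}: {gb:.1f} GB per rank")
```

On two ranks the model state drops from 16 bytes per parameter to 8 under full sharding, which is where the "roughly half" expectation for a 2-GPU FSDP run comes from.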

Pretraining Data

  • Source mixing: CommonCrawl, Wikipedia, books, code, papers
  • Quality filtering: language ID, perplexity-based, FastText classifier, heuristics (length, symbol ratio, gibberish detection)
  • Deduplication: exact, MinHash-LSH (near-dup), suffix array (exact substring), SimHash (overview)
  • Sequence packing of tokenized documents into fixed-length training sequences
  • Sharding strategy & shuffling
  • Contamination check against eval sets
  • Tokenization at scale (parallel)
  • Data ordering (curriculum) — overview
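
Of the items above, sequence packing is the one most easily shown in a few lines. This is a minimal sketch under stated assumptions: documents are already tokenized into integer ids, an end-of-sequence id separates them, and the trailing partial block is dropped. The function name and the `eos_id` value are illustrative, and the sketch does not mask attention across document boundaries.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    docs: Iterable[List[int]], seq_len: int, eos_id: int
) -> Iterator[List[int]]:
    """Concatenate tokenized documents and cut the stream into fixed-length blocks.

    Each document is followed by eos_id so the model can see where one
    document ends. Blocks may span document boundaries. Leftover tokens
    that do not fill a final block are dropped.
    """
    if seq_len < 1:
        raise ValueError("seq_len must be >= 1")
    buffer: List[int] = []
    for doc in docs:
        buffer.extend(doc)
        buffer.append(eos_id)
        # Emit as many full blocks as the buffer currently holds.
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]


if __name__ == "__main__":
    toy_docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
    for block in pack_sequences(toy_docs, seq_len=4, eos_id=0):
        print(block)
```

Packing avoids padding waste: every emitted block is completely filled with real tokens, at the cost of occasionally splitting a document across two blocks.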

Labs

Lab 01 — DDP & FSDP Hands-On

Goal: Run a real multi-GPU training experiment with DDP and FSDP; understand what is sharded.
Concepts: Distributed initialization, NCCL backend, gradient synchronization, FSDP wrap policy, mixed precision in a distributed setting.
Steps:
  1) Take your Phase 5 nanoGPT trainer.
  2) Wrap with torch.nn.parallel.DistributedDataParallel.
  3) Launch via torchrun --nproc_per_node=2.
  4) Verify gradients sync (compare with single-GPU).
  5) Switch to torch.distributed.fsdp.FullyShardedDataParallel with ShardingStrategy.FULL_SHARD.
  6) Measure peak memory per rank.
Stack: PyTorch FSDP, NCCL; rent 2× T4 / A10 / A100 on Lambda / RunPod
Output: Two-rank training run with W&B logs + a memory-comparison table (DDP vs FSDP).
How to Test: Loss curves of DDP vs single-GPU should match within numerical noise when the global batch size and seed are the same. With FULL_SHARD on 2 ranks, per-rank model-state memory (parameters, gradients, optimizer state) should be roughly half that of DDP; activation memory is not sharded, so the saving shows most clearly for large models.
Talking Points: What FSDP shards (params + grads + opt state) and when to use ZeRO-3. NCCL all-reduce vs reduce-scatter + all-gather (FSDP's pattern). Communication overlap with backward.
Resume Bullet: "Migrated a from-scratch nanoGPT trainer from single-GPU to 2× A100 FSDP (FULL_SHARD); verified loss-curve equivalence and demonstrated 47% per-rank memory reduction enabling 2.1× larger effective model."
Extensions: Add gradient checkpointing; profile with torch.profiler + Nsight; try DeepSpeed ZeRO-3 for comparison.
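
The talking point about all-reduce versus reduce-scatter + all-gather can be checked without any GPUs. The sketch below is a pure-Python simulation of the three collectives on lists of floats; it does not call NCCL or torch.distributed, and every function here is an illustrative stand-in. It shows that a reduce-scatter followed by an all-gather leaves each rank with the same result as a single all-reduce, which is why FSDP can keep only a reduced gradient shard per rank and gather full parameters only when needed.

```python
from typing import List

Vector = List[float]


def _elementwise_sum(per_rank: List[Vector]) -> Vector:
    """Sum the ranks' vectors position by position."""
    return [sum(values) for values in zip(*per_rank)]


def all_reduce(per_rank: List[Vector]) -> List[Vector]:
    """Every rank ends up with the full element-wise sum."""
    total = _elementwise_sum(per_rank)
    return [list(total) for _ in per_rank]


def reduce_scatter(per_rank: List[Vector]) -> List[Vector]:
    """Rank r ends up with only the r-th chunk of the element-wise sum."""
    world = len(per_rank)
    length = len(per_rank[0])
    if length % world != 0:
        raise ValueError("vector length must be divisible by world size")
    chunk = length // world
    total = _elementwise_sum(per_rank)
    return [total[r * chunk:(r + 1) * chunk] for r in range(world)]


def all_gather(shards: List[Vector]) -> List[Vector]:
    """Every rank ends up with the concatenation of all ranks' shards."""
    full = [x for shard in shards for x in shard]
    return [list(full) for _ in shards]


if __name__ == "__main__":
    # Two simulated ranks, each holding local gradients for 4 parameters.
    grads = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]]
    print("all_reduce:    ", all_reduce(grads))
    print("reduce_scatter:", reduce_scatter(grads))
    print("then gather:   ", all_gather(reduce_scatter(grads)))
```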

Lab 02 — Pretraining Data Pipeline (Dedup + Filter + Tokenize)

Goal: Build a real pretraining data pipeline processing 10+ GB of raw web text into clean, deduped, tokenized shards.
Concepts: Source ingestion (WET files), language ID, quality filtering, MinHash-LSH near-dup, tokenization at scale, sharding.
Steps:
  1) Download a few CommonCrawl WET shards (~10 GB).
  2) Parse with warcio.
  3) Language-filter with fasttext lid.
  4) Quality filter with heuristics (length, symbol ratio, repetition).
  5) MinHash-LSH dedup with datasketch.
  6) Tokenize with your Phase 5 BPE in parallel.
  7) Write to .bin shards.
  8) Produce a pipeline report (input bytes → output tokens, drop rate per stage).
Stack: warcio, fasttext, datasketch (MinHash), polars or dask, multiprocessing
Datasets: CommonCrawl WET shards (pick a few from the latest crawl)
Output: A Snakemake or Prefect DAG, training-ready binary shards, a pipeline report.
How to Test: Token counts match expected; spot-check 100 random documents for quality; dedup actually removes duplicates (insert known dups, verify removal).
Talking Points: Why MinHash-LSH (sublinear near-dup detection). Why FastText lid. Why heuristic filters > learned filters at this scale (cheap + good enough). Source-mixing strategy (Pile, RedPajama recipes).
Resume Bullet: "Built a CommonCrawl pretraining data pipeline (warcio → FastText lid → quality heuristics → MinHash-LSH dedup → BPE tokenization) processing 12 GB of WET into 3.8 GB of training-ready tokens with reproducible Snakemake DAG and per-stage drop-rate report."
Extensions: Add a perplexity-based quality filter using your Phase 5 model; add a contamination check against MMLU/HellaSwag test sets.
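
To make the MinHash-LSH step and its "insert known dups, verify removal" test concrete, here is a standard-library sketch of the idea. It is a stand-in for illustration and does not use or mirror the datasketch API. The shingle size, number of bands and rows, the 0.7 threshold, and all function names are assumptions chosen for the example.

```python
import hashlib
import random
from collections import defaultdict
from typing import Callable, Dict, List, Set, Tuple

PRIME = (1 << 61) - 1  # modulus for the universal hash family
Signature = Tuple[int, ...]


def shingles(text: str, k: int = 5) -> Set[int]:
    """Hash every run of k consecutive words to a 64-bit integer."""
    words = text.lower().split()
    grams = {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}
    return {
        int.from_bytes(hashlib.blake2b(g.encode(), digest_size=8).digest(), "big")
        for g in grams
    }


def make_signer(num_perm: int, seed: int = 0) -> Callable[[str], Signature]:
    """Build a MinHash function using num_perm random linear hashes."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, PRIME), rng.randrange(PRIME)) for _ in range(num_perm)]

    def sign(text: str) -> Signature:
        hashed = shingles(text)
        return tuple(min((a * h + b) % PRIME for h in hashed) for a, b in coeffs)

    return sign


def dedup(docs: List[str], bands: int = 32, rows: int = 4,
          threshold: float = 0.7) -> List[int]:
    """Return indices of kept documents, dropping near-dups of earlier kept ones."""
    sign = make_signer(bands * rows)
    buckets: Dict[Tuple[int, Signature], List[int]] = defaultdict(list)
    kept_sigs: Dict[int, Signature] = {}
    kept: List[int] = []
    for idx, doc in enumerate(docs):
        sig = sign(doc)
        keys = [(b, sig[b * rows:(b + 1) * rows]) for b in range(bands)]
        # LSH: only documents sharing at least one band are compared.
        candidates = {j for key in keys for j in buckets.get(key, [])}
        # Fraction of matching signature slots estimates Jaccard similarity.
        if any(
            sum(x == y for x, y in zip(sig, kept_sigs[j])) / len(sig) >= threshold
            for j in candidates
        ):
            continue
        kept.append(idx)
        kept_sigs[idx] = sig
        for key in keys:
            buckets[key].append(idx)
    return kept
```

Banding is what keeps this sublinear in practice: a new document is compared only against earlier documents that share a bucket, not against the whole corpus.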

Lab 03 — Checkpointing & Resumability

Goal: Build production-grade checkpointing for distributed training.
Concepts: Sharded vs full checkpoints, async checkpointing, atomic writes, RNG state, dataloader state.
Steps:
  1) Use FSDP state_dict_type (StateDictType.SHARDED_STATE_DICT) to save sharded checkpoints.
  2) Save optimizer + RNG + dataloader step.
  3) Verify that a resumed run reproduces the loss of an uninterrupted run.
  4) Add periodic + best + final checkpoint logic.
Stack: PyTorch FSDP, your Phase 5/10 trainer
Output: A checkpoint.py module + a resume-determinism test report.
How to Test: Resumed loss within 1e-4 of the uninterrupted run at every logged step.
Talking Points: Why sharded checkpoints (each rank writes only its own shard, so storage I/O scales with the number of ranks). Async checkpointing (overlap save with training).
Resume Bullet: "Implemented FSDP sharded checkpointing with RNG + dataloader state preservation; verified that resumed runs match the uninterrupted loss curve within 1e-4 on a multi-rank training job."
Extensions: Add cloud-storage upload (S3 / GCS) with multipart + retries.
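
Two of the concepts above, atomic writes and RNG state, can be sketched with the standard library alone. This is an illustration rather than the FSDP checkpoint path: it pickles a plain dict, uses Python's `random` module as a stand-in for the trainer's random number generators, and all function names are hypothetical. In a real trainer the saved state would presumably also include values such as `torch.get_rng_state()` and `torch.cuda.get_rng_state_all()`.

```python
import os
import pickle
import random
import tempfile
from typing import Any, Dict, List


def save_checkpoint_atomic(state: Dict[str, Any], path: str) -> None:
    """Write to a temp file in the target directory, fsync, then rename.

    A crash mid-write leaves the previous checkpoint untouched, because
    os.replace swaps the file in a single step on the same filesystem.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str) -> Dict[str, Any]:
    """Read a checkpoint written by save_checkpoint_atomic."""
    with open(path, "rb") as f:
        return pickle.load(f)


def train_steps(rng: random.Random, start: int, stop: int) -> List[float]:
    """Stand-in for a training loop: one pseudo-random 'loss' per step."""
    return [rng.random() for _ in range(start, stop)]


if __name__ == "__main__":
    rng = random.Random(0)
    train_steps(rng, 0, 4)
    with tempfile.TemporaryDirectory() as d:
        ckpt_path = os.path.join(d, "ckpt.pkl")
        save_checkpoint_atomic({"step": 4, "rng_state": rng.getstate()}, ckpt_path)
        print("saved step", load_checkpoint(ckpt_path)["step"])
```

The resume test then amounts to restoring `rng_state` into a fresh generator, continuing from the saved step, and comparing the combined trace against an uninterrupted run.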

Lab 04 — Observability & Monitoring for LLM Systems

Goal: Add structured observability to your Phase 9 inference server.
Concepts: OpenTelemetry traces, token-level metrics, request lifecycle, drift detection.
Steps:
  1) Instrument FastAPI with OpenTelemetry.
  2) Emit per-request TTFT (time to first token), TPOT (time per output token), total tokens, and queue time; emit GPU utilization as a process-level gauge.
  3) Export to Prometheus.
  4) Build Grafana dashboard.
  5) Add a daily eval-in-prod job (run a small canary eval set against the deployed model and alert on regression).
Stack: OpenTelemetry, Prometheus, Grafana, your Phase 9 server
Output: A live dashboard + alerting rules + a canary-eval cron.
How to Test: Trigger a regression (swap in a worse model) and verify the alert fires.
Talking Points: What to monitor for LLMs that classical APM misses. The drift problem and how to catch it.
Resume Bullet: "Instrumented an LLM inference service with OpenTelemetry traces and Prometheus metrics (TTFT, TPOT, queue depth, KV-cache utilization); built Grafana dashboard and a daily canary-eval regression alert."
Extensions: Add Langfuse for prompt-level tracing; add cost dashboarding ($/req).
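
The per-request metrics in the steps above reduce to simple arithmetic over timestamps. The sketch below assumes one common set of definitions: TTFT is measured from the moment the request is enqueued to the first generated token, and TPOT is the mean gap between subsequent tokens. The `RequestTiming` class and its field names are hypothetical, and exporting the values to OpenTelemetry or Prometheus is left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RequestTiming:
    """Timestamps in seconds for one request (hypothetical record shape)."""
    enqueued_at: float        # request accepted by the server
    started_at: float         # request left the queue and began running
    token_times: List[float]  # emission time of each generated token


def latency_metrics(t: RequestTiming) -> Dict[str, float]:
    """Compute queue time, TTFT, TPOT, and token count for one request."""
    if not t.token_times:
        raise ValueError("request produced no tokens")
    n = len(t.token_times)
    ttft = t.token_times[0] - t.enqueued_at
    # TPOT averages over the tokens after the first; undefined for n == 1,
    # reported here as 0.0 by convention.
    tpot = (t.token_times[-1] - t.token_times[0]) / (n - 1) if n > 1 else 0.0
    return {
        "queue_time_s": t.started_at - t.enqueued_at,
        "ttft_s": ttft,
        "tpot_s": tpot,
        "total_tokens": float(n),
    }


if __name__ == "__main__":
    timing = RequestTiming(
        enqueued_at=0.0,
        started_at=0.5,
        token_times=[1.0, 1.25, 1.5, 1.75, 2.0],
    )
    print(latency_metrics(timing))
```

Keeping queue time separate from TTFT matters for diagnosis: a rising TTFT with flat queue time points at prefill cost, while both rising together points at saturation.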

Deliverables Checklist

  • 2-GPU FSDP run with W&B logs + memory comparison
  • CommonCrawl pipeline producing deduped, filtered, tokenized shards
  • Sharded resumable checkpointing
  • Inference observability stack with canary eval

Interview Relevance

  • "Walk me through ZeRO-3 / FSDP"
  • "How would you build a pretraining data pipeline?"
  • "What are the bottlenecks in distributed training?"
  • "How would you monitor an LLM in production?"