Anthropic's Pretraining Research Engineer role asks for "experience with distributed training" and "data pipeline engineering at scale". You will not have access to a thousand-GPU cluster — but you can demonstrate the principles with a 2-GPU FSDP run (rentable for a few dollars) and a real CommonCrawl-style data pipeline on 10–50 GB.
That is enough to answer the interview questions credibly and show production-quality artifacts.
1) Take your Phase 5 nanoGPT trainer. 2) Wrap with torch.nn.parallel.DistributedDataParallel. 3) Launch via torchrun --nproc_per_node=2. 4) Verify gradients sync (compare with single-GPU). 5) Switch to torch.distributed.fsdp.FullyShardedDataParallel with ShardingStrategy.FULL_SHARD. 6) Measure peak memory per rank.
Two-rank training run with W&B logs + a memory-comparison table (DDP vs FSDP).
How to Test
Loss curves of DDP vs single-GPU should match within numerical noise; FSDP per-rank memory should be roughly half DDP for large models.
Talking Points
What FSDP shards (params + grads + opt state) and when to use ZeRO-3. NCCL all-reduce vs reduce-scatter+all-gather (FSDP's pattern). Communication overlap with backward.
Resume Bullet
"Migrated a from-scratch nanoGPT trainer from single-GPU to 2× A100 FSDP (FULL_SHARD); verified loss-curve equivalence and demonstrated 47% per-rank memory reduction enabling 2.1× larger effective model."
Extensions
Add gradient checkpointing; profile with torch.profiler + Nsight; try DeepSpeed ZeRO-3 for comparison.
Build a real pretraining data pipeline processing 10+ GB of raw web text into clean, deduped, tokenized shards.
Concepts
Source ingestion (WET files), language ID, quality filtering, MinHash-LSH near-dup, tokenization at scale, sharding.
Steps
1) Download a few CommonCrawl WET shards (~10 GB). 2) Parse with warcio. 3) Language-filter with fasttext lid. 4) Quality filter with heuristics (length, symbol ratio, repetition). 5) MinHash-LSH dedup with datasketch. 6) Tokenize with your Phase 5 BPE in parallel. 7) Write to .bin shards. 8) Produce a pipeline report (input bytes → output tokens, drop rate per stage).
Stack
warcio, fasttext, datasketch (MinHash), polars or dask, multiprocessing
A Snakemake or Prefect DAG, training-ready binary shards, a pipeline report.
How to Test
Token counts match expected; spot-check 100 random documents for quality; dedup actually removes duplicates (insert known dups, verify removal).
Talking Points
Why MinHash-LSH (sublinear near-dup detection). Why FastText lid. Why heuristic filters > learned filters at this scale (cheap + good enough). Source-mixing strategy (Pile, RedPajama recipes).
Resume Bullet
"Built a CommonCrawl pretraining data pipeline (warcio → FastText lid → quality heuristics → MinHash-LSH dedup → BPE tokenization) processing 12 GB of WET into 3.8 GB of training-ready tokens with reproducible Snakemake DAG and per-stage drop-rate report."
Extensions
Add a perplexity-based quality filter using your Phase 5 model; add a contamination check against MMLU/HellaSwag test sets.
Build production-grade checkpointing for distributed training.
Concepts
Sharded vs full checkpoints, async checkpointing, atomic writes, RNG state, dataloader state.
Steps
1) Use FSDP state_dict_type to save sharded checkpoints. 2) Save optimizer + RNG + dataloader step. 3) Verify resume produces identical loss to uninterrupted run. 4) Add periodic + best + final checkpoint logic.
Stack
PyTorch FSDP, your Phase 5/10 trainer
Output
A checkpoint.py module + a resume-determinism test report.
How to Test
Resumed loss within 1e-4 of original.
Talking Points
Why sharded checkpoints (storage IO scales). Async checkpointing (overlap save with training).
Resume Bullet
"Implemented FSDP sharded checkpointing with RNG + dataloader state preservation; verified bit-reproducible resume on a multi-rank training job."
Extensions
Add cloud-storage upload (S3 / GCS) with multipart + retries.
1) Instrument FastAPI with OpenTelemetry. 2) Emit per-request: TTFT, TPOT, total tokens, queue time, GPU utilization. 3) Export to Prometheus. 4) Build Grafana dashboard. 5) Add a daily eval-in-prod job (run a small canary eval set against the deployed model and alert on regression).
Stack
OpenTelemetry, Prometheus, Grafana, your Phase 9 server
Output
A live dashboard + alerting rules + a canary-eval cron.
How to Test
Trigger a regression (swap in a worse model) and verify alert fires.
Talking Points
What to monitor for LLMs that classical APM misses. The drift problem and how to catch it.
Resume Bullet
"Instrumented an LLM inference service with OpenTelemetry traces and Prometheus metrics (TTFT, TPOT, queue depth, KV-cache utilization); built Grafana dashboard and a daily canary-eval regression alert."
Extensions
Add Langfuse for prompt-level tracing; add cost dashboarding ($/req).