06 — Pretraining Data Pipeline (10TB → Tokens)

Roles: Pretraining Data Engineer · Research Engineer Pretraining

1. Requirements

  • Process 10s of TB raw web (CommonCrawl) → tokenized training shards
  • Reproducible & auditable (every token traceable to a source URL)
  • Deduped at scale (URL, exact, near-dup)
  • Quality-filtered; PII-scrubbed; configurable per-source weights
  • Resumable; idempotent

2. Stages

[Raw WARC/WET shards on S3]
        │
        ▼
[Stage 1] Parse + extract (trafilatura/justext for HTML, or use WET)
        │
        ▼
[Stage 2] URL dedup (Bloom filter / RocksDB)
        │
        ▼
[Stage 3] Language ID (fasttext lid.176; keep en + others by quota)
        │
        ▼
[Stage 4] Quality filters
            - Gopher rules (length, mean word len, symbol ratio, repetition)
            - Classifier (FastText: positive=Wikipedia/books, negative=random web)
        │
        ▼
[Stage 5] PII scrub (presidio + regex; emails, phones, SSN)
        │
        ▼
[Stage 6] Near-dup (MinHash LSH @ Jaccard 0.8; SuffixArray for exact spans)
        │
        ▼
[Stage 7] Toxicity / safety filter (configurable threshold)
        │
        ▼
[Stage 8] Tokenize (your custom BPE) → uint16/uint32 .bin shards
        │
        ▼
[Stage 9] Mix + interleave with weights (web 60%, code 20%, books 10%, math 10%)
        │
        ▼
[Final shards on S3, manifest.json with hashes + counts + lineage]

3. Deep Dives

3.1 Distributed Execution

  • Spark / Ray / Dask on a cluster (1000s of vCPU)
  • Shard-parallel: each task processes ≤1 GB
  • Idempotent: writes go to out/{stage}/{shard_id}.parquet; restart skips existing

3.2 MinHash LSH at Scale

  • 128 perms, threshold 0.8
  • Group docs by band; only compare within a bucket
  • For 1B docs: cluster MinHash with 1024 bands → linear pass possible
  • Output: keep one doc per cluster (longest, or earliest crawl date)

3.3 Data-Mix Tuning

  • Ablation runs (small models, fixed compute) sweeping mix weights
  • DSIR / DoReMi for principled mix search
  • Final mix is compute-optimal at the target model size, not the proxy

3.4 PII & Safety

  • Scrub before storage, not just before training
  • Audit: log per-stage drop counts; alert on anomalies
  • Honor takedown requests: source URL → shard ID lookup; rebuild affected shards

3.5 Reproducibility

  • Each shard's manifest: stage version + config hash + input shard ID + count in/out
  • Lineage graph queryable (DataHub / OpenMetadata)
  • Re-running with same configs deterministically reproduces output

4. Observability

  • Per-stage: docs in/out, MB in/out, drop reasons (categorized)
  • Per-shard: language histogram, length distribution, sample 10 docs to S3 for manual spot-check

5. Tradeoffs

ChoiceAltWhen
SparkRay DatasetsSpark for stable batch; Ray when mixing GPU stages
MinHash LSHSimHashMinHash for general dedup; SimHash for short docs
Custom tokenizerGPT-2 BPECustom when target language coverage matters (e.g., code, math, multilingual)
Filter earlyFilter lateAlways filter early — saves all downstream compute