Phase 1 — Foundations: Text, Math, PyTorch

Difficulty: ⭐⭐☆☆☆ | Estimated Time: 1–2 weeks | Roles supported: All (non-negotiable foundation)


Why This Phase Exists

Every modern LLM stack — from FlashAttention to vLLM — is built on three things: (1) representing text as numbers, (2) doing linear algebra on those numbers efficiently, and (3) using PyTorch's autograd to learn the parameters. If you cannot tokenize a string, build a TF-IDF index, or write a clean PyTorch nn.Module, the rest of the curriculum will collapse under you.

This phase rebuilds the floor.


Concepts

  • Text representation: characters → words → subwords
  • Tokenization: whitespace, regex, byte-level (BPE preview)
  • Vocabulary construction, OOV handling, special tokens
  • Bag-of-words (BoW) and term-document matrices
  • TF-IDF derivation and intuition
  • Cosine similarity, Euclidean distance, dot-product retrieval
  • Sparse vs dense vector representations
  • PyTorch tensors, broadcasting, indexing
  • Autograd: forward, backward, .grad, .detach(), torch.no_grad()
  • CPU/GPU dispatch, .to(device), pinned memory basics
  • Linear algebra refresher: matmul, transpose, einsum, eigendecomposition

Labs

Lab 01 — Tokenization From Scratch

Goal: Build three tokenizers (whitespace, regex, byte-level) and benchmark on a real corpus.
Concepts: Tokenization tradeoffs, vocab construction, OOV, byte fallback, special tokens.
Steps:
  1) Implement WhitespaceTokenizer.encode/decode.
  2) Add a regex tokenizer matching GPT-2's pre-tokenization regex.
  3) Implement a byte-level tokenizer (256-symbol vocab).
  4) Build vocab from a corpus with frequency cutoff.
  5) Round-trip test: decode(encode(s)) == s.
Stack: Python stdlib, regex library
Datasets: Tiny Shakespeare (1 MB), WikiText-2 (12 MB)
Output: A tokenizer.py module with 3 classes, plus a benchmark report (vocab size, compression ratio, encode speed).
How to Test: Round-trip property tests; compare token counts against tiktoken (GPT-2 encoding).
Talking Points: Why byte-level tokenizers can encode any string. Why GPT-2's regex splits contractions. The compression-vs-vocab-size tradeoff.
Resume Bullet: "Implemented three tokenizer variants (whitespace, regex, byte-level) with round-trip-safe encode/decode and benchmarked compression ratio (1.0 → 3.7×) and encode throughput on a 12 MB corpus."
Extensions: Add unicode normalization (NFC/NFKC); plot vocab-size-vs-coverage curves.
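
As a rough illustration of steps 1, 3, 4 and 5, the sketch below pairs a byte-level tokenizer with a whitespace tokenizer that keeps whitespace runs as tokens so that decoding can stay lossless. The `<unk>` token name, the `min_freq` parameter, and the decision to keep whitespace as tokens are assumptions made for this example, not requirements of the lab. The GPT-2 regex tokenizer from step 2 is left out because it depends on the third-party regex library.

```python
import re
from collections import Counter


class ByteTokenizer:
    """Byte-level tokenizer: ids 0-255 are raw UTF-8 bytes, so nothing is OOV."""

    vocab_size = 256

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")


class WhitespaceTokenizer:
    """Word-level tokenizer with a frequency cutoff and an <unk> fallback."""

    UNK = "<unk>"  # hypothetical special-token name, always stored at id 0

    def __init__(self, corpus: str, min_freq: int = 1):
        counts = Counter(self._split(corpus))
        kept = sorted(tok for tok, c in counts.items() if c >= min_freq)
        self.itos = [self.UNK] + kept
        self.stoi = {tok: i for i, tok in enumerate(self.itos)}

    @staticmethod
    def _split(text: str) -> list[str]:
        # Whitespace runs are kept as tokens so "".join(tokens) == text.
        return re.findall(r"\s+|\S+", text)

    def encode(self, text: str) -> list[int]:
        return [self.stoi.get(tok, 0) for tok in self._split(text)]

    def decode(self, ids: list[int]) -> str:
        return "".join(self.itos[i] for i in ids)


corpus = "to be or not to be"
word_tok = WhitespaceTokenizer(corpus)
byte_tok = ByteTokenizer()

sample = "not to be"
assert word_tok.decode(word_tok.encode(sample)) == sample
assert byte_tok.decode(byte_tok.encode("naïve 🙂")) == "naïve 🙂"

# Compression ratio here means UTF-8 bytes per token.
ratio = len(sample.encode("utf-8")) / len(word_tok.encode(sample))
print(f"whitespace tokenizer: {ratio:.2f} bytes/token")
```

Note that the word-level round trip only holds for text whose tokens are all in the vocabulary: any out-of-vocabulary token decodes to `<unk>`. The byte-level tokenizer has no such failure case, which is the point of the "byte fallback" concept listed above.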

Lab 02 — Bag-of-Words & TF-IDF From Scratch

Goal: Implement TF-IDF and a cosine-similarity search engine over a Wikipedia subset, without using sklearn in the implementation.
Concepts: Term frequency, document frequency, sublinear TF, IDF smoothing, sparse matrix construction (CSR), cosine similarity.
Steps:
  1) Build a sparse term-document matrix with scipy.sparse.csr_matrix.
  2) Compute TF (raw + log-normalized).
  3) Compute IDF with smoothing.
  4) L2-normalize rows.
  5) Compute cosine similarity as a sparse dot product of the L2-normalized rows.
  6) Build a top-k search function.
Stack: NumPy, SciPy sparse, regex
Datasets: A 10k-document slice of Wikipedia or 20 Newsgroups
Output: A CLI search.py "your query" --top 5 that returns ranked docs with scores.
How to Test: Query for known topics and validate the results manually. Compare against sklearn's TfidfVectorizer (cosine within 1e-6).
Talking Points: Why IDF uses log. Why we L2-normalize. When TF-IDF beats embeddings (short, exact-match queries; cold start; explainability).
Resume Bullet: "Built a TF-IDF + cosine-similarity search engine over 10k Wikipedia docs from scratch in NumPy/SciPy; query latency P99 under 8 ms; results match sklearn within 1e-6."
Extensions: Add BM25 scoring (used heavily in Phase 7); add query expansion.
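
The six steps above can be sketched roughly as follows. The class name `TfidfIndex`, the lowercase-and-split tokenizer, and the three toy documents are assumptions made for illustration. The IDF uses the smoothed form ln((1 + N) / (1 + df)) + 1, which is what sklearn's TfidfVectorizer applies by default; matching sklearn within 1e-6 would also require using the same tokenization, which this sketch does not attempt.

```python
import numpy as np
import scipy.sparse as sp


def tokenize(text):
    # Placeholder tokenizer (an assumption); the lab uses a regex tokenizer.
    return text.lower().split()


class TfidfIndex:
    def __init__(self, docs, sublinear_tf=False):
        self.sublinear_tf = sublinear_tf
        self.vocab = {}
        rows, cols = [], []
        for i, doc in enumerate(docs):
            for tok in tokenize(doc):
                rows.append(i)
                cols.append(self.vocab.setdefault(tok, len(self.vocab)))
        n_docs, n_terms = len(docs), len(self.vocab)
        # Duplicate (row, col) pairs are summed, which yields raw term counts.
        counts = sp.csr_matrix(
            (np.ones(len(rows)), (rows, cols)), shape=(n_docs, n_terms)
        )
        counts.sum_duplicates()
        # Document frequency: number of stored entries in each column.
        df = np.bincount(counts.indices, minlength=n_terms)
        self.idf = np.log((1 + n_docs) / (1 + df)) + 1.0
        self.matrix = self._weight(counts)

    def _weight(self, counts):
        tf = counts.astype(np.float64)
        if self.sublinear_tf:
            tf.data = 1.0 + np.log(tf.data)  # log-normalized TF
        weighted = (tf @ sp.diags(self.idf)).tocsr()
        sq = weighted.multiply(weighted).sum(axis=1)
        norms = np.sqrt(np.asarray(sq).ravel())
        norms[norms == 0] = 1.0  # leave all-zero rows untouched
        return (sp.diags(1.0 / norms) @ weighted).tocsr()

    def search(self, query, top_k=5):
        cols = np.array(
            [self.vocab[t] for t in tokenize(query) if t in self.vocab], dtype=int
        )
        q_counts = sp.csr_matrix(
            (np.ones(len(cols)), (np.zeros(len(cols), dtype=int), cols)),
            shape=(1, len(self.vocab)),
        )
        q = self._weight(q_counts)
        # Rows are unit length, so the sparse dot product is the cosine.
        scores = (self.matrix @ q.T).toarray().ravel()
        order = np.argsort(-scores)[:top_k]
        return [(int(i), float(scores[i])) for i in order]


docs = ["the cat sat on the mat", "dogs chase cats", "the stock market fell"]
index = TfidfIndex(docs)
for doc_id, score in index.search("cat mat", top_k=2):
    print(doc_id, round(score, 3), docs[doc_id])
```

Query terms that are not in the vocabulary are dropped, so a query made entirely of unseen words scores zero against every document.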

Lab 03 — Cosine Similarity & Retrieval Playground

Goal: Internalize vector similarity by implementing 5 metrics and visualizing failure modes.
Concepts: Cosine vs dot product vs Euclidean, normalization invariants, curse of dimensionality.
Steps:
  1) Implement cosine, dot, Euclidean, Manhattan, Jaccard.
  2) Generate synthetic vectors (Gaussian, sparse, normalized).
  3) Plot pairwise distance distributions.
  4) Show cosine ≡ dot when L2-normalized.
Stack: NumPy, matplotlib
Output: metrics.py + a notebook of histograms.
How to Test: Property tests (cosine in [-1, 1]; symmetry; triangle inequality for the true distance metrics, e.g. Euclidean and Manhattan).
Talking Points: Why FAISS computes cosine similarity as an inner product over L2-normalized vectors.
Resume Bullet: "Authored a vector-similarity reference implementation (5 metrics) and visualized high-dimensional distance concentration on synthetic and real embedding distributions."
Extensions: Add MIPS-via-LSH demo (precursor to Phase 7).
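
A minimal sketch of the five metrics and of the distance-concentration effect, assuming dense NumPy vectors. The helper `relative_contrast`, its sample size, and the choice to define Jaccard over the sets of nonzero coordinates are assumptions for this example; plotting is omitted.

```python
import numpy as np


def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


def dot(a, b):
    return float(a @ b)


def euclidean(a, b):
    return float(np.linalg.norm(a - b))


def manhattan(a, b):
    return float(np.abs(a - b).sum())


def jaccard(a, b):
    """Jaccard similarity over the sets of nonzero coordinates."""
    a_set, b_set = a != 0, b != 0
    union = np.logical_or(a_set, b_set).sum()
    if union == 0:
        return 1.0  # convention chosen here: two empty sets are identical
    return float(np.logical_and(a_set, b_set).sum() / union)


def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def relative_contrast(dim, n=500, seed=0):
    """(max - min) / min over distances from one Gaussian query to n points."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n, dim))
    query = rng.standard_normal(dim)
    d = np.linalg.norm(points - query, axis=1)
    return float((d.max() - d.min()) / d.min())


rng = np.random.default_rng(0)
a, b = rng.standard_normal((2, 64))
u, v = l2_normalize(a), l2_normalize(b)

# On unit vectors, cosine equals the plain inner product, and squared
# Euclidean distance is a monotone function of it: ||u - v||^2 = 2 - 2 cos.
assert np.isclose(cosine(a, b), dot(u, v))
assert np.isclose(euclidean(u, v) ** 2, 2 - 2 * cosine(u, v))

for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

The printed contrast shrinks as the dimension grows, meaning the nearest and farthest neighbors of a random query become harder to tell apart. That is the concentration effect the histograms in step 3 are meant to make visible.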

Lab 04 — PyTorch Essentials & Autograd

Goal: Become fluent with tensors, autograd, and a from-scratch training loop on a toy regression problem.
Concepts: Tensor creation, broadcasting, indexing, requires_grad, computational graph, .backward(), optim.SGD, optim.AdamW, batching, DataLoader.
Steps:
  1) Tensor playground (10 broadcasting puzzles).
  2) Implement linear regression manually with autograd.
  3) Wrap as nn.Module.
  4) Train on synthetic data.
  5) Move to GPU; compare wall-clock.
Stack: PyTorch 2.x
Output: tensor_puzzles.py, linear_regression.py, a loss curve PNG.
How to Test: Closed-form least-squares solution must match autograd solution within 1e-3.
Talking Points: What .detach() does. Why with torch.no_grad(): matters in eval. How .backward() accumulates.
Resume Bullet: "Implemented from-scratch autograd-based linear regression in PyTorch, validated against closed-form NumPy least-squares within 1e-3, with CPU/GPU benchmark comparison."
Extensions: Add manual backward (no autograd) for a 2-layer MLP (sets up Phase 3).
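
Steps 2 to 4 and the closed-form check might look roughly like the sketch below. The data sizes, true coefficients, learning rate, and step count are illustrative assumptions, and the closed-form solution is computed with torch.linalg.lstsq rather than NumPy to keep the example in one library. It runs on CPU only; the GPU comparison in step 5 is not shown.

```python
import torch

torch.manual_seed(0)

# Synthetic data: y = X @ w_true + b_true + small noise (values are illustrative).
n, d = 256, 3
X = torch.randn(n, d)
w_true = torch.tensor([[2.0], [-1.0], [0.5]])
b_true = 0.3
y = X @ w_true + b_true + 0.01 * torch.randn(n, 1)

# Step 2: manual gradient descent driven by autograd.
w = torch.zeros(d, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lr = 0.1
for _ in range(500):
    loss = ((X @ w + b - y) ** 2).mean()
    loss.backward()              # fills w.grad and b.grad
    with torch.no_grad():        # the update itself must not be tracked
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()               # otherwise the next backward() adds to old grads
    b.grad.zero_()


# Step 3: the same model wrapped as an nn.Module and trained with optim.SGD.
class LinearRegression(torch.nn.Module):
    def __init__(self, in_features):
        super().__init__()
        self.linear = torch.nn.Linear(in_features, 1)

    def forward(self, x):
        return self.linear(x)


model = LinearRegression(d)
opt = torch.optim.SGD(model.parameters(), lr=lr)
for _ in range(500):
    opt.zero_grad()
    torch.nn.functional.mse_loss(model(X), y).backward()
    opt.step()

# Closed-form least squares on the augmented matrix [X, 1].
A = torch.cat([X, torch.ones(n, 1)], dim=1)
theta = torch.linalg.lstsq(A, y).solution  # shape (d + 1, 1): weights, then bias

manual = torch.cat([w.detach(), b.detach().view(1, 1)], dim=0)
module = torch.cat(
    [model.linear.weight.detach().T, model.linear.bias.detach().view(1, 1)], dim=0
)
max_gap = (manual - theta).abs().max().item()
module_gap = (module - theta).abs().max().item()
print(f"manual vs closed form: {max_gap:.2e}")
print(f"module vs closed form: {module_gap:.2e}")
```

Both training loops minimize the same mean-squared error that least squares solves exactly, so on well-conditioned data like this the gaps should fall well inside the 1e-3 tolerance.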

Deliverables Checklist

  • Three tokenizers (whitespace / regex / byte) with round-trip tests
  • TF-IDF search engine over 10k docs, validated against sklearn
  • Pairwise-distance visualization notebook
  • Linear regression in pure PyTorch with autograd

Interview Relevance

  • "How does TF-IDF differ from a dense embedding retrieval?" (after this phase you can speak to both sides)
  • "Walk me through autograd."
  • "What does .detach() do?"
  • "Why is byte-level tokenization useful?"
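
For the autograd questions above, a few lines are often enough to anchor an answer. This is a small sketch on a scalar tensor, with variable names chosen purely for illustration.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# backward() adds into .grad instead of overwriting it.
(x ** 2).backward()
first = x.grad.item()        # d(x^2)/dx at x = 3 is 6
(x ** 2).backward()
second = x.grad.item()       # 6 + 6 = 12, because gradients accumulate
x.grad.zero_()

# .detach() returns a tensor cut off from the graph, so it acts as a constant.
y = x ** 2
z = y.detach() * x           # treated as 9 * x rather than x^3
z.backward()
detached = x.grad.item()     # 9, not 3 * x^2 = 27
x.grad.zero_()

# Inside torch.no_grad() no graph is recorded at all.
with torch.no_grad():
    w = x * 2
tracked = w.requires_grad    # False

print(first, second, detached, tracked)
```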