Phase 1 — Foundations: Text, Math, PyTorch

Difficulty: ⭐⭐☆☆☆ | Estimated Time: 1–2 weeks
Roles supported: All (non-negotiable foundation).

Why This Phase Exists

Every modern LLM stack — from FlashAttention to vLLM — is built on three things: (1) representing text as numbers, (2) doing linear algebra on those numbers efficiently, and (3) using PyTorch's autograd to learn the parameters. If you cannot tokenize a string, build a TF-IDF index, or write a clean PyTorch nn.Module, the rest of the curriculum will collapse under you.

This phase rebuilds the floor.

Concepts

Text representation: characters → words → subwords
Tokenization: whitespace, regex, byte-level (BPE preview)
Vocabulary construction, OOV handling, special tokens
Bag-of-words (BoW) and term-document matrices
TF-IDF derivation and intuition
Cosine similarity, Euclidean distance, dot-product retrieval
Sparse vs dense vector representations
PyTorch tensors, broadcasting, indexing
Autograd: forward, backward, .grad, .detach(), torch.no_grad()
CPU/GPU dispatch, .to(device), pinned memory basics
Linear algebra refresher: matmul, transpose, einsum, eigendecomposition
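
For reference, the TF-IDF and similarity items above can be written out as follows. This is one common convention, not the only one: the smoothed IDF shown here is the form scikit-learn's `TfidfVectorizer` uses by default, and the symbols $N$ (number of documents), $\mathrm{df}(t)$ (number of documents containing term $t$), and $\mathrm{tf}(t, d)$ (count of $t$ in document $d$) are notation introduced for this sketch.

```latex
\begin{align*}
\mathrm{idf}(t) &= \ln\frac{1 + N}{1 + \mathrm{df}(t)} + 1 \\
w(t, d) &= \mathrm{tf}(t, d)\,\mathrm{idf}(t)
  \qquad \text{(sublinear variant: replace } \mathrm{tf} \text{ with } 1 + \ln \mathrm{tf} \text{ when } \mathrm{tf} > 0\text{)} \\
\cos(u, v) &= \frac{u \cdot v}{\lVert u \rVert_2 \, \lVert v \rVert_2} \\
\lVert u - v \rVert_2^2 &= 2 - 2\cos(u, v)
  \qquad \text{when } \lVert u \rVert_2 = \lVert v \rVert_2 = 1
\end{align*}
```

The last identity is why cosine, dot-product, and Euclidean rankings all agree once vectors are L2-normalized.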

Labs

Lab 01 — Tokenization From Scratch

Field	Value
Goal	Build three tokenizers (whitespace, regex, byte-level) and benchmark on a real corpus.
Concepts	Tokenization tradeoffs, vocab construction, OOV, byte fallback, special tokens.
Steps	1) Implement `WhitespaceTokenizer.encode/decode`. 2) Add a regex tokenizer matching GPT-2's pre-tokenization regex. 3) Implement a byte-level tokenizer (256-symbol vocab). 4) Build vocab from a corpus with frequency cutoff. 5) Round-trip test: `decode(encode(s)) == s`.
Stack	Python stdlib, `regex` library
Datasets	Tiny Shakespeare (1 MB), WikiText-2 (12 MB)
Output	A `tokenizer.py` module with 3 classes, plus a benchmark report (vocab size, compression ratio, encode speed).
How to Test	Round-trip property tests; compare token counts against `tiktoken` (GPT-2 encoding).
Talking Points	Why byte-level tokenizers can encode any string. Why GPT-2's regex splits contractions. The compression-vs-vocab-size tradeoff.
Resume Bullet	"Implemented three tokenizer variants (whitespace, regex, byte-level) with round-trip-safe encode/decode and benchmarked compression ratio (1.0 → 3.7×) and encode throughput on a 12 MB corpus."
Extensions	Add unicode normalization (NFC/NFKC); plot vocab-size-vs-coverage curves.
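
As a starting point for steps 1, 3, and 5, here is a minimal sketch of two of the three tokenizers using only the standard library. The class names and the `<unk>` token are illustrative choices, and the whitespace version is deliberately simplified: it is lossy on repeated whitespace and on out-of-vocabulary words, which is the kind of failure the round-trip test is meant to expose.

```python
from collections import Counter


class ByteTokenizer:
    """Byte-level tokenizer: each UTF-8 byte is one token (ids 0-255)."""

    vocab_size = 256

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))

    def decode(self, ids: list[int]) -> str:
        return bytes(ids).decode("utf-8")


class WhitespaceTokenizer:
    """Word-level tokenizer with a frequency cutoff and an <unk> fallback."""

    def __init__(self, corpus: str, min_freq: int = 1):
        counts = Counter(corpus.split())
        kept = sorted(w for w, c in counts.items() if c >= min_freq)
        self.itos = ["<unk>"] + kept          # id 0 is reserved for OOV words
        self.stoi = {w: i for i, w in enumerate(self.itos)}

    def encode(self, text: str) -> list[int]:
        return [self.stoi.get(w, 0) for w in text.split()]

    def decode(self, ids: list[int]) -> str:
        return " ".join(self.itos[i] for i in ids)


byte_tok = ByteTokenizer()
sample = "naïve café 🙂"
byte_round_trip_ok = byte_tok.decode(byte_tok.encode(sample)) == sample

ws_tok = WhitespaceTokenizer("to be or not to be")
ws_round_trip_ok = ws_tok.decode(ws_tok.encode("to  be")) == "to  be"
```

The byte tokenizer round-trips any string because every string has a UTF-8 byte encoding and all 256 byte values are in the vocabulary. The whitespace tokenizer fails the same test on the double space, so a round-trip-safe version has to keep the separators (for example, by tokenizing whitespace runs as well).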

Lab 02 — Bag-of-Words & TF-IDF From Scratch

Field	Value
Goal	Implement TF-IDF and a cosine-similarity search engine over a Wikipedia subset, with no sklearn.
Concepts	Term frequency, document frequency, sublinear TF, IDF smoothing, sparse matrix construction (CSR), cosine similarity.
Steps	1) Build a sparse term-document matrix with `scipy.sparse.csr_matrix`. 2) Compute TF (raw + log-normalized). 3) Compute IDF with smoothing. 4) L2-normalize rows. 5) Cosine similarity = sparse dot product. 6) Build a top-k search function.
Stack	NumPy, SciPy sparse, `regex`
Datasets	A 10k-document slice of Wikipedia or 20 Newsgroups
Output	A CLI `search.py "your query" --top 5` that returns ranked docs with scores.
How to Test	Query for known topics, manually validate. Compare against sklearn's `TfidfVectorizer` (cosine within 1e-6).
Talking Points	Why IDF uses log. Why we L2-normalize. When TF-IDF beats embeddings (short, exact-match queries; cold start; explainability).
Resume Bullet	"Built a TF-IDF + cosine-similarity search engine over 10k Wikipedia docs from scratch in NumPy/SciPy; query latency P99 under 8 ms; results match sklearn within 1e-6."
Extensions	Add BM25 scoring (used heavily in Phase 7); add query expansion.
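
The steps above can be sketched roughly as follows. This is a small illustration under stated assumptions, not a reference solution: tokenization is a plain lowercase `split()`, TF is the raw count, IDF uses the smoothed form `ln((1 + N) / (1 + df)) + 1`, and the function names `build_tfidf` and `search` are hypothetical.

```python
from collections import Counter

import numpy as np
from scipy.sparse import csr_matrix, diags


def build_tfidf(docs):
    """Return (L2-normalized TF-IDF matrix in CSR form, vocab, idf vector)."""
    tokenized = [d.lower().split() for d in docs]
    terms = sorted({t for doc in tokenized for t in doc})
    vocab = {t: i for i, t in enumerate(terms)}

    rows, cols, vals = [], [], []
    for r, doc in enumerate(tokenized):
        for term, count in Counter(doc).items():
            rows.append(r)
            cols.append(vocab[term])
            vals.append(count)
    shape = (len(docs), len(vocab))
    tf = csr_matrix((vals, (rows, cols)), shape=shape, dtype=np.float64)

    # Each stored entry is one (doc, term) pair, so counting column
    # indices gives the document frequency of every term.
    df = np.bincount(tf.indices, minlength=len(vocab))
    idf = np.log((1 + len(docs)) / (1 + df)) + 1.0

    tfidf = tf @ diags(idf)
    norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())
    norms[norms == 0] = 1.0                   # leave empty docs as zero rows
    return csr_matrix(diags(1.0 / norms) @ tfidf), vocab, idf


def search(query, matrix, vocab, idf, top_k=5):
    """Rank documents by cosine similarity; OOV query terms are ignored."""
    counts = Counter(t for t in query.lower().split() if t in vocab)
    q = np.zeros(len(vocab))
    for term, count in counts.items():
        q[vocab[term]] = count * idf[vocab[term]]
    norm = np.linalg.norm(q)
    if norm == 0:
        return []
    scores = matrix @ (q / norm)              # rows are unit length: dot = cosine
    order = np.argsort(-scores)[:top_k]
    return [(int(i), float(scores[i])) for i in order if scores[i] > 0]


docs = [
    "the cat sat on the mat",
    "dogs chase the cat",
    "stock markets fell sharply",
]
matrix, vocab, idf = build_tfidf(docs)
results = search("cat mat", matrix, vocab, idf)
```

Because the rows are normalized once at index time, each query costs a single sparse matrix-vector product. Matching sklearn to 1e-6 would additionally require matching its tokenizer and settings, which this sketch does not attempt.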

Lab 03 — Cosine Similarity & Retrieval Playground

Field	Value
Goal	Internalize vector similarity by implementing 5 metrics and visualizing failure modes.
Concepts	Cosine vs dot product vs Euclidean, normalization invariants, curse of dimensionality.
Steps	1) Implement cosine, dot, Euclidean, Manhattan, Jaccard. 2) Generate synthetic vectors (Gaussian, sparse, normalized). 3) Plot pairwise distance distributions. 4) Show cosine ≡ dot when L2-normalized.
Stack	NumPy, matplotlib
Output	`metrics.py` + a notebook of histograms.
How to Test	Property tests (cosine in [-1, 1], symmetric, triangle inequality where applicable).
Talking Points	Why FAISS handles cosine similarity as an inner-product search over L2-normalized vectors rather than as a separate metric.
Resume Bullet	"Authored a vector-similarity reference implementation (5 metrics) and visualized high-dimensional distance concentration on synthetic and real embedding distributions."
Extensions	Add MIPS-via-LSH demo (precursor to Phase 7).
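
One possible shape for `metrics.py` is sketched below, along with a small numerical check of step 4 and of distance concentration. The function names, the set-based Jaccard definition, and the `relative_contrast` helper are assumptions made for this illustration.

```python
import numpy as np


def dot(u, v):
    return float(u @ v)


def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def euclidean(u, v):
    return float(np.linalg.norm(u - v))


def manhattan(u, v):
    return float(np.abs(u - v).sum())


def jaccard(a, b):
    """Jaccard similarity between two sets (e.g. sets of token ids)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def relative_contrast(dim, n=500, seed=0):
    """(max - min) / min of distances from one query to n Gaussian points."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n, dim))
    query = rng.standard_normal(dim)
    d = np.linalg.norm(points - query, axis=1)
    return float((d.max() - d.min()) / d.min())


rng = np.random.default_rng(0)
u, v = rng.standard_normal(64), rng.standard_normal(64)
u_hat, v_hat = u / np.linalg.norm(u), v / np.linalg.norm(v)

# On unit vectors, cosine and dot product are the same number.
same_on_unit_vectors = np.isclose(cosine(u, v), dot(u_hat, v_hat))

# Contrast between nearest and farthest neighbor shrinks as dimension grows.
low_dim, high_dim = relative_contrast(2), relative_contrast(1000)
```

Plotting `relative_contrast` over a range of dimensions gives a compact view of the curse of dimensionality: in high dimensions the nearest and farthest points end up at nearly the same distance.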

Lab 04 — PyTorch Essentials & Autograd

Field	Value
Goal	Become fluent with tensors, autograd, and a from-scratch training loop on a toy regression problem.
Concepts	Tensor creation, broadcasting, indexing, `requires_grad`, computational graph, `.backward()`, `optim.SGD`, `optim.AdamW`, batching, `DataLoader`.
Steps	1) Tensor playground (10 broadcasting puzzles). 2) Implement linear regression manually with autograd. 3) Wrap as `nn.Module`. 4) Train on synthetic data. 5) Move to GPU; compare wall-clock.
Stack	PyTorch 2.x
Output	`tensor_puzzles.py`, `linear_regression.py`, a loss curve PNG.
How to Test	Closed-form least-squares solution must match autograd solution within 1e-3.
Talking Points	What `.detach()` does. Why `with torch.no_grad():` matters in eval. Why `.backward()` accumulates into `.grad`, and why gradients must be zeroed between steps.
Resume Bullet	"Implemented from-scratch autograd-based linear regression in PyTorch, validated against closed-form NumPy least-squares within 1e-3, with CPU/GPU benchmark comparison."
Extensions	Add manual backward (no autograd) for a 2-layer MLP — sets up Phase 3.
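
Step 2 and the closed-form check can be sketched as below. The data shape, learning rate, step count, and seed are arbitrary choices for this illustration, and plain gradient descent stands in for `optim.SGD` so that the update rule is visible.

```python
import torch

torch.manual_seed(0)

# Synthetic data: y = X @ true_w + true_b + small noise.
n, d = 256, 3
X = torch.randn(n, d)
true_w = torch.tensor([2.0, -1.0, 0.5])
true_b = 0.3
y = X @ true_w + true_b + 0.01 * torch.randn(n)

# Parameters tracked by autograd.
w = torch.zeros(d, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

lr = 0.1
for step in range(500):
    loss = ((X @ w + b - y) ** 2).mean()   # mean squared error
    loss.backward()                        # fills w.grad and b.grad
    with torch.no_grad():                  # updates must not be recorded
        w -= lr * w.grad
        b -= lr * b.grad
    w.grad.zero_()                         # .backward() accumulates, so reset
    b.grad.zero_()

# Closed-form least squares on [X, 1] for comparison.
X1 = torch.cat([X, torch.ones(n, 1)], dim=1)
closed_form = torch.linalg.lstsq(X1, y.unsqueeze(1)).solution.squeeze(1)
learned = torch.cat([w, b]).detach()
max_abs_diff = (learned - closed_form).abs().max().item()
```

Wrapping `w` and `b` in an `nn.Module` (step 3) mostly means replacing them with `nn.Linear(d, 1)` and replacing the manual update with an optimizer's `step()` and `zero_grad()`.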

Deliverables Checklist

Three tokenizers (whitespace / regex / byte) with round-trip tests
TF-IDF search engine over 10k docs, validated against sklearn
Pairwise-distance visualization notebook
Linear regression in pure PyTorch with autograd

Interview Relevance

"How does TF-IDF differ from a dense embedding retrieval?" (you can now speak to both sides)
"Walk me through autograd."
"What does .detach() do?"
"Why is byte-level tokenization useful?"
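
For the two autograd questions, a few lines are often enough to anchor an answer. The snippet below is a minimal illustration on a scalar; the variable names are arbitrary.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)

# 1) .backward() accumulates: two passes add their gradients into x.grad.
(x ** 2).backward()               # d/dx x^2 = 2x = 6
(x ** 2).backward()               # adds another 6
accumulated = x.grad.item()       # 12.0
x.grad.zero_()

# 2) .detach() returns a tensor with the same data but no graph history,
#    so autograd treats it as a constant.
y = x ** 2
z = y.detach() * x                # behaves like 9 * x, so dz/dx = 9
z.backward()
detached_grad = x.grad.item()     # 9.0 (it would be 27.0 without .detach())

# 3) Inside torch.no_grad(), operations are not recorded at all.
with torch.no_grad():
    out = x * 2
tracked = out.requires_grad       # False
```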

LLM Inference Engineer