Every modern LLM stack — from FlashAttention to vLLM — is built on three things: (1) representing text as numbers, (2) doing linear algebra on those numbers efficiently, and (3) using PyTorch's autograd to learn the parameters. If you cannot tokenize a string, build a TF-IDF index, or write a clean PyTorch nn.Module, the rest of the curriculum will collapse under you.
Build three tokenizers (whitespace, regex, byte-level) and benchmark on a real corpus.
Concepts
Tokenization tradeoffs, vocab construction, OOV, byte fallback, special tokens.
Steps
1) Implement WhitespaceTokenizer.encode/decode. 2) Add a regex tokenizer matching GPT-2's pre-tokenization regex. 3) Implement a byte-level tokenizer (256-symbol vocab). 4) Build vocab from a corpus with frequency cutoff. 5) Round-trip test: decode(encode(s)) == s.
Stack
Python stdlib, regex library
Datasets
Tiny Shakespeare (1 MB), WikiText-2 (12 MB)
Output
A tokenizer.py module with 3 classes, plus a benchmark report (vocab size, compression ratio, encode speed).
How to Test
Round-trip property tests; compare token counts against tiktoken (GPT-2 encoding).
Talking Points
Why byte-level tokenizers can encode any string. Why GPT-2's regex splits contractions. The compression-vs-vocab-size tradeoff.
Resume Bullet
"Implemented three tokenizer variants (whitespace, regex, byte-level) with round-trip-safe encode/decode and benchmarked compression ratio (1.0 → 3.7×) and encode throughput on a 12 MB corpus."
A CLI search.py "your query" --top 5 that returns ranked docs with scores.
How to Test
Query for known topics, manually validate. Compare against sklearn's TfidfVectorizer (cosine within 1e-6).
Talking Points
Why IDF uses log. Why we L2-normalize. When TF-IDF beats embeddings (short, exact-match queries; cold start; explainability).
Resume Bullet
"Built a TF-IDF + cosine-similarity search engine over 10k Wikipedia docs from scratch in NumPy/SciPy; query latency P99 under 8 ms; results match sklearn within 1e-6."
Property tests (cosine in [-1, 1], symmetric, triangle inequality where applicable).
Talking Points
Why FAISS uses inner-product on normalized vectors instead of cosine.
Resume Bullet
"Authored a vector-similarity reference implementation (5 metrics) and visualized high-dimensional distance concentration on synthetic and real embedding distributions."
1) Tensor playground (10 broadcasting puzzles). 2) Implement linear regression manually with autograd. 3) Wrap as nn.Module. 4) Train on synthetic data. 5) Move to GPU; compare wall-clock.
Stack
PyTorch 2.x
Output
tensor_puzzles.py, linear_regression.py, a loss curve PNG.
How to Test
Closed-form least-squares solution must match autograd solution within 1e-3.
Talking Points
What .detach() does. Why with torch.no_grad(): matters in eval. How .backward() accumulates.
Resume Bullet
"Implemented from-scratch autograd-based linear regression in PyTorch, validated against closed-form NumPy least-squares within 1e-3, with CPU/GPU benchmark comparison."
Extensions
Add manual backward (no autograd) for a 2-layer MLP — sets up Phase 3.