🛸 Hitchhiker's Guide — Phase 1: Foundations (Text, Math, PyTorch)
Read this if: You have never built an ML model, or you have but you're shaky on tokenization, autograd, or why
cosine ≡ dot product on normalized vectors. By the end you should be able to explain every line ofphase-01-foundations-text/lab-01-tokenization-from-scratch/solution.pyto a stranger and know why every choice is made.
0. The 30-second mental model
A modern LLM is just three nested operations:
- Tokenize: turn a string into a list of integers (token IDs).
- Embed + Transform: look up an embedding vector per ID, then run a stack of matmul + nonlinearity layers.
- Predict + Sample: produce a probability distribution over the next token, sample from it, append, repeat.
Phase 1 covers (1) and the linear-algebra + PyTorch substrate that (2) and (3) need. Everything else in the curriculum is built on top of these primitives.
1. Prerequisite knowledge
If any bullet looks unfamiliar, knock it out first.
1.1 Python (the floor)
You need fluency, not mastery. Specifically:
- Data structures:
list,dict,set,tuple,collections.Counter,collections.defaultdict. Big-O of each. - Iteration:
for/while, comprehensions, generators (yield),itertools(chain,islice,groupby). - Functions: positional vs keyword,
*args/**kwargs, lambdas, decorators,functools.lru_cache. - OOP: classes,
__init__,__call__,__len__,__getitem__, dataclasses,@property. - Files & I/O: context managers (
with open(...)),pathlib, JSON/CSV. - Typing:
from __future__ import annotations,list[int],Optional,TypedDict,Protocol. - Performance hygiene: vectorize with NumPy, avoid Python
forover arrays, profile withcProfile/line_profiler.
References:
- Fluent Python, 2nd ed., Luciano Ramalho — the single best Python book for ML engineers.
- Python Speed/Performance Tips
- Real Python's generators tutorial
1.2 The shell, git, and an editor
- Bash basics (
grep,find,xargs, redirection, pipes). git: branch / commit / rebase / cherry-pick / bisect / reflog.- VS Code or Neovim with Python LSP, debugger configured, ability to set breakpoints.
tmuxorscreenfor long-running training jobs.
1.3 NumPy
NumPy is the lingua franca. If you can't think in arrays, you can't think in tensors.
np.array, dtype, shape, stride.- Broadcasting — read the rules until they are reflexes.
- Slicing, fancy indexing, boolean masks.
np.einsum(your eventual best friend; covered in §3.4).- Random: seeded
Generator(not legacynp.random.*).
References:
- Python for Data Analysis, 3rd ed., Wes McKinney
- From Python to Numpy by Nicolas Rougier — free, deep
- 100 NumPy exercises
1.4 Math you actually need
You don't need a PhD. You need:
| Topic | Depth | Why |
|---|---|---|
| Vector & matrix algebra | Solid | All of ML is matmuls |
| Probability basics | Solid | Cross-entropy, sampling, calibration |
| Calculus (chain rule) | Conceptual | Backprop is recursive chain rule |
| Information theory (entropy, KL) | Solid | Loss functions, RLHF |
| Statistics (mean/var/CLT, hypothesis tests) | Solid | Eval rigor, A/B tests |
Books that won't waste your time:
- Math: Mathematics for Machine Learning (Deisenroth, Faisal, Ong) — free PDF, read Ch. 2–6.
- Probability: Introduction to Probability (Blitzstein, Hwang) — first 7 chapters.
- Linear algebra (geometric intuition): 3Blue1Brown's Essence of Linear Algebra YouTube series. Watch all 15 videos. Mandatory.
- Calculus (intuition): 3Blue1Brown's Essence of Calculus.
2. Concept 1 — Text → Numbers (Tokenization)
2.1 Why we tokenize at all
Neural networks ingest tensors of numbers. We must convert text (a sequence of Unicode codepoints) into a sequence of small integer IDs that index into an embedding table (a learned matrix E ∈ ℝ^{V × d}). The choice of what counts as a "token" is the most consequential design decision in NLP. It determines:
- Sequence length (and thus compute cost — attention is O(T²)).
- Vocabulary size (which scales the embedding table and the LM head).
- Out-of-vocabulary (OOV) behavior.
- Whether the model can spell, count letters, do arithmetic, code.
2.2 The four families of tokenization
| Family | Unit | Vocab size | Sequence length | OOV? |
|---|---|---|---|---|
| Character | one Unicode char | ~150 (English), ~10k+ (CJK) | very long | none |
| Word | whitespace-split | huge (>1M) | short | massive |
| Subword (BPE/WordPiece/Unigram) | data-driven chunks | 30k–200k | medium | none (with byte fallback) |
| Byte-level | one byte | exactly 256 (+ merges) | longest | impossible |
Character: Simple, no OOV, but loses morphological structure. RNN char-LMs work but transformers struggle (sequences too long for O(T²) attention).
Word: Tokenizing on whitespace and punctuation. Big problem: "running", "ran", "runs" are unrelated tokens. And any new word is <UNK>. This is what Word2Vec used.
Subword (BPE — Byte Pair Encoding): The 2016 breakthrough (Sennrich et al.). Start with characters; iteratively merge the most frequent adjacent pair into a new symbol. Vocabulary becomes a mix of common whole words and morpheme-like fragments. "tokenization" might split into ["token", "ization"]. We'll implement this in Phase 4–5.
Byte-level BPE (GPT-2 onwards): start with 256 single bytes instead of Unicode characters, then BPE on top. Every UTF-8 string is encodable, period. No <UNK> exists. Lab 01 builds the byte-level skeleton.
2.3 GPT-2's pre-tokenization regex (worth understanding)
GPT-2 doesn't BPE-merge across word boundaries blindly. It first splits text using this regex:
GPT2_PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
Translated:
'(?:[sdmt]|ll|ve|re)— contractions ('s,'t,'ll, …)?\p{L}+— letters with optional leading space?\p{N}+— digits?[^\s\p{L}\p{N}]+— punctuation\s+(?!\S)|\s+— whitespace handling
This pre-split prevents merges like the_cat becoming a single token, while still letting BPE merge inside word groups. Lab 01 implements this.
2.4 Why "5 + 3 = 8" sometimes confuses LLMs
Tokenization shapes capabilities. If "857" tokenizes as [8, 57] but "858" as [85, 8], the model has to learn arithmetic across token boundaries that depend on the input. This is why models historically struggled with multi-digit math. Modern fixes: digit-by-digit tokenization (Llama-3), or trained with right-to-left number reversal.
Read: Karpathy's Let's build the GPT tokenizer video — 2 hours, the single best tokenization resource on the internet.
2.5 References
- Sennrich, Haddow, Birch (2016), Neural Machine Translation of Rare Words with Subword Units — the BPE paper.
- Kudo (2018), Subword Regularization — Unigram tokenizer.
- HuggingFace tokenizers documentation.
- OpenAI's tiktoken source — read it.
3. Concept 2 — Linear Algebra for Neural Networks
3.1 What you must know cold
A neural network is a sequence of affine transforms (y = W x + b) interleaved with elementwise nonlinearities (ReLU, GELU, …). All the action is in:
- Matrix multiplication
(M, K) @ (K, N) → (M, N). Memorize the inner-dimension rule. - Transpose
Aᵀ. For batched tensors think of shape gymnastics. - Outer product
u vᵀ: a rank-1 matrix. - Inner / dot product
uᵀ v = Σ u_i v_i: a scalar. - Norm
‖v‖₂ = √(vᵀ v). L2 norm. - Cosine similarity
cos(u, v) = (uᵀ v) / (‖u‖ ‖v‖).
Key identity for retrieval: if ‖u‖ = ‖v‖ = 1, then cos(u, v) = uᵀ v. That's why FAISS and Qdrant store normalized vectors and use inner-product search — it's the same math, but cheaper.
3.2 Eigenvalues, SVD — when do you actually need them?
You will not compute eigenvalues by hand. But you'll meet them in:
- PCA (Phase 2 dimensionality reduction).
- Spectral norm / weight-norm regularization.
- Initialization theory (kaiming/xavier are about preserving variance, related to spectra).
- Understanding why attention has a "rank collapse" problem (Dong et al. 2021).
For now: know that SVD decomposes any matrix as A = U Σ Vᵀ with U, V orthogonal and Σ diagonal of singular values. LoRA (Phase 6) is a low-rank approximation justified by the observation that fine-tuning updates have low effective rank.
3.3 Probability and information theory primer
- Random variable, distribution, density (continuous) vs mass (discrete).
- Bayes:
P(A | B) = P(B | A) P(A) / P(B). - Expectation
𝔼[X] = Σ x P(x). - Variance
Var(X) = 𝔼[X²] - 𝔼[X]². - Entropy
H(p) = -Σ p log p— the average "surprise". Maximum at the uniform distribution. - Cross-entropy
H(p, q) = -Σ p log q. The loss function for classification (and thus next-token prediction). - KL divergence
D_KL(p ‖ q) = Σ p log(p/q). Distance-like (asymmetric); shows up in RLHF (PPO's KL constraint), DPO, distillation.
When an LLM minimizes cross-entropy on next tokens, it is minimizing D_KL(data ‖ model) + H(data) — and H(data) is fixed, so it's equivalently doing maximum likelihood.
3.4 Einstein summation (einsum) — the universal hammer
Once you can read einsum, every transformer paper becomes 5× clearer.
# Standard matmul
torch.einsum("ik,kj->ij", A, B) # == A @ B
# Batched matmul
torch.einsum("bik,bkj->bij", A, B) # == torch.bmm(A, B)
# Attention scores
torch.einsum("bhid,bhjd->bhij", Q, K) # == Q @ K.transpose(-1,-2)
# Multi-head value gather
torch.einsum("bhij,bhjd->bhid", attn, V)
Rules: indices that appear in inputs but not in output get summed over; indices that appear in both inputs and output are batched.
3.5 References
- 3Blue1Brown's Essence of Linear Algebra (mandatory).
- Strang, Introduction to Linear Algebra (book + MIT 18.06 lectures on YouTube).
einsumis all you need by Tim Rocktäschel.- Information Theory, Inference, and Learning Algorithms, MacKay — free PDF; read Ch. 2.
4. Concept 3 — Sparse Vector Retrieval (TF-IDF and BM25)
4.1 The problem
Given a query "best neural network books", rank N documents by relevance. The dense embedding approach (Phase 7) is overkill for many real workloads — a sparse keyword model gets you 80% of the way and is interpretable, fast, and updatable.
4.2 TF-IDF derivation
Term Frequency tf(t, d): how often term t appears in document d. Often log-scaled: tf' = 1 + log(tf) to dampen high counts.
Inverse Document Frequency idf(t) = log(N / df(t)), where df(t) is the number of documents containing t. Common terms ("the") get low weight; rare terms ("transformer") get high weight. The log is justified information-theoretically: if t appears in df of N documents, knowing it occurred carries -log(df/N) bits of information.
TF-IDF score of (t, d): tfidf(t, d) = tf'(t, d) · idf(t).
Document vector: a sparse vector indexed by vocabulary, with tfidf(t, d) at each position. L2-normalize so cosine similarity becomes a pure dot product.
Query: same transform, then score(d) = q · d. Top-k.
4.3 BM25 — the workhorse
BM25 is TF-IDF with two essential improvements: term-frequency saturation (the 50th occurrence of "neural" doesn't add 50× the signal) and length normalization (longer docs aren't unfairly favored). Formula:
$$ \text{BM25}(q, d) = \sum_{t \in q} \text{idf}(t) \cdot \frac{f(t, d)(k_1 + 1)}{f(t, d) + k_1 (1 - b + b \cdot |d|/\overline{|d|})} $$
with typical k_1 = 1.2, b = 0.75. This is what Elasticsearch / OpenSearch / Lucene actually use. Phase 7 covers hybrid search (BM25 + dense).
4.4 References
- Manning, Raghavan, Schütze, Introduction to Information Retrieval — free at nlp.stanford.edu/IR-book. Chapters 1, 6, 7.
- Robertson & Zaragoza (2009), The Probabilistic Relevance Framework: BM25 and Beyond.
5. Concept 4 — PyTorch and Autograd
5.1 Tensors
Think of a torch.Tensor as np.ndarray + (a) GPU dispatch and (b) automatic differentiation. The shape and dtype rules are nearly identical.
import torch
x = torch.zeros(3, 4) # shape (3, 4), float32
x = x.to("cuda") # device move
x = x.to(torch.bfloat16) # dtype cast
y = x[:, :2] # view; shares memory
z = x.contiguous().view(12) # reshape; may copy
Strides matter: a transpose returns a view with non-contiguous strides; some ops require .contiguous() first.
5.2 Autograd — the only paragraph that matters
When you do y = f(x) with x.requires_grad=True, PyTorch builds a computational graph of all intermediate ops. When you call y.backward(), it walks the graph in reverse and fills .grad on every leaf tensor that participates. That's it. This is just the chain rule executed automatically:
If L = f(g(h(x))) then dL/dx = f'(g(h(x))) · g'(h(x)) · h'(x).
PyTorch records each f, g, h and replays the derivatives backward.
x = torch.tensor(2.0, requires_grad=True)
y = (x ** 3 + 2 * x).sin() # y = sin(x³ + 2x)
y.backward() # populates x.grad
print(x.grad) # cos(12) * (3*4 + 2) = cos(12) * 14
Key APIs:
loss.backward()— compute gradients.optimizer.zero_grad()— clear them before the next step (gradients accumulate by default).optimizer.step()— apply the update.with torch.no_grad():— disable graph building (use in eval / inference). Saves memory.tensor.detach()— return a tensor that shares storage but is excluded from autograd.
5.3 The canonical training loop
model = MyModel().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for epoch in range(N):
for batch in loader:
opt.zero_grad()
logits = model(batch.x.to(device))
loss = F.cross_entropy(logits, batch.y.to(device))
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
opt.step()
Memorize this. Every script in this curriculum is a variation on these 7 lines.
5.4 What nn.Module actually is
A class that registers parameters (nn.Parameter) and submodules so that .parameters() recursively collects everything. State dict (model.state_dict()) is just a flat OrderedDict of all params + buffers — that's how saving/loading works.
5.5 References
- PyTorch Tutorials — start with Deep Learning with PyTorch: A 60 Minute Blitz.
- Deep Learning with PyTorch, Stevens, Antiga, Viehmann (Manning).
- Andrej Karpathy's Neural Networks: Zero to Hero — the micrograd lecture builds autograd from scratch in 100 lines. Mandatory.
- PyTorch internals by Edward Yang.
6. How the labs in this phase exercise these ideas
| Lab | Reinforces |
|---|---|
lab-01-tokenization-from-scratch | Concepts 1, 2.5 (regex), Python data structures, dataclasses, byte-level encoding |
| Lab 02 (TF-IDF) — spec only | Concepts 4, sparse matrices (SciPy CSR), L2 normalization |
| Lab 03 (similarity playground) — spec only | Concept 3.1, dimensionality intuition |
| Lab 04 (PyTorch essentials) — spec only | Concept 5, autograd, training loop |
For Lab 01 specifically, when you read the solution, ask yourself:
- Why does
RegexTokenizerneed that exact regex (and not just\w+)? - What happens if I encode an emoji with
WhitespaceTokenizervsByteLevelTokenizer? - Why is the byte-level vocab exactly 256 before adding merges?
- How would I extend this to BPE (Phase 4 will)?
7. Common interview questions on Phase 1 material
- Walk me through what
.backward()does. - Why is byte-level tokenization useful? Are there downsides?
- Why do BPE models sometimes count the letters in "strawberry" wrong?
- What's the difference between cosine similarity and dot product? When are they equivalent?
- Why is
H(p, q) = -Σ p log qthe right loss for classification? - What does
with torch.no_grad():do — and why does it matter for memory? - Implement TF-IDF on a whiteboard.
- What's the time complexity of self-attention in sequence length T? Why does that matter?
- Explain the difference between a view and a copy in PyTorch.
- Given a
(B, T, C)tensor, how do you compute per-batch row-wise softmax witheinsum/ broadcasting?
8. Going from solid → exceptional
After Phase 1, most candidates can do TF-IDF and write a training loop. To stand out:
- Implement BPE end-to-end (training + encoding) before anyone tells you to. Compare your output to
tiktokenbyte-for-byte. - Read the GPT-2 tokenizer source in
tiktokenand the original OpenAI repo. Understand the byte-to-unicode mapping (bytes_to_unicode()function — it's a clever hack to keep BPE in printable Unicode). - Read
microgradby Karpathy (~150 lines). You should be able to reimplement it from scratch in 1 hour by the end of the phase. - Profile a training step with
torch.profilerand identify the kernel-launch overhead vs compute. - Write a 1-page essay on "Why does tokenization shape model capability?" — using examples from the literature.
9. Recommended weekly cadence
| Day | Activity |
|---|---|
| Mon | Watch 3Blue1Brown linear algebra videos 1–5; read Phase 1 README |
| Tue | Watch micrograd lecture; reimplement micrograd from blank file |
| Wed | Lab 01 (tokenization) — solve lab.py without looking at solution.py |
| Thu | Lab 02 (TF-IDF) — write your own; compare to sklearn |
| Fri | Lab 03 (similarity playground) + Lab 04 (PyTorch) |
| Sat | Read Karpathy tokenizer video (2 hours) — fill any gaps |
| Sun | Quiz yourself on the 10 interview questions; write answers in your own words |
Move on to Phase 2 only when you can write the BPE training loop, the TF-IDF formula, and a PyTorch training loop on a whiteboard with no reference.