🛸 Hitchhiker's Guide — Phase 1: Foundations (Text, Math, PyTorch)

Read this if: You have never built an ML model, or you have but you're shaky on tokenization, autograd, or why cosine ≡ dot product on normalized vectors. By the end you should be able to explain every line of phase-01-foundations-text/lab-01-tokenization-from-scratch/solution.py to a stranger and know why every choice is made.

0. The 30-second mental model

A modern LLM is just three nested operations:

Tokenize: turn a string into a list of integers (token IDs).
Embed + Transform: look up an embedding vector per ID, then run a stack of matmul + nonlinearity layers.
Predict + Sample: produce a probability distribution over the next token, sample from it, append, repeat.

Phase 1 covers (1) and the linear-algebra + PyTorch substrate that (2) and (3) need. Everything else in the curriculum is built on top of these primitives.

1. Prerequisite knowledge

If any bullet looks unfamiliar, knock it out first.

1.1 Python (the floor)

You need fluency, not mastery. Specifically:

Data structures: list, dict, set, tuple, collections.Counter, collections.defaultdict. Big-O of each.
Iteration: for/while, comprehensions, generators (yield), itertools (chain, islice, groupby).
Functions: positional vs keyword, *args/**kwargs, lambdas, decorators, functools.lru_cache.
OOP: classes, __init__, __call__, __len__, __getitem__, dataclasses, @property.
Files & I/O: context managers (with open(...)), pathlib, JSON/CSV.
Typing: from __future__ import annotations, list[int], Optional, TypedDict, Protocol.
Performance hygiene: vectorize with NumPy, avoid Python for over arrays, profile with cProfile/line_profiler.

References:

Fluent Python, 2nd ed., Luciano Ramalho — the single best Python book for ML engineers.
Python Speed/Performance Tips
Real Python's generators tutorial

1.2 The shell, git, and an editor

Bash basics (grep, find, xargs, redirection, pipes).
git: branch / commit / rebase / cherry-pick / bisect / reflog.
VS Code or Neovim with Python LSP, debugger configured, ability to set breakpoints.
tmux or screen for long-running training jobs.

1.3 NumPy

NumPy is the lingua franca. If you can't think in arrays, you can't think in tensors.

np.array, dtype, shape, stride.
Broadcasting — read the rules until they are reflexes.
Slicing, fancy indexing, boolean masks.
np.einsum (your eventual best friend; covered in §3.4).
Random: seeded Generator (not legacy np.random.*).

References:

Python for Data Analysis, 3rd ed., Wes McKinney
From Python to Numpy by Nicolas Rougier — free, deep
100 NumPy exercises

1.4 Math you actually need

You don't need a PhD. You need:

Topic	Depth	Why
Vector & matrix algebra	Solid	All of ML is matmuls
Probability basics	Solid	Cross-entropy, sampling, calibration
Calculus (chain rule)	Conceptual	Backprop is recursive chain rule
Information theory (entropy, KL)	Solid	Loss functions, RLHF
Statistics (mean/var/CLT, hypothesis tests)	Solid	Eval rigor, A/B tests

Books that won't waste your time:

Math: Mathematics for Machine Learning (Deisenroth, Faisal, Ong) — free PDF, read Ch. 2–6.
Probability: Introduction to Probability (Blitzstein, Hwang) — first 7 chapters.
Linear algebra (geometric intuition): 3Blue1Brown's Essence of Linear Algebra YouTube series. Watch all 15 videos. Mandatory.
Calculus (intuition): 3Blue1Brown's Essence of Calculus.

2. Concept 1 — Text → Numbers (Tokenization)

Neural networks ingest tensors of numbers. We must convert text (a sequence of Unicode codepoints) into a sequence of small integer IDs that index into an embedding table (a learned matrix E ∈ ℝ^{V × d}). The choice of what counts as a "token" is the most consequential design decision in NLP. It determines:

Sequence length (and thus compute cost — attention is O(T²)).
Vocabulary size (which scales the embedding table and the LM head).
Out-of-vocabulary (OOV) behavior.
Whether the model can spell, count letters, do arithmetic, code.

2.2 The four families of tokenization

Family	Unit	Vocab size	Sequence length	OOV?
Character	one Unicode char	~150 (English), ~10k+ (CJK)	very long	none
Word	whitespace-split	huge (>1M)	short	massive
Subword (BPE/WordPiece/Unigram)	data-driven chunks	30k–200k	medium	none (with byte fallback)
Byte-level	one byte	exactly 256 (+ merges)	longest	impossible

Character: Simple, no OOV, but loses morphological structure. RNN char-LMs work but transformers struggle (sequences too long for O(T²) attention).

Word: Tokenizing on whitespace and punctuation. Big problem: "running", "ran", "runs" are unrelated tokens. And any new word is <UNK>. This is what Word2Vec used.

Subword (BPE — Byte Pair Encoding): The 2016 breakthrough (Sennrich et al.). Start with characters; iteratively merge the most frequent adjacent pair into a new symbol. Vocabulary becomes a mix of common whole words and morpheme-like fragments. "tokenization" might split into ["token", "ization"]. We'll implement this in Phase 4–5.

Byte-level BPE (GPT-2 onwards): start with 256 single bytes instead of Unicode characters, then BPE on top. Every UTF-8 string is encodable, period. No <UNK> exists. Lab 01 builds the byte-level skeleton.

2.3 GPT-2's pre-tokenization regex (worth understanding)

GPT-2 doesn't BPE-merge across word boundaries blindly. It first splits text using this regex:

GPT2_PAT = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

Translated:

'(?:[sdmt]|ll|ve|re) — contractions ('s, 't, 'll, …)
?\p{L}+ — letters with optional leading space
?\p{N}+ — digits
?[^\s\p{L}\p{N}]+ — punctuation
\s+(?!\S)|\s+ — whitespace handling

This pre-split prevents merges like the_cat becoming a single token, while still letting BPE merge inside word groups. Lab 01 implements this.

2.4 Why "5 + 3 = 8" sometimes confuses LLMs

Tokenization shapes capabilities. If "857" tokenizes as [8, 57] but "858" as [85, 8], the model has to learn arithmetic across token boundaries that depend on the input. This is why models historically struggled with multi-digit math. Modern fixes: digit-by-digit tokenization (Llama-3), or trained with right-to-left number reversal.

Read: Karpathy's Let's build the GPT tokenizer video — 2 hours, the single best tokenization resource on the internet.

2.5 References

Sennrich, Haddow, Birch (2016), Neural Machine Translation of Rare Words with Subword Units — the BPE paper.
Kudo (2018), Subword Regularization — Unigram tokenizer.
HuggingFace tokenizers documentation.
OpenAI's tiktoken source — read it.

3. Concept 2 — Linear Algebra for Neural Networks

3.1 What you must know cold

A neural network is a sequence of affine transforms (y = W x + b) interleaved with elementwise nonlinearities (ReLU, GELU, …). All the action is in:

Matrix multiplication (M, K) @ (K, N) → (M, N). Memorize the inner-dimension rule.
Transpose Aᵀ. For batched tensors think of shape gymnastics.
Outer product u vᵀ: a rank-1 matrix.
Inner / dot product uᵀ v = Σ u_i v_i: a scalar.
Norm ‖v‖₂ = √(vᵀ v). L2 norm.
Cosine similarity cos(u, v) = (uᵀ v) / (‖u‖ ‖v‖).

Key identity for retrieval: if ‖u‖ = ‖v‖ = 1, then cos(u, v) = uᵀ v. That's why FAISS and Qdrant store normalized vectors and use inner-product search — it's the same math, but cheaper.

3.2 Eigenvalues, SVD — when do you actually need them?

You will not compute eigenvalues by hand. But you'll meet them in:

PCA (Phase 2 dimensionality reduction).
Spectral norm / weight-norm regularization.
Initialization theory (kaiming/xavier are about preserving variance, related to spectra).
Understanding why attention has a "rank collapse" problem (Dong et al. 2021).

For now: know that SVD decomposes any matrix as A = U Σ Vᵀ with U, V orthogonal and Σ diagonal of singular values. LoRA (Phase 6) is a low-rank approximation justified by the observation that fine-tuning updates have low effective rank.

3.3 Probability and information theory primer

Random variable, distribution, density (continuous) vs mass (discrete).
Bayes: P(A | B) = P(B | A) P(A) / P(B).
Expectation 𝔼[X] = Σ x P(x).
Variance Var(X) = 𝔼[X²] - 𝔼[X]².
Entropy H(p) = -Σ p log p — the average "surprise". Maximum at the uniform distribution.
Cross-entropy H(p, q) = -Σ p log q. The loss function for classification (and thus next-token prediction).
KL divergence D_KL(p ‖ q) = Σ p log(p/q). Distance-like (asymmetric); shows up in RLHF (PPO's KL constraint), DPO, distillation.

When an LLM minimizes cross-entropy on next tokens, it is minimizing D_KL(data ‖ model) + H(data) — and H(data) is fixed, so it's equivalently doing maximum likelihood.

3.4 Einstein summation (`einsum`) — the universal hammer

Once you can read einsum, every transformer paper becomes 5× clearer.

# Standard matmul
torch.einsum("ik,kj->ij", A, B)   # == A @ B

# Batched matmul
torch.einsum("bik,bkj->bij", A, B)   # == torch.bmm(A, B)

# Attention scores
torch.einsum("bhid,bhjd->bhij", Q, K)   # == Q @ K.transpose(-1,-2)

# Multi-head value gather
torch.einsum("bhij,bhjd->bhid", attn, V)

Rules: indices that appear in inputs but not in output get summed over; indices that appear in both inputs and output are batched.

3.5 References

3Blue1Brown's Essence of Linear Algebra (mandatory).
Strang, Introduction to Linear Algebra (book + MIT 18.06 lectures on YouTube).
einsum is all you need by Tim Rocktäschel.
Information Theory, Inference, and Learning Algorithms, MacKay — free PDF; read Ch. 2.

4. Concept 3 — Sparse Vector Retrieval (TF-IDF and BM25)

4.1 The problem

Given a query "best neural network books", rank N documents by relevance. The dense embedding approach (Phase 7) is overkill for many real workloads — a sparse keyword model gets you 80% of the way and is interpretable, fast, and updatable.

4.2 TF-IDF derivation

Term Frequency tf(t, d): how often term t appears in document d. Often log-scaled: tf' = 1 + log(tf) to dampen high counts.

Inverse Document Frequency idf(t) = log(N / df(t)), where df(t) is the number of documents containing t. Common terms ("the") get low weight; rare terms ("transformer") get high weight. The log is justified information-theoretically: if t appears in df of N documents, knowing it occurred carries -log(df/N) bits of information.

TF-IDF score of (t, d): tfidf(t, d) = tf'(t, d) · idf(t).

Document vector: a sparse vector indexed by vocabulary, with tfidf(t, d) at each position. L2-normalize so cosine similarity becomes a pure dot product.

Query: same transform, then score(d) = q · d. Top-k.

4.3 BM25 — the workhorse

BM25 is TF-IDF with two essential improvements: term-frequency saturation (the 50th occurrence of "neural" doesn't add 50× the signal) and length normalization (longer docs aren't unfairly favored). Formula:

$$ \text{BM25}(q, d) = \sum_{t \in q} \text{idf}(t) \cdot \frac{f(t, d)(k_1 + 1)}{f(t, d) + k_1 (1 - b + b \cdot |d|/\overline{|d|})} $$

with typical k_1 = 1.2, b = 0.75. This is what Elasticsearch / OpenSearch / Lucene actually use. Phase 7 covers hybrid search (BM25 + dense).

4.4 References

Manning, Raghavan, Schütze, Introduction to Information Retrieval — free at nlp.stanford.edu/IR-book. Chapters 1, 6, 7.
Robertson & Zaragoza (2009), The Probabilistic Relevance Framework: BM25 and Beyond.

5. Concept 4 — PyTorch and Autograd

5.1 Tensors

Think of a torch.Tensor as np.ndarray + (a) GPU dispatch and (b) automatic differentiation. The shape and dtype rules are nearly identical.

import torch
x = torch.zeros(3, 4)              # shape (3, 4), float32
x = x.to("cuda")                   # device move
x = x.to(torch.bfloat16)           # dtype cast
y = x[:, :2]                       # view; shares memory
z = x.contiguous().view(12)        # reshape; may copy

Strides matter: a transpose returns a view with non-contiguous strides; some ops require .contiguous() first.

5.2 Autograd — the only paragraph that matters

When you do y = f(x) with x.requires_grad=True, PyTorch builds a computational graph of all intermediate ops. When you call y.backward(), it walks the graph in reverse and fills .grad on every leaf tensor that participates. That's it. This is just the chain rule executed automatically:

If L = f(g(h(x))) then dL/dx = f'(g(h(x))) · g'(h(x)) · h'(x).

PyTorch records each f, g, h and replays the derivatives backward.

x = torch.tensor(2.0, requires_grad=True)
y = (x ** 3 + 2 * x).sin()         # y = sin(x³ + 2x)
y.backward()                        # populates x.grad
print(x.grad)                       # cos(12) * (3*4 + 2) = cos(12) * 14

Key APIs:

loss.backward() — compute gradients.
optimizer.zero_grad() — clear them before the next step (gradients accumulate by default).
optimizer.step() — apply the update.
with torch.no_grad(): — disable graph building (use in eval / inference). Saves memory.
tensor.detach() — return a tensor that shares storage but is excluded from autograd.

5.3 The canonical training loop

model = MyModel().to(device)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
for epoch in range(N):
    for batch in loader:
        opt.zero_grad()
        logits = model(batch.x.to(device))
        loss = F.cross_entropy(logits, batch.y.to(device))
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        opt.step()

Memorize this. Every script in this curriculum is a variation on these 7 lines.

5.4 What `nn.Module` actually is

A class that registers parameters (nn.Parameter) and submodules so that .parameters() recursively collects everything. State dict (model.state_dict()) is just a flat OrderedDict of all params + buffers — that's how saving/loading works.

5.5 References

PyTorch Tutorials — start with Deep Learning with PyTorch: A 60 Minute Blitz.
Deep Learning with PyTorch, Stevens, Antiga, Viehmann (Manning).
Andrej Karpathy's Neural Networks: Zero to Hero — the micrograd lecture builds autograd from scratch in 100 lines. Mandatory.
PyTorch internals by Edward Yang.

6. How the labs in this phase exercise these ideas

Lab	Reinforces
`lab-01-tokenization-from-scratch`	Concepts 1, 2.5 (regex), Python data structures, dataclasses, byte-level encoding
Lab 02 (TF-IDF) — spec only	Concepts 4, sparse matrices (SciPy CSR), L2 normalization
Lab 03 (similarity playground) — spec only	Concept 3.1, dimensionality intuition
Lab 04 (PyTorch essentials) — spec only	Concept 5, autograd, training loop

For Lab 01 specifically, when you read the solution, ask yourself:

Why does RegexTokenizer need that exact regex (and not just \w+)?
What happens if I encode an emoji with WhitespaceTokenizer vs ByteLevelTokenizer?
Why is the byte-level vocab exactly 256 before adding merges?
How would I extend this to BPE (Phase 4 will)?

7. Common interview questions on Phase 1 material

Walk me through what .backward() does.
Why is byte-level tokenization useful? Are there downsides?
Why do BPE models sometimes count the letters in "strawberry" wrong?
What's the difference between cosine similarity and dot product? When are they equivalent?
Why is H(p, q) = -Σ p log q the right loss for classification?
What does with torch.no_grad(): do — and why does it matter for memory?
Implement TF-IDF on a whiteboard.
What's the time complexity of self-attention in sequence length T? Why does that matter?
Explain the difference between a view and a copy in PyTorch.
Given a (B, T, C) tensor, how do you compute per-batch row-wise softmax with einsum / broadcasting?

8. Going from solid → exceptional

After Phase 1, most candidates can do TF-IDF and write a training loop. To stand out:

Implement BPE end-to-end (training + encoding) before anyone tells you to. Compare your output to tiktoken byte-for-byte.
Read the GPT-2 tokenizer source in tiktoken and the original OpenAI repo. Understand the byte-to-unicode mapping (bytes_to_unicode() function — it's a clever hack to keep BPE in printable Unicode).
Read micrograd by Karpathy (~150 lines). You should be able to reimplement it from scratch in 1 hour by the end of the phase.
Profile a training step with torch.profiler and identify the kernel-launch overhead vs compute.
Write a 1-page essay on "Why does tokenization shape model capability?" — using examples from the literature.

9. Recommended weekly cadence

Day	Activity
Mon	Watch 3Blue1Brown linear algebra videos 1–5; read Phase 1 README
Tue	Watch micrograd lecture; reimplement micrograd from blank file
Wed	Lab 01 (tokenization) — solve `lab.py` without looking at `solution.py`
Thu	Lab 02 (TF-IDF) — write your own; compare to sklearn
Fri	Lab 03 (similarity playground) + Lab 04 (PyTorch)
Sat	Read Karpathy tokenizer video (2 hours) — fill any gaps
Sun	Quiz yourself on the 10 interview questions; write answers in your own words

Move on to Phase 2 only when you can write the BPE training loop, the TF-IDF formula, and a PyTorch training loop on a whiteboard with no reference.

AI Engineer — Role-Based Learning Hub