🛸 Hitchhiker's Guide — Phase 2: Classical NLP & Word Embeddings
Read this if: You can write a TF-IDF index but you don't yet feel in your bones why
king − man + woman ≈ queenfalls out of word2vec, or why "softmax over the whole vocabulary is too expensive" is the historical pivot that led to negative sampling.
0. The 30-second mental model
A word embedding is a learned dense vector that captures meaning by co-occurrence statistics. The training signal is: "predict context from word" (or vice versa). After enough data, vectors of similar words cluster together — and useful linear structure emerges (analogies). This is the conceptual ancestor of token embeddings inside transformers.
By the end of Phase 2 you should:
- Know the distributional hypothesis and why it makes sense.
- Be able to derive Skip-gram with negative sampling from scratch.
- Understand the difference between count-based (PPMI, GloVe) and predict-based (word2vec) embeddings, and the surprising 2014 result that they're closely related.
- Know how to evaluate an embedding (intrinsic vs extrinsic).
- Understand why static embeddings were superseded by contextual embeddings (ELMo → BERT) and where they are still used today (retrieval, recommender systems, cold start).
1. The Distributional Hypothesis
"You shall know a word by the company it keeps." — J. R. Firth, 1957
If two words appear in similar contexts, they probably mean similar things. That's the entire premise. Make a giant matrix M ∈ ℝ^{V × V} where M[i, j] = "how often word i appears near word j" — the rows are word representations. The rest of the field is "how do we make this matrix smaller and better".
1.1 Three classes of word representations
- Count-based: build the co-occurrence matrix; reduce dimensionality (SVD on PPMI). Examples: LSA, HAL, PPMI+SVD.
- Predict-based: train a neural model whose weights become the word vectors. Examples: word2vec, GloVe (hybrid), FastText.
- Contextual: embeddings depend on the sentence around the word. Examples: ELMo, BERT, every modern LLM. Phase 4+.
1.2 PPMI — the bridge between counting and predicting
Pointwise Mutual Information: pmi(w, c) = log P(w, c) / (P(w) P(c)). Positive PMI: clip negatives to 0. The famous Levy & Goldberg (2014) result is that Skip-gram with negative sampling implicitly factorizes a shifted PMI matrix. So the seemingly different paradigms compute almost the same thing under the hood.
2. Word2Vec — Skip-Gram with Negative Sampling
This is the core of Lab 01. Internalize it.
2.1 The Skip-Gram task
Given a center word w_c (e.g., "neural"), predict its surrounding context words w_o within a window (e.g., the 5 words before and after). The model parameters are two embedding matrices:
- Input embeddings
V ∈ ℝ^{|vocab| × d}—V[w]is the vector whenwis the center. - Output embeddings
U ∈ ℝ^{|vocab| × d}—U[w]is the vector whenwis a context.
Probability that w_o is a context for w_c:
$$ P(w_o \mid w_c) = \frac{\exp(U_{w_o}^\top V_{w_c})}{\sum_{w} \exp(U_w^\top V_{w_c})} $$
This denominator sums over the entire vocabulary at every step. That's prohibitive (vocab can be millions). Two historical fixes:
- Hierarchical softmax — replace the flat softmax with a binary tree (Huffman code) so each prediction is a sequence of
log Vbinary choices. O(log V). - Negative sampling — don't compute the partition function at all; turn it into a binary classification task.
2.2 Negative sampling derivation
For each true (center, context) pair (w_c, w_o), sample K "negative" context words w_neg ~ P_n(w). Train a logistic regression: predict 1 for the true pair, 0 for each negative.
Loss for a single positive example with K negatives:
$$ \mathcal{L} = -\log \sigma(U_{w_o}^\top V_{w_c}) - \sum_{k=1}^K \mathbb{E}{w_k \sim P_n}\left[\log \sigma(-U{w_k}^\top V_{w_c})\right] $$
where σ is the sigmoid. Notice: each gradient step touches only K + 1 rows of U instead of all |vocab|. That's the speedup.
The negative distribution is the unigram raised to 0.75:
$$ P_n(w) \propto f(w)^{0.75} $$
This empirical heuristic dampens very frequent words and boosts rare ones. The 0.75 is mostly folklore (Mikolov et al. tried a few values and it worked).
2.3 Subsampling frequent words
Mikolov also discards each occurrence of word w with probability:
$$ P_\text{discard}(w) = 1 - \sqrt{\frac{t}{f(w)}} $$
with t ≈ 1e-5. This removes noise from "the", "and", etc., yielding both faster training and higher-quality vectors.
2.4 Why analogies work (linear structure)
The famous king − man + woman ≈ queen is a consequence of how the model encodes multiple, additive semantic axes. If "royal-ness" and "gender" are roughly orthogonal directions in the embedding space, then subtracting "man" from "king" removes the gender component and adding "woman" restores it as female. Levy & Goldberg (2014) and Arora et al. (2016) explain this rigorously; the short version: log-bilinear models produce vectors whose inner products approximate PMI, and PMI has linear-additive structure for many semantic features.
2.5 Two main objectives — Skip-Gram vs CBOW
- Skip-Gram: predict context given center. Better for rare words.
- CBOW (Continuous Bag of Words): predict center given averaged context. Faster to train.
Skip-gram with negative sampling won historically.
2.6 References
- Mikolov, Sutskever, Chen, Corrado, Dean (2013), Distributed Representations of Words and Phrases and their Compositionality — the SGNS paper.
- Mikolov, Chen, Corrado, Dean (2013), Efficient Estimation of Word Representations in Vector Space.
- Levy & Goldberg (2014), Neural Word Embedding as Implicit Matrix Factorization.
- Goldberg's chapter word2vec Explained (free).
- Goldberg, Neural Network Methods for Natural Language Processing (Morgan & Claypool) — the best textbook for this era.
3. GloVe and FastText (the cousins)
3.1 GloVe
Pennington, Socher, Manning (2014) at Stanford. Trains on the logarithm of the co-occurrence matrix directly with a weighted least-squares loss:
$$ \mathcal{L} = \sum_{i, j} f(X_{ij}) \left(w_i^\top \tilde{w}_j + b_i + \tilde{b}j - \log X{ij}\right)^2 $$
The weighting f(X_{ij}) damps very frequent pairs. GloVe sits philosophically between count-based and predict-based methods.
3.2 FastText
Bojanowski, Grave, Joulin, Mikolov (2017). Each word is represented as the sum of its character n-gram vectors. So "where" = <wh + whe + her + ere + re> + <where>. Two huge wins:
- OOV handling: any new word can be embedded by summing its n-grams.
- Morphology: Inflected forms (run/runs/running/ran) share n-grams and thus geometry.
FastText is still a great default for non-English languages (Arabic, Finnish, Turkish) and tasks where a small model + cold-start handling matters.
3.3 References
- Pennington, Socher, Manning (2014), GloVe: Global Vectors for Word Representation.
- Bojanowski et al. (2017), Enriching Word Vectors with Subword Information.
4. Sentence and Document Embeddings
A single vector per word doesn't help if you want to retrieve passages. Three eras:
- Average / weighted-average of word vectors (Arora SIF). Dirt simple, surprisingly effective baseline.
- InferSent / Universal Sentence Encoder — supervised on NLI / multitask data.
- Contrastive sentence transformers (Phase 7 will use these): SBERT, E5, BGE, Cohere embed-v3, OpenAI text-embedding-3. Trained with triplet loss or InfoNCE on (query, positive, negative) pairs.
The key idea connecting Phases 2 → 7: a contrastive loss is just negative sampling on sentence pairs. The math is the same; the unit is bigger.
Read: Reimers & Gurevych (2019), Sentence-BERT. Wang et al. (2022), Text Embeddings by Weakly-Supervised Contrastive Pre-training (E5).
5. Evaluating Embeddings
5.1 Intrinsic
- Word similarity: human-rated pairs (WordSim-353, SimLex-999). Score: Spearman correlation between cosine and human ratings.
- Analogies: Google analogy set (
king:man :: queen:?). Score: top-1 accuracy onarg max_w cos(w, b - a + c). - Clustering coherence: do nearest neighbors of "Java" all relate to programming or to coffee?
5.2 Extrinsic
- Plug embeddings into downstream tasks (sentiment classification, NER, retrieval) and measure end-task metric.
- Extrinsic almost always wins as the truth — intrinsic benchmarks can be gamed.
5.3 Bias auditing
Bolukbasi et al. (2016), Man is to Computer Programmer as Woman is to Homemaker? showed word2vec encodes gender bias along measurable axes. This is a recurring topic in safety interviews.
6. Dimensionality Reduction & Visualization
To inspect a 300-dim embedding space, project to 2D:
- PCA: linear, fast. Use as a first glance.
- t-SNE (van der Maaten & Hinton, 2008): nonlinear, preserves local neighborhoods. Notoriously misleading at the global scale (clusters far apart in t-SNE may be close in reality).
- UMAP (McInnes, Healy, Melville, 2018): nonlinear, faster than t-SNE, preserves more global structure. Default in 2024+.
Always plot a known-labeled subset (countries, animals, programming languages) to sanity-check that semantic clusters appear.
7. Lab 01 walkthrough (lab-01-word2vec-from-scratch)
The lab implements Skip-Gram + negative sampling end to end. Things to internalize while reading the solution:
- Vocab construction — frequency cutoff, then index assignment. Output a dict
{word: id}and its inverse. - Subsampling — apply Mikolov's discard rule per occurrence.
- Iterable dataset — yields
(center, positive_context, negatives[K])tuples. TheIterableDatasetpattern is essential for streaming corpora that don't fit in memory. - Negative sampling distribution — pre-compute a frequency-table
^0.75once; sample by binary search on cumulative. - Forward:
score = sigmoid(U[w_o] · V[w_c]). Loss =BCEon the binary labels. - Two embedding matrices: input and output. The "word vector" you keep at inference is the input matrix
V(or sum of both — Mikolov usedVonly). - Nearest-neighbor demo — cosine over a
(V, d)matrix → top-k.
Things you should be able to explain afterwards:
- Why two embedding matrices and not one?
- What happens if
K = 0? (Degenerates; can't learn discrimination.) - Why is
^0.75a "good enough" hack? - How would you make this run in a few minutes on a single GPU? (Bigger batches, fewer epochs, smaller window.)
8. Common interview questions on Phase 2 material
- Derive Skip-gram with negative sampling on the whiteboard.
- Why does
king − man + woman ≈ queenwork? - What's the difference between word2vec, GloVe, and FastText?
- Why is the unigram raised to 0.75 in negative sampling?
- How would you handle a new word that wasn't in your vocab? (FastText answer.)
- What's PMI, and how does it relate to word2vec?
- Compare static embeddings (word2vec) to contextual ones (BERT). When would you still use the former?
- How do you evaluate an embedding model?
- What's the complexity of a softmax over a 1M-word vocab? How do hierarchical softmax and negative sampling help?
- Explain the connection between negative sampling and the modern InfoNCE / contrastive loss.
9. From solid → exceptional
- Reimplement word2vec with NEG-K loss in pure NumPy (no PyTorch). It's
~150 lines. - Train on a 1 GB Wikipedia dump; evaluate on Google analogies; report both accuracy and training time.
- Re-derive the gradient updates for SGNS by hand. Confirm against autograd.
- Read the Levy & Goldberg paper and explain in your own words why SGNS factorizes shifted PMI.
- Train fastText on Arabic Wikipedia and show it handles morphology via n-gram averaging.
- Compare nearest neighbors in your word2vec space with those from BGE — note the differences (BGE captures factual relatedness; word2vec captures distributional similarity).
10. Recommended cadence
| Day | Activity |
|---|---|
| Mon | Read Goldberg's word2vec Explained + Mikolov 2013 paper |
| Tue | Read Levy & Goldberg 2014 (PMI factorization result) |
| Wed | Lab 01 — implement SGNS without looking at solution |
| Thu | Train word2vec on text8; evaluate on analogies; visualize with UMAP |
| Fri | Skim GloVe and FastText papers; compare to your implementation |
| Sat | Read SBERT paper (preview of Phase 7) |
| Sun | Practice the 10 interview questions out loud |