Phase 2 — Classical NLP & Static Embeddings

Difficulty: ⭐⭐⭐☆☆ | Estimated Time: 1.5 weeks
Roles supported: Pretraining Data Engineer, Research Engineer, Foundation Model Engineer.

Why This Phase Exists

Static embeddings (Word2Vec, GloVe, FastText) are the conceptual ancestors of the embedding models used in RAG and retrieval, and of the token-embedding layer at the input of an LLM. Implementing them from scratch teaches you negative sampling, contrastive objectives, and embedding evaluation, all of which reappear at scale in CLIP, sentence-transformers, and reward models.

You will leave this phase able to explain "what an embedding actually is" without hand-waving.

Concepts

Distributional hypothesis
CBOW vs Skip-gram
Negative sampling derivation (a simplification of NCE that replaces the full softmax)
Subsampling of frequent words
Hierarchical softmax (overview)
GloVe: co-occurrence matrix factorization
FastText: subword n-grams, OOV handling
Embedding evaluation: intrinsic (analogy, similarity) vs extrinsic (downstream task)
Dimensionality reduction for visualization (t-SNE, UMAP)
Anisotropy of embedding spaces
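
For reference, the negative-sampling, subsampling, and GloVe items above can be written out as follows. This is a sketch in the notation commonly used for the Word2Vec and GloVe papers; the symbol names ($v_w$ for the center-word vector, $u_c$ for the context vector, $k$ negatives, noise distribution $P_n$, threshold $t$) are choices made here for illustration, not names defined elsewhere in this curriculum.

```latex
% Skip-gram with negative sampling: loss for one (center w, context c) pair
\mathcal{L}(w, c) = -\log \sigma\!\left(u_c^\top v_w\right)
  - \sum_{i=1}^{k} \mathbb{E}_{n_i \sim P_n}\!\left[\log \sigma\!\left(-u_{n_i}^\top v_w\right)\right],
\qquad P_n(w) \propto \mathrm{count}(w)^{3/4}

% Subsampling of frequent words: probability of discarding a token of word w
% with relative frequency f(w); t is a small threshold (around 10^{-5})
P_{\text{discard}}(w) = 1 - \sqrt{\frac{t}{f(w)}}

% GloVe: weighted least squares on log co-occurrence counts X_{ij}
J = \sum_{i,j} f(X_{ij}) \left(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2,
\qquad
f(x) = \begin{cases} (x / x_{\max})^{\alpha} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases}
```

The first expression replaces a softmax over the whole vocabulary with $k + 1$ binary classifications, which is why the cost of each update no longer depends on vocabulary size.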

Labs

Lab 01 — Word2Vec Skip-Gram From Scratch (NumPy + PyTorch)

Field	Value
Goal	Train skip-gram with negative sampling on `text8` and recover semantic structure.
Concepts	Skip-gram objective, negative sampling, subsampling, vocab construction, embedding lookup.
Steps	1) Build vocab + frequency table from text8. 2) Subsample frequent words (Mikolov formula). 3) Generate (center, context) pairs and sample negatives for each. 4) Define two `nn.Embedding` tables, one for input (center) and one for output (context) vectors. 5) Implement the sigmoid (negative-sampling) loss. 6) Train ~5 epochs on text8. 7) Find nearest neighbors.
Stack	PyTorch, NumPy
Datasets	text8 (100 MB cleaned Wikipedia)
Output	A `vectors.bin` file; nearest-neighbor demo (`king`, `paris`, `python`); analogy demo (`king - man + woman ≈ queen`).
How to Test	WordSim-353 Spearman correlation > 0.55; analogy accuracy > 30% on Google analogy set.
Talking Points	Why negative sampling works (NCE approximation). Why subsample frequent words. Why use two embedding matrices (input/output).
Resume Bullet	"Implemented skip-gram with negative sampling from scratch in PyTorch, trained on text8 (~17M tokens), achieving 0.61 WordSim-353 Spearman and 38% accuracy on the Google analogy benchmark."
Extensions	Add CBOW; add subword n-grams (FastText); analyze gender-bias direction via PCA.
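
The data-preparation and loss steps of Lab 01 can be sketched in plain Python before moving to PyTorch. This is a minimal illustration, not the lab's reference solution: the function names (`keep_probability`, `build_noise_distribution`, `skipgram_pairs`, `sgns_loss`) are hypothetical, the subsampling rule is the paper-style `sqrt(t / f)` form, and the toy threshold `t=0.2` is chosen only so that the six-token example drops something.

```python
import math
import random
from collections import Counter


def keep_probability(freq, t=1e-5):
    """Probability of keeping a token whose relative frequency is `freq`.

    Paper-style rule: discard with probability 1 - sqrt(t / freq).
    """
    return min(1.0, math.sqrt(t / freq))


def build_noise_distribution(counts, power=0.75):
    """Unigram counts raised to `power`, normalised to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    total = sum(weights.values())
    return {w: x / total for w, x in weights.items()}


def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs within a fixed symmetric window."""
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sgns_loss(v_center, u_context, u_negatives):
    """Negative-sampling loss for one positive pair and its negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    loss = -log_sigmoid(dot(u_context, v_center))
    for u_neg in u_negatives:
        loss -= log_sigmoid(-dot(u_neg, v_center))
    return loss


# Toy run on a six-token "corpus".
tokens = "the cat sat on the mat".split()
counts = Counter(tokens)
total = sum(counts.values())
rng = random.Random(0)

kept = [w for w in tokens
        if rng.random() < keep_probability(counts[w] / total, t=0.2)]
pairs = list(skipgram_pairs(tokens, window=1))
noise = build_noise_distribution(counts)
# Note: this sketch does not reject negatives that equal the true context.
negatives = rng.choices(list(noise), weights=list(noise.values()), k=5)
```

In the PyTorch version, `v_center` would come from the input embedding table and `u_context` / `u_negatives` from the output table, with the same loss computed over a batch.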

Lab 02 — GloVe & FastText (Hands-On)

Field	Value
Goal	Implement GloVe co-occurrence loss; use pretrained FastText to handle OOV.
Concepts	Co-occurrence matrix, weighted least-squares loss, subword n-grams, OOV via character n-grams.
Steps	1) Build sparse co-occurrence matrix with windowed counts. 2) Implement weighted MSE loss. 3) Train on a 10M-token slice. 4) Compare embeddings to skip-gram on the same corpus. 5) Load pretrained FastText; query OOV (`covid`, `transformer`, made-up words).
Stack	PyTorch, scipy.sparse, `gensim` (for FastText load only)
Output	Comparison table: skip-gram vs GloVe vs FastText on WordSim + analogy.
How to Test	Same intrinsic eval suite.
Talking Points	Why GloVe's loss is a weighted MSE. Why FastText handles OOV. Why none of these handle polysemy (motivates contextual embeddings → Phase 3).
Resume Bullet	"Benchmarked three static-embedding methods (Skip-gram, GloVe, FastText) on a controlled 10M-token corpus, producing a reproducible report on intrinsic-eval tradeoffs and OOV behavior."
Extensions	Quantitatively measure anisotropy (Ethayarajh 2019).
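
The three ingredients of Lab 02 (windowed co-occurrence counts, the weighted squared-error loss, and character n-grams for OOV words) can be sketched as below. This is a hedged, dependency-free illustration: the function names are hypothetical, the `1/d` distance weighting and the defaults `x_max=100`, `alpha=0.75` follow the GloVe paper as I understand it, and the n-gram helper omits the whole-word token and the hashing step that the real FastText implementation uses.

```python
import math
from collections import defaultdict


def cooccurrence_counts(tokens, window=5):
    """Symmetric windowed counts; a pair d tokens apart contributes 1/d."""
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            counts[(word, tokens[j])] += 1.0 / d
            counts[(tokens[j], word)] += 1.0 / d
    return counts


def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weights rare pairs and caps the influence of frequent ones."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error between the model score and log(x_ij)."""
    score = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij) * (score - math.log(x_ij)) ** 2


def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of the word wrapped in boundary markers < and >."""
    wrapped = f"<{word}>"
    return [wrapped[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(wrapped) - n + 1)]


# Toy usage.
cooc = cooccurrence_counts(["a", "b", "c"], window=2)
trigrams = char_ngrams("where", n_min=3, n_max=3)
```

An OOV word gets a vector by summing (or averaging) the vectors of its n-grams, which is why a made-up word that shares n-grams with known words still lands somewhere sensible, while a skip-gram or GloVe table has no entry for it at all.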

Lab 03 — Embedding Evaluation & Visualization

Field	Value
Goal	Build a reusable embedding-evaluation harness used throughout later phases.
Concepts	WordSim-353, SimLex-999, Google analogy, MTEB overview, t-SNE/UMAP.
Steps	1) Load WordSim/SimLex/analogy datasets. 2) Implement Spearman + analogy accuracy. 3) Plot 2D t-SNE/UMAP of 5k most-frequent words. 4) Highlight country-capital pairs.
Stack	PyTorch, scikit-learn (t-SNE), umap-learn
Output	An `eval_embeddings.py` module + a side-by-side visualization plot.
How to Test	Run on known-good pretrained vectors (`glove.6B.300d`); reproduce published numbers within 1%.
Talking Points	Why intrinsic eval correlates poorly with downstream task performance. The shift to MTEB for sentence embeddings.
Resume Bullet	"Built a reusable embedding-evaluation harness covering WordSim/SimLex/Google-analogy + t-SNE visualization; reproduced published GloVe-300d numbers within 1%."
Extensions	Extend to MTEB-lite (3 sentence-level tasks), used for RAG embedding selection in Phase 7.
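
The two intrinsic metrics in Lab 03 can be sketched without any dependencies, which is useful as a cross-check against `scipy.stats.spearmanr` once the real harness exists. This is an illustrative sketch, not the contents of `eval_embeddings.py`: the function names and the toy vectors are hypothetical, and the analogy solver uses the plain vector-offset method on unnormalised vectors (real evaluations usually unit-normalise the vectors first).

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    denom = math.sqrt(var_x * var_y)
    return cov / denom if denom else float("nan")


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def word_similarity_score(vectors, scored_pairs):
    """Spearman between human scores and cosine; pairs with OOV words are skipped."""
    human, model = [], []
    for w1, w2, score in scored_pairs:
        if w1 in vectors and w2 in vectors:
            human.append(score)
            model.append(cosine(vectors[w1], vectors[w2]))
    return spearman(human, model)


def solve_analogy(vectors, a, b, c):
    """Answer 'a is to b as c is to ?' with the word nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Toy 2-D vectors, invented purely for illustration.
toy = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "apple": [0.0, 5.0],
}
answer = solve_analogy(toy, "man", "woman", "king")
```

Excluding the three query words from the candidate set matters: without it, the nearest word to `b - a + c` is very often `b` or `c` itself, which is one reason reported analogy accuracies are sensitive to evaluation details.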

Deliverables Checklist

Skip-gram trained on text8 with WordSim-353 Spearman > 0.55
GloVe + FastText comparison report
Embedding eval harness reusable in later phases
t-SNE / UMAP visualization

Interview Relevance

"Explain negative sampling."
"Why are static embeddings insufficient for modern NLP?"
"How would you evaluate an embedding model for a RAG system?" (sets up Phase 7)
