Phase 2 — Classical NLP & Static Embeddings
Difficulty: ⭐⭐⭐☆☆ | Estimated Time: 1.5 weeks
Roles supported: Pretraining Data Engineer, Research Engineer, Foundation Model Engineer.
Why This Phase Exists
Static embeddings (Word2Vec, GloVe, FastText) are the conceptual ancestors of the embedding models used in RAG and retrieval, and of the token-embedding layer at the input of every LLM. Implementing them from scratch teaches you negative sampling, contrastive objectives, and embedding evaluation, ideas that reappear at scale in CLIP, sentence-transformers, and reward models.
You will leave this phase able to explain "what an embedding actually is" without hand-waving.
Concepts
- Distributional hypothesis
- CBOW vs Skip-gram
- Negative sampling derivation (a simplified form of noise-contrastive estimation that avoids computing the full softmax)
- Subsampling of frequent words
- Hierarchical softmax (overview)
- GloVe: co-occurrence matrix factorization
- FastText: subword n-grams, OOV handling
- Embedding evaluation: intrinsic (analogy, similarity) vs extrinsic (downstream task)
- Dimensionality reduction for visualization (t-SNE, UMAP)
- Anisotropy of embedding spaces
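
To make the negative-sampling concept above concrete, here is a minimal sketch of the loss for a single (center, context) pair with k sampled noise words. It uses plain Python lists instead of PyTorch tensors, and the function names and toy vectors are illustrative assumptions, not part of the labs.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v) -> float:
    return sum(a * b for a, b in zip(u, v))

def sgns_loss(center, context, negatives) -> float:
    """Negative-sampling loss for one (center, context) pair.

    center    : input-embedding vector of the center word
    context   : output-embedding vector of the observed context word
    negatives : output-embedding vectors of k sampled noise words
    """
    # Pull the observed pair together: -log sigma(u_context . v_center)
    loss = -math.log(sigmoid(dot(context, center)))
    # Push each noise word away: -log sigma(-u_negative . v_center)
    for neg in negatives:
        loss -= math.log(sigmoid(-dot(neg, center)))
    return loss

# Toy 2-d vectors, made up for illustration
v_center = [1.0, 0.0]
u_context = [2.0, 0.0]                 # aligned with the center word
u_negs = [[-2.0, 0.0], [0.0, 1.0]]     # opposed or orthogonal to it

good = sgns_loss(v_center, u_context, u_negs)
# Swap roles: treat an opposed vector as the "true" context
bad = sgns_loss(v_center, u_negs[0], [u_context, u_negs[1]])
print(good < bad)
```

Each pair costs k + 1 sigmoid evaluations rather than a normalization over the whole vocabulary, which is the practical reason the objective scales.
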
Labs
Lab 01 — Word2Vec Skip-Gram From Scratch (NumPy + PyTorch)
| Field | Value |
|---|---|
| Goal | Train skip-gram with negative sampling on text8 and recover semantic structure. |
| Concepts | Skip-gram objective, negative sampling, subsampling, vocab construction, embedding lookup. |
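| Steps | 1) Build vocab + frequency table from text8. 2) Subsample frequent words (Mikolov et al. subsampling formula). 3) Generate (center, context) pairs and sample negatives for each. 4) Define two nn.Embedding tables, one for input (center) and one for output (context) vectors. 5) Implement the negative-sampling loss (log-sigmoid over positive and negative pairs). 6) Train ~5 epochs on text8. 7) Find nearest neighbors by cosine similarity. |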
| Stack | PyTorch, NumPy |
| Datasets | text8 (100 MB cleaned Wikipedia) |
| Output | A vectors.bin file; nearest-neighbor demo (king, paris, python); analogy demo (king - man + woman ≈ queen). |
| How to Test | WordSim-353 Spearman correlation > 0.55; analogy accuracy > 30% on Google analogy set. |
| Talking Points | Why negative sampling works (NCE approximation). Why subsample frequent words. Why use two embedding matrices (input/output). |
| Resume Bullet | "Implemented skip-gram with negative sampling from scratch in PyTorch, trained on text8 (100M characters, roughly 17M tokens), achieving 0.61 WordSim-353 Spearman and 38% accuracy on the Google analogy benchmark." |
| Extensions | Add CBOW; add subword n-grams (FastText); analyze gender-bias direction via PCA. |
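
Steps 2 and 3 of Lab 01 can be sketched in plain Python as below. This is a toy illustration under stated assumptions: the keep probability follows the formula as published in the Word2Vec paper (the reference C implementation uses a slightly different variant), the helper names are hypothetical, and the threshold `t` is raised far above the usual 1e-5 only because the example corpus has eight tokens.

```python
import math
import random
from collections import Counter

def keep_probability(freq: float, t: float = 1e-5) -> float:
    """Paper formula: P(discard) = 1 - sqrt(t / f), so P(keep) = sqrt(t / f), capped at 1."""
    return min(1.0, math.sqrt(t / freq))

def noise_distribution(counts: Counter, power: float = 0.75) -> dict:
    """Unigram counts raised to the 3/4 power, renormalized to sum to 1."""
    weights = {w: c ** power for w, c in counts.items()}
    norm = sum(weights.values())
    return {w: x / norm for w, x in weights.items()}

def skipgram_pairs(tokens, window: int = 2):
    """Yield (center, context) pairs within a symmetric window."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

tokens = "the cat sat on the mat the end".split()
counts = Counter(tokens)
total = sum(counts.values())
rng = random.Random(0)  # fixed seed so runs are repeatable

# Subsample: frequent words ("the") are dropped more often than rare ones
kept = [w for w in tokens
        if rng.random() < keep_probability(counts[w] / total, t=0.05)]

# Positive pairs from the raw tokens, plus 5 negatives from the noise distribution
pairs = list(skipgram_pairs(tokens, window=1))
noise = noise_distribution(counts)
negatives = rng.choices(list(noise), weights=list(noise.values()), k=5)
```

The 3/4 power flattens the unigram distribution, so rare words are drawn as negatives somewhat more often than their raw frequency would suggest.
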
Lab 02 — GloVe & FastText (Hands-On)
| Field | Value |
|---|---|
| Goal | Implement GloVe co-occurrence loss; use pretrained FastText to handle OOV. |
| Concepts | Co-occurrence matrix, weighted least-squares loss, subword n-grams, OOV via character n-grams. |
| Steps | 1) Build sparse co-occurrence matrix with windowed counts. 2) Implement weighted MSE loss. 3) Train on a 10M-token slice. 4) Compare embeddings to skip-gram on the same corpus. 5) Load pretrained FastText; query OOV (covid, transformer, made-up words). |
| Stack | PyTorch, scipy.sparse, gensim (for FastText load only) |
| Output | Comparison table: skip-gram vs GloVe vs FastText on WordSim + analogy. |
| How to Test | Same intrinsic eval suite. |
| Talking Points | Why GloVe's loss is a weighted MSE. Why FastText handles OOV. Why none of these handle polysemy (motivates contextual embeddings → Phase 3). |
| Resume Bullet | "Benchmarked three static-embedding methods (Skip-gram, GloVe, FastText) on a controlled 10M-token corpus, producing a reproducible report on intrinsic-eval tradeoffs and OOV behavior." |
| Extensions | Quantitatively measure anisotropy (Ethayarajh 2019). |
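
Two pieces of Lab 02 are small enough to sketch directly: the GloVe weighted least-squares loss over nonzero co-occurrence counts, and FastText-style character n-grams, which are what let an out-of-vocabulary word still receive a vector. The sketch uses plain Python and a toy dictionary in place of scipy.sparse; the helper names and toy values are assumptions, and the defaults `x_max = 100` and `alpha = 0.75` are the values reported in the GloVe paper.

```python
import math

def glove_weight(x: float, x_max: float = 100.0, alpha: float = 0.75) -> float:
    """Weighting f(x): grows as (x / x_max)^alpha and is capped at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(cooc, W, W_ctx, b, b_ctx) -> float:
    """Sum over nonzero X_ij of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2."""
    total = 0.0
    for (i, j), x in cooc.items():
        pred = sum(a * c for a, c in zip(W[i], W_ctx[j])) + b[i] + b_ctx[j]
        total += glove_weight(x) * (pred - math.log(x)) ** 2
    return total

def char_ngrams(word: str, n_min: int = 3, n_max: int = 6):
    """Character n-grams of '<word>' with boundary markers.

    FastText also keeps the whole word as its own unit; that is omitted here.
    """
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

# Toy sparse co-occurrence counts: {(word_id, context_id): count}
cooc = {(0, 1): 10.0, (1, 0): 10.0, (0, 2): 200.0}
W = [[0.1, 0.2], [0.0, 0.3], [0.5, -0.1]]        # word vectors
W_ctx = [[0.2, 0.1], [0.3, 0.0], [-0.1, 0.4]]    # context vectors
b = [0.0, 0.0, 0.0]
b_ctx = [0.0, 0.0, 0.0]

loss = glove_loss(cooc, W, W_ctx, b, b_ctx)
grams = char_ngrams("covid", 3, 4)
print(grams)
```

Summing only over stored entries is what makes the sparse matrix workable: zero counts contribute nothing, and log X_ij is never evaluated at zero. An OOV word's vector is then built from the vectors of its n-grams, many of which were seen during training.
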
Lab 03 — Embedding Evaluation & Visualization
| Field | Value |
|---|---|
| Goal | Build a reusable embedding-evaluation harness used throughout later phases. |
| Concepts | WordSim-353, SimLex-999, Google analogy, MTEB overview, t-SNE/UMAP. |
| Steps | 1) Load WordSim/SimLex/analogy datasets. 2) Implement Spearman + analogy accuracy. 3) Plot 2D t-SNE/UMAP of 5k most-frequent words. 4) Highlight country-capital pairs. |
| Stack | PyTorch, scikit-learn (t-SNE), umap-learn |
| Output | An eval_embeddings.py module + a side-by-side visualization plot. |
| How to Test | Run on known-good pretrained vectors (glove.6B.300d); reproduce published numbers within 1%. |
| Talking Points | Why intrinsic eval correlates poorly with downstream task performance. The shift to MTEB for sentence embeddings. |
| Resume Bullet | "Built a reusable embedding-evaluation harness covering WordSim/SimLex/Google-analogy + t-SNE visualization; reproduced published GloVe-300d numbers within 1%." |
| Extensions | Extend to MTEB-lite (3 sentence-level tasks), which is reused in Phase 7 to select RAG embeddings. |
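
Step 2 of Lab 03 can be sketched as follows: Spearman correlation computed as the Pearson correlation of ranks, and analogy solving with the common vector-offset (3CosAdd) rule. This is a dependency-free illustration with made-up toy vectors and hypothetical function names; a real harness would more likely call scipy.stats.spearmanr and L2-normalize the vectors first.

```python
import math

def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y) -> float:
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y) -> float:
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def cosine(u, v) -> float:
    num = sum(a * b for a, b in zip(u, v))
    return num / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def solve_analogy(emb, a, b, c):
    """'a is to b as c is to ?': nearest word to b - a + c, excluding a, b, c."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (w for w in emb if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Toy similarity data: human ratings vs model cosine scores for 4 word pairs
human = [9.0, 7.5, 3.0, 1.0]
model = [0.8, 0.9, 0.2, 0.1]
rho = spearman(human, model)

# Toy 3-d embeddings with hand-picked axes (person, royal, female)
emb = {
    "man": [1.0, 0.0, 0.0], "woman": [1.0, 0.0, 1.0],
    "king": [1.0, 1.0, 0.0], "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, 1.0],
}
answer = solve_analogy(emb, "man", "king", "woman")
print(round(rho, 3), answer)
```

Excluding the three query words from the candidates is standard practice in analogy benchmarks, and it materially affects reported accuracy, so the harness should make that choice explicit.
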
Deliverables Checklist
- Skip-gram trained on text8 with WordSim-353 Spearman > 0.55
- GloVe + FastText comparison report
- Embedding eval harness reusable in later phases
- t-SNE / UMAP visualization
Interview Relevance
- "Explain negative sampling."
- "Why are static embeddings insufficient for modern NLP?"
- "How would you evaluate an embedding model for a RAG system?" (sets up Phase 7)