Phase 2 — Classical NLP & Static Embeddings

Difficulty: ⭐⭐⭐☆☆ | Estimated Time: 1.5 weeks | Roles supported: Pretraining Data Engineer, Research Engineer, Foundation Model Engineer.


Why This Phase Exists

Static embeddings (Word2Vec, GloVe, FastText) are the conceptual ancestors of the embedding models used in RAG and retrieval, and of the token-embedding layer at the input of every LLM. Implementing them from scratch teaches you negative sampling, contrastive objectives, and embedding evaluation, all of which reappear at scale in CLIP, sentence-transformers, and reward models.

You will leave this phase able to explain "what an embedding actually is" without hand-waving.


Concepts

  • Distributional hypothesis
  • CBOW vs Skip-gram
  • Negative sampling derivation (and how it replaces the full softmax with a cheaper, NCE-style objective)
  • Subsampling of frequent words
  • Hierarchical softmax (overview)
  • GloVe: co-occurrence matrix factorization
  • FastText: subword n-grams, OOV handling
  • Embedding evaluation: intrinsic (analogy, similarity) vs extrinsic (downstream task)
  • Dimensionality reduction for visualization (t-SNE, UMAP)
  • Anisotropy of embedding spaces
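
To make the negative-sampling item above concrete, here is the per-pair objective next to the full-softmax loss it replaces. The notation is an assumption introduced for this sketch, not something defined elsewhere in this phase: v_c is the input vector of the center word, u_o the output vector of the observed context word, V the vocabulary, K the number of negatives, P_n the noise distribution (the original Word2Vec paper uses the unigram distribution raised to the 3/4 power), and σ the logistic sigmoid.

```latex
% Full-softmax skip-gram loss: the denominator sums over the whole vocabulary V.
J_{\text{softmax}}(c, o)
  = -\log \frac{\exp\left(u_o^\top v_c\right)}
               {\sum_{w \in V} \exp\left(u_w^\top v_c\right)}

% Negative-sampling loss: one positive and K sampled negatives,
% each scored as an independent binary classification.
J_{\text{NEG}}(c, o)
  = -\log \sigma\left(u_o^\top v_c\right)
    - \sum_{k=1}^{K} \mathbb{E}_{w_k \sim P_n}
      \left[ \log \sigma\left(-u_{w_k}^\top v_c\right) \right]
```

The practical payoff is cost: each update touches K + 1 output vectors instead of all |V| of them.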

Labs

Lab 01 — Word2Vec Skip-Gram From Scratch (NumPy + PyTorch)

Field | Value
Goal | Train skip-gram with negative sampling on text8 and recover semantic structure.
Concepts | Skip-gram objective, negative sampling, subsampling, vocab construction, embedding lookup.
Steps | 1) Build vocab + frequency table from text8. 2) Subsample frequent words (Mikolov formula). 3) Generate (center, context) pairs and sample negatives for each. 4) Define two nn.Embedding tables (input and output). 5) Implement the negative-sampling loss (log-sigmoid of the positive score and of the negated negative scores). 6) Train ~5 epochs on text8. 7) Find nearest neighbors.
Stack | PyTorch, NumPy
Datasets | text8 (first 100 MB of cleaned English Wikipedia, roughly 17M tokens)
Output | A vectors.bin file; nearest-neighbor demo (king, paris, python); analogy demo (king - man + woman ≈ queen).
How to Test | WordSim-353 Spearman correlation > 0.55; analogy accuracy > 30% on Google analogy set.
Talking Points | Why negative sampling works (a simplified form of NCE). Why subsample frequent words. Why use two embedding matrices (input/output).
Resume Bullet | "Implemented skip-gram with negative sampling from scratch in PyTorch, trained on text8 (~17M tokens), achieving 0.61 WordSim-353 Spearman and 38% accuracy on the Google analogy benchmark."
Extensions | Add CBOW; add subword n-grams (FastText); analyze gender-bias direction via PCA.
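
Steps 2, 3, and 5 of this lab can be sketched in plain Python before moving to PyTorch. This is a minimal illustration under stated assumptions, not the lab's reference solution: the function names are hypothetical, the threshold t = 1e-5 and the 0.75 exponent follow the original Word2Vec paper, and the subsampling rule is the paper's formula (the released C code uses a slightly different expression).

```python
import math
import random
from collections import Counter


def keep_probability(count, total, t=1e-5):
    """Probability of keeping one occurrence of a word.

    Paper formula: discard with probability 1 - sqrt(t / f), where f is the
    word's relative frequency. Words with f <= t are always kept.
    """
    f = count / total
    return min(1.0, math.sqrt(t / f))


def subsample(tokens, counts, t=1e-5, rng=random):
    """Randomly drop occurrences of frequent words (step 2)."""
    total = sum(counts.values())
    return [w for w in tokens
            if rng.random() < keep_probability(counts[w], total, t)]


def negative_sampling_weights(counts, power=0.75):
    """Noise distribution for negatives: unigram counts ** power, normalised."""
    scaled = {w: c ** power for w, c in counts.items()}
    z = sum(scaled.values())
    return {w: s / z for w, s in scaled.items()}


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return min(x, 0.0) - math.log1p(math.exp(-abs(x)))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sgns_loss(center_vec, context_vec, negative_vecs):
    """Negative-sampling loss for one (center, context) pair (step 5).

    loss = -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)
    """
    loss = -log_sigmoid(dot(context_vec, center_vec))
    for neg in negative_vecs:
        loss -= log_sigmoid(-dot(neg, center_vec))
    return loss


# Tiny usage example on a toy corpus (hypothetical data).
tokens = "the cat sat on the mat the dog sat on the log".split()
counts = Counter(tokens)
weights = negative_sampling_weights(counts)
words = list(weights)
negatives = random.choices(words, weights=[weights[w] for w in words], k=5)
```

In the PyTorch version the same loss is typically computed in batches with `torch.nn.functional.logsigmoid` over scores from the two embedding tables; the arithmetic is unchanged.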

Lab 02 — GloVe & FastText (Hands-On)

Field | Value
Goal | Implement GloVe co-occurrence loss; use pretrained FastText to handle OOV.
Concepts | Co-occurrence matrix, weighted least-squares loss, subword n-grams, OOV via character n-grams.
Steps | 1) Build sparse co-occurrence matrix with windowed counts. 2) Implement weighted MSE loss. 3) Train on a 10M-token slice. 4) Compare embeddings to skip-gram on the same corpus. 5) Load pretrained FastText; query OOV (covid, transformer, made-up words).
Stack | PyTorch, scipy.sparse, gensim (for FastText load only)
Output | Comparison table: skip-gram vs GloVe vs FastText on WordSim + analogy.
How to Test | Same intrinsic eval suite.
Talking Points | Why GloVe's loss is a weighted MSE. Why FastText handles OOV. Why none of these handle polysemy (motivates contextual embeddings → Phase 3).
Resume Bullet | "Benchmarked three static-embedding methods (Skip-gram, GloVe, FastText) on a controlled 10M-token corpus, producing a reproducible report on intrinsic-eval tradeoffs and OOV behavior."
Extensions | Quantitatively measure anisotropy (Ethayarajh 2019).
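
The core pieces of this lab fit in a few small functions. The sketch below is illustrative only, with hypothetical function names: it builds distance-weighted window counts (each pair d words apart contributes 1/d, as in the GloVe paper), applies the paper's weighting function with its default x_max = 100 and alpha = 0.75, evaluates the weighted squared error for one word pair, and extracts FastText-style character n-grams with the `<` and `>` boundary markers. A real implementation would store counts in `scipy.sparse` and train in PyTorch.

```python
import math
from collections import defaultdict


def cooccurrence_counts(tokens, window=2):
    """Symmetric windowed co-occurrence counts, weighted by 1 / distance."""
    counts = defaultdict(float)
    for i, center in enumerate(tokens):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(tokens):
                break
            counts[(center, tokens[j])] += 1.0 / d
            counts[(tokens[j], center)] += 1.0 / d
    return counts


def glove_weight(x, x_max=100.0, alpha=0.75):
    """f(x): down-weights rare pairs and caps the weight of frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0


def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij, x_max=100.0, alpha=0.75):
    """Weighted squared error for one nonzero co-occurrence entry x_ij.

    loss = f(x_ij) * (w_i . w_j + b_i + b_j - log x_ij) ** 2
    """
    pred = sum(a * b for a, b in zip(w_i, w_j)) + b_i + b_j
    return glove_weight(x_ij, x_max, alpha) * (pred - math.log(x_ij)) ** 2


def char_ngrams(word, n_min=3, n_max=6):
    """FastText-style subword units: n-grams of the word wrapped in < and >."""
    marked = f"<{word}>"
    return [marked[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]


# Tiny usage example on a toy corpus (hypothetical data).
counts = cooccurrence_counts("the cat sat on the mat".split(), window=2)
trigrams = char_ngrams("where", n_min=3, n_max=3)
```

Because an unseen word still shares n-grams with known words, its vector can be assembled from those n-gram vectors. Note that this only works with a model file that includes the subword table (the Facebook `.bin` format); the plain-text `.vec` files contain whole-word vectors only.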

Lab 03 — Embedding Evaluation & Visualization

Field | Value
Goal | Build a reusable embedding-evaluation harness used throughout later phases.
Concepts | WordSim-353, SimLex-999, Google analogy, MTEB overview, t-SNE/UMAP.
Steps | 1) Load WordSim/SimLex/analogy datasets. 2) Implement Spearman + analogy accuracy. 3) Plot 2D t-SNE/UMAP of 5k most-frequent words. 4) Highlight country-capital pairs.
Stack | PyTorch, scikit-learn (t-SNE), umap-learn
Output | An eval_embeddings.py module + a side-by-side visualization plot.
How to Test | Run on known-good pretrained vectors (glove.6B.300d); reproduce published numbers within 1%.
Talking Points | Why intrinsic eval correlates poorly with downstream task performance. The shift to MTEB for sentence embeddings.
Resume Bullet | "Built a reusable embedding-evaluation harness covering WordSim/SimLex/Google-analogy + t-SNE visualization; reproduced published GloVe-300d numbers within 1%."
Extensions | Extend to MTEB-lite (3 sentence-level tasks); used in Phase 7 for RAG embedding selection.
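
Step 2 of this lab comes down to two small metrics. The sketch below is a dependency-free illustration with hypothetical function names and toy vectors: Spearman correlation computed as the Pearson correlation of tie-averaged ranks, a word-similarity score that skips out-of-vocabulary pairs, and analogy solving with the common 3CosAdd rule on unit-normalised vectors, excluding the three query words from the candidates. In the actual harness, `scipy.stats.spearmanr` is a reasonable replacement for the hand-rolled correlation.

```python
import math


def rank(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]


def word_similarity_score(vectors, pairs):
    """Spearman between cosine similarities and human scores.

    `pairs` holds (word1, word2, human_score); OOV pairs are skipped.
    """
    model, human = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            model.append(dot(unit(vectors[w1]), unit(vectors[w2])))
            human.append(score)
    return spearman(model, human)


def solve_analogy(vectors, a, b, c):
    """3CosAdd: the word closest to b - a + c, excluding a, b, and c."""
    normed = {w: unit(v) for w, v in vectors.items()}
    target = [y - x + z for x, y, z in zip(normed[a], normed[b], normed[c])]
    candidates = [w for w in normed if w not in (a, b, c)]
    return max(candidates, key=lambda w: dot(normed[w], target))


# Toy vectors (hypothetical): dimensions are roughly [male, female, royal].
toy = {
    "man": [1.0, 0.0, 0.0], "woman": [0.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0], "queen": [0.0, 1.0, 1.0],
    "prince": [1.0, 0.0, 0.5],
}
answer = solve_analogy(toy, "man", "king", "woman")
```

Analogy accuracy is then the fraction of benchmark questions for which `solve_analogy` returns the expected fourth word. Whether to skip or penalise questions containing OOV words is a design choice worth reporting alongside the score.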

Deliverables Checklist

  • Skip-gram trained on text8 with WordSim-353 Spearman > 0.55
  • GloVe + FastText comparison report
  • Embedding eval harness reusable in later phases
  • t-SNE / UMAP visualization

Interview Relevance

  • "Explain negative sampling."
  • "Why are static embeddings insufficient for modern NLP?"
  • "How would you evaluate an embedding model for a RAG system?" (sets up Phase 7)