LLM / Foundation-Model Engineer — Complete Learning Curriculum

Target Roles:

Research Engineer, Pretraining (Anthropic, OpenAI, DeepMind, Meta, Mistral, xAI)
LLM Infrastructure Engineer / ML Systems Engineer
Foundation Model Engineer
Post-training / Fine-tuning Engineer (RLHF, DPO, SFT)
LLM Inference Engineer (vLLM/TGI/TensorRT-LLM class work)
Model Evaluation Engineer
Pretraining Data Engineer
Applied AI / Production AI Engineer

Duration: 24 weeks core (6 months), extendable to 12 months for deep specialization
Goal: Reach interview-ready expertise with a portfolio competitive for senior LLM/foundation-model roles at frontier labs.

Why This Curriculum Exists

The hiring bar at frontier labs (Anthropic, OpenAI, DeepMind, Meta AI, Mistral, xAI, Cohere) is not "have you used ChatGPT" — it is "can you implement attention from scratch, debug a 64-GPU training run, profile a CUDA kernel, design a 100k-QPS inference gateway, and explain why DPO converges differently than PPO".

This curriculum is built backward from real job postings (referenced below) and is structured so that every lab maps to a real interview question or production system you would build on the job.

Reference Job Targets

Anthropic — Research Engineer, Pretraining (JD) → Phases 4, 5, 10, Capstone 1
Anthropic — Research Engineer, Production Model Post-Training → Phases 6, 8, Capstone 4
OpenAI — Research Engineer, Applied AI (JD) → Phases 7, 9, Capstones 2 & 3
Google DeepMind — Research Engineer, Gemini Latent Thinking → Phases 4, 5, 6, 8
Meta AI — Research / Production roles (Careers) → Phases 5, 9, 10

What You Will Build

By the end of this curriculum you will have shipped:

A working BPE tokenizer that matches GPT-2 output byte-for-byte
Word2Vec, attention, and a transformer block — all from scratch in NumPy and PyTorch
A nanoGPT-style model trained on a custom corpus (TinyStories or your own)
A LoRA / QLoRA fine-tuning pipeline on an open 7B model
A DPO preference-optimization run with reward analysis
A production-grade RAG system with hybrid retrieval, re-ranking, and an eval harness
An inference gateway with continuous batching, KV-cache, streaming, quantization, observability
A pretraining data pipeline with deduplication (MinHash), quality filtering (FastText/heuristics), and tokenization at scale
A multi-GPU training experiment using FSDP / DDP with mixed precision and gradient accumulation
An evaluation harness comparing base, fine-tuned, and RAG-augmented models on MMLU/HellaSwag/HumanEval-style tasks
A complete portfolio of 10+ GitHub repos with READMEs, benchmarks, diagrams, and ablations

Folder Structure

llm-inference-engineer/
├── README.md                              ← You are here (master roadmap)
├── phase-01-foundations-text/             ← Tokenization, BoW, TF-IDF, similarity, PyTorch
├── phase-02-classical-nlp-embeddings/     ← Word2Vec, GloVe, FastText, embedding eval
├── phase-03-rnns-language-modeling/       ← RNN/LSTM/GRU, char-LM, seq2seq, Bahdanau attention
├── phase-04-attention-transformers/       ← Self-attention, MHA, positional encodings, full transformer
├── phase-05-training-small-llms/          ← Mini-GPT, BPE, training loop, mixed precision, sampling
├── phase-06-finetuning-instruction/       ← SFT, LoRA/QLoRA, instruction data, RLHF/DPO
├── phase-07-rag-retrieval/                ← Vector DBs, hybrid search, re-ranking, agents/tool use
├── phase-08-evaluation-safety/            ← Eval harness, LLM-as-judge, red-teaming, benchmarks
├── phase-09-inference-optimization/       ← KV-cache, quantization, batching, vLLM/TGI, spec decoding
├── phase-10-distributed-production/       ← DDP/FSDP, pretraining data pipeline, observability
├── phase-11-capstone/                     ← 4 portfolio-grade end-to-end systems
├── system-design/                         ← LLM-specific system design walkthroughs
└── interview-prep/                        ← Concepts, coding, ML systems, behavioral

24-Week Schedule

Week	Phase	Focus
1	1	Python/PyTorch refresh, tokenization (regex → BPE intuition)
2	1	BoW, TF-IDF from scratch, cosine-similarity search
3	2	Word2Vec skip-gram from scratch (NumPy + PyTorch)
4	2	GloVe, FastText, embedding evaluation (analogies, WordSim)
5	3	RNN forward/backward by hand, char-level language model
6	3	LSTM/GRU, gradient flow, seq2seq with Bahdanau attention
7	4	Scaled dot-product attention from scratch + masking
8	4	Multi-head attention, positional encodings (sinusoidal, RoPE, ALiBi)
9	4	Full transformer block, encoder/decoder/decoder-only variants
10	5	BPE tokenizer matching GPT-2; nanoGPT architecture
11	5	Training loop, mixed precision, grad accumulation, checkpointing
12	5	Sampling: greedy, top-k, top-p, temperature, beam, contrastive
13	6	Supervised fine-tuning (SFT) on instruction data
14	6	LoRA + QLoRA on a 7B open model
15	6	Reward modeling, DPO/IPO/KTO preference optimization
16	7	Embedding pipelines, vector DBs (FAISS, pgvector, Qdrant)
17	7	Hybrid retrieval (BM25 + dense), re-ranking, RAG eval
18	7	Agents, tool use, structured outputs, function calling
19	8	Eval harness (lm-eval-harness style), MMLU/HellaSwag scoring
20	8	LLM-as-judge, RAGAS, red-teaming, safety filters
21	9	KV-cache deep dive, paged attention, continuous batching
22	9	Quantization (INT8, INT4, AWQ, GPTQ), speculative decoding
23	10	DDP/FSDP, ZeRO, pretraining data pipeline (dedup, filter, tokenize)
24	11	Capstone integration + interview prep review

Each Lab Structure

Every lab folder contains:

File	Purpose
`README.md`	Theory, math derivations, design rationale, interview Q&A, talking points
`lab.py`	Guided exercise with `# TODO` markers — you fill in the blanks
`solution.py`	Reference solution with inline commentary
`requirements.txt`	Pinned pip dependencies
`DATASETS.md`	Where applicable — download links and expected layout

Project Specification Template

Every non-trivial project in this curriculum is described with the same template, so you can lift any lab into a portfolio-ready repo:

Field	What it Captures
Project Title	Short, resume-friendly name
Goal	One sentence: what problem does this solve?
Concepts Learned	The 3–7 core ideas you internalize
Implementation Steps	Ordered checklist of what you build
Suggested Tech Stack	Libraries, frameworks, hardware tier
Dataset Suggestions	Specific datasets with sizes
Expected Output	Concrete artifact (model, plot, metric, server)
How to Test	Unit tests, sanity benchmarks, ablations
Interview Talking Points	Tradeoffs and design decisions to discuss
Resume Bullet Examples	Quantified achievement statements
Extensions	How to make the project portfolio-grade

The phase READMEs (phase-XX/README.md) instantiate this template for every lab.

Prerequisites

Python 3.10+
Comfort with backend / distributed systems (assumed background)
Basic linear algebra (matrix multiply, eigenvectors) — Phase 1 has a refresher
A Hugging Face account (free) for model + dataset access
Optional: Weights & Biases / Comet ML account for experiment tracking

Hardware Recommendations

Tier	Setup	Best For
Minimal	CPU laptop (16 GB RAM)	Phases 1–4, tiny models, NumPy from-scratch work
Mid	1× consumer GPU (RTX 3090/4090, 24 GB)	Phases 5–9, fine-tuning ≤7B with QLoRA
Recommended	1× A100 40 GB or 2× 4090	Phase 5 nanoGPT training, 16-bit LoRA SFT on 7B (full-parameter SFT of a 7B model typically needs sharding or offloading beyond 40 GB)
Cloud (cheap)	RunPod / Lambda / Vast.ai spot A100 — $1–2/hr	Phases 6, 9, 10 — pay only when training
Free tier	Google Colab T4, Kaggle P100	Almost all labs in scaled-down form

You do NOT need a GPU cluster. Every lab in this curriculum has a "small-model mode" that runs on Colab free tier. Capstones can be completed for under $50 of cloud GPU time.

System Design Philosophy

Every production-oriented lab (Phases 7, 9, 10) is evaluated on the same five axes that frontier-lab interviewers care about:

Throughput — tokens/sec at the system level (not just the model)
Latency — TTFT (time-to-first-token) and TPOT (time-per-output-token), P50/P99
Memory efficiency — KV-cache size, activation memory, parameter offloading
Cost — $/million-tokens served, $/training-run, GPU-hour utilization
Observability — request tracing, token-level metrics, drift detection, eval-in-production

Each capstone explicitly reports numbers on these axes.
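
The latency axis is easier to reason about with the arithmetic written out. The following is a minimal sketch, assuming each request is logged as a send time plus a list of per-token arrival timestamps; the helper names (`ttft`, `tpot`, `percentile`) and the trace values are hypothetical and not part of any lab, and the percentile uses the simple nearest-rank convention.

```python
import math


def ttft(request_start: float, token_times: list[float]) -> float:
    """Time-to-first-token: delay between sending the request and the first token."""
    return token_times[0] - request_start


def tpot(token_times: list[float]) -> float:
    """Time-per-output-token: mean gap between consecutive tokens after the first."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical trace: (request start, token arrival times), all in seconds.
trace = [
    (0.0, [0.20, 0.25, 0.30, 0.35]),
    (0.0, [0.50, 0.56, 0.62]),
    (1.0, [1.10, 1.14, 1.18, 1.22, 1.26]),
]
ttfts = [ttft(start, tokens) for start, tokens in trace]
tpots = [tpot(tokens) for _, tokens in trace]
print("P50 TTFT:", percentile(ttfts, 50), "P99 TTFT:", percentile(ttfts, 99))
print("P50 TPOT:", percentile(tpots, 50))
```

With only a handful of requests P99 collapses to the maximum, which is one reason a benchmark report should state the sample size next to each percentile.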

Phase-by-Phase Overview

Each phase has its own README.md with full lab specs, concept list, deliverables, and interview questions. Below is the index — click into the phase folder for depth.

Phase 1 — Foundations: Text, Math, PyTorch

Concepts: Tokenization (whitespace → regex → byte-level), bag-of-words, TF-IDF, cosine similarity, PyTorch tensors/autograd, broadcasting, CPU/GPU dispatch.
Difficulty: ⭐⭐☆☆☆ | Time: 1–2 weeks
Deliverables: From-scratch TF-IDF search engine over a Wikipedia subset; PyTorch tensor playground notebook.
Roles supported: All. This is the non-negotiable foundation.
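
To make the TF-IDF deliverable concrete, here is a minimal sketch of the idea in pure Python. It assumes whitespace tokenization and a smoothed IDF (one of several common conventions); the function names and the three toy documents are hypothetical, and the actual lab may structure this differently.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase + whitespace split; the lab covers better tokenizers.
    return text.lower().split()


def build_tfidf(docs: list[str]):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF so that terms present in every document keep a positive weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vectors, idf


def cosine(a: dict, b: dict) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, vectors: list[dict], idf: dict) -> list[int]:
    """Return document indices ranked by cosine similarity to the query."""
    q = {t: tf * idf[t] for t, tf in Counter(tokenize(query)).items() if t in idf}
    return sorted(range(len(vectors)), key=lambda i: cosine(q, vectors[i]), reverse=True)


docs = [
    "the cat sat on the mat",
    "dogs chase cats in the park",
    "transformers use attention layers",
]
vectors, idf = build_tfidf(docs)
print(search("attention layers", vectors, idf))
```

Note that "cat" does not match "cats" here, which is exactly the kind of limitation that motivates the subword tokenizers in later phases.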

Phase 2 — Classical NLP & Static Embeddings

Concepts: Word2Vec (CBOW + skip-gram), negative sampling, GloVe, FastText subword, embedding evaluation (analogy, WordSim353), dimensionality reduction.
Difficulty: ⭐⭐⭐☆☆ | Time: 1.5 weeks
Deliverables: Skip-gram trained from scratch on text8; embedding visualization (t-SNE/UMAP); analogy benchmark report.
Roles supported: Pretraining Data Engineer, Research Engineer.

Phase 3 — RNNs & Language Modeling

Concepts: Vanilla RNN forward/backward, vanishing gradients, LSTM gates, GRU, sequence-to-sequence, Bahdanau additive attention, teacher forcing, perplexity.
Difficulty: ⭐⭐⭐☆☆ | Time: 1.5 weeks
Deliverables: Char-RNN trained on Shakespeare; LSTM seq2seq translator (toy).
Roles supported: Foundation Model Engineer (historical context); strong "explain attention" interview answer.
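
Perplexity, listed among the concepts above, is the exponential of the average negative log-likelihood per token. A minimal sketch, assuming the model's per-token natural-log probabilities are already available as a list (the function name is hypothetical):

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """exp of the mean negative log-likelihood (natural log) over the tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every observed token is "as confused as"
# a uniform choice among 4 options.
uniform_over_four = [math.log(0.25)] * 8
print(perplexity(uniform_over_four))
```

The same function applies to a char-level or a subword-level model, but the two numbers are not comparable because the token unit differs.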

Phase 4 — Attention & Transformers (From Scratch)

Concepts: Scaled dot-product attention, masking (causal/padding), multi-head, sinusoidal/RoPE/ALiBi positional encodings, layer norm vs RMSNorm, residual streams, encoder/decoder/decoder-only.
Difficulty: ⭐⭐⭐⭐☆ | Time: 2 weeks
Deliverables: 200-line transformer that passes attention shape tests; visualized attention maps; ablation report (pre-norm vs post-norm).
Roles supported: All research-engineer roles. The most-asked interview topic.
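
As a reference point for the attention lab, the sketch below implements single-head scaled dot-product attention with a causal mask. It is deliberately written in dependency-free Python so the arithmetic is visible; the labs themselves use NumPy and PyTorch, and the function names and toy inputs here are hypothetical.

```python
import math


def softmax(xs: list[float]) -> list[float]:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head attention. Q, K, V are lists of T vectors (lists of floats).

    Returns (outputs, weights) where weights[t][s] is how much position t
    attends to position s. Future positions (s > t) get weight 0, which is
    equivalent to adding -inf to their scores before the softmax.
    """
    d_k = len(Q[0])
    outputs, weights = [], []
    for t, q in enumerate(Q):
        # Scores against keys 0..t only (causal mask), scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, K[s])) / math.sqrt(d_k)
            for s in range(t + 1)
        ]
        w = softmax(scores)
        out = [sum(w[s] * V[s][j] for s in range(t + 1)) for j in range(len(V[0]))]
        outputs.append(out)
        weights.append(w + [0.0] * (len(K) - t - 1))
    return outputs, weights


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = causal_attention(Q, K, V)
for row in weights:
    print([round(x, 3) for x in row])
```

The printed weight matrix is lower-triangular and each row sums to 1, which are the two properties worth asserting in a shape-and-mask unit test.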

Phase 5 — Training Small LLMs

Concepts: BPE tokenization (matching GPT-2), nanoGPT architecture, AdamW, cosine LR schedule, mixed precision (BF16/FP16), gradient accumulation, gradient clipping, checkpointing, sampling (greedy/top-k/top-p/temperature/beam).
Difficulty: ⭐⭐⭐⭐☆ | Time: 2.5 weeks
Deliverables: BPE tokenizer matching tiktoken on test corpus; nanoGPT trained on TinyStories with W&B logs and loss curves.
Roles supported: Research Engineer Pretraining, Foundation Model Engineer.
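
The sampling strategies listed above compose in a fixed order: temperature rescales the logits, then top-k and top-p truncate the distribution before a token is drawn. The following is a minimal sketch over a plain list of logits; the function name and defaults are hypothetical, and greedy decoding corresponds to `top_k=1`.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=random):
    """Draw one token id from logits after temperature, top-k and top-p filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 (use top_k=1 for greedy)")
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate ids, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        # Keep the smallest prefix whose cumulative probability reaches top_p.
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalizes the surviving weights internally.
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]


rng = random.Random(0)  # seeded for reproducibility
logits = [2.0, 1.0, 0.1]
print([sample_next_token(logits, temperature=0.8, top_p=0.9, rng=rng) for _ in range(10)])
```

Passing an explicit `rng` keeps generation reproducible in tests without touching the global random state.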

Phase 6 — Fine-tuning, Instruction Tuning, Preference Optimization

Concepts: SFT, chat templates, LoRA / QLoRA (NF4), reward modeling, RLHF (PPO conceptual), DPO / IPO / KTO, RLAIF, constitutional AI.
Difficulty: ⭐⭐⭐⭐☆ | Time: 2.5 weeks
Deliverables: QLoRA fine-tune of Llama-3-8B or Qwen2-7B on a domain dataset; DPO run with preference dataset; before/after eval table.
Roles supported: Post-training Engineer, Production Model Post-Training (Anthropic-style).
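
For the DPO deliverable, it helps to see the per-example loss as plain arithmetic before reaching for a trainer library. This is a minimal sketch, assuming the summed sequence log-probabilities of the chosen and rejected responses under the policy and the frozen reference model are already computed; the function name, argument names, and example values are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss plus the two implicit rewards.

    loss = -log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)])
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == softplus(-margin), computed in a stable form.
    x = -margin
    loss = max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return loss, chosen_reward, rejected_reward


# Policy identical to the reference: margin 0, loss log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Policy has moved toward the chosen response and away from the rejected one.
print(dpo_loss(-9.0, -13.0, -10.0, -12.0))
```

Logging the two implicit rewards and their margin over training is one simple form of the "reward analysis" mentioned for this phase.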

Phase 7 — RAG, Retrieval, Agents

Concepts: Embedding models (sentence-transformers, E5, BGE), ANN index types (flat vs IVF vs HNSW, e.g. in FAISS), hybrid retrieval (BM25 + dense), re-ranking (cross-encoder, ColBERT), chunking strategies, query rewriting, agent loops, tool use, structured output (JSON schema, constrained decoding).
Difficulty: ⭐⭐⭐⭐☆ | Time: 2 weeks
Deliverables: Production-style RAG over a real corpus with eval (RAGAS); agent that uses 3+ tools.
Roles supported: Applied AI Engineer (OpenAI-style), LLM Inference Engineer.
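
Hybrid retrieval needs some rule for merging the BM25 ranking with the dense ranking. One common choice is reciprocal rank fusion (RRF), sketched below; it is shown here as an illustration and is not necessarily the method the lab prescribes. The function name, the constant `k=60` (a conventional default), and the document ids are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by document id for determinism.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for the same query.
bm25_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]
print(reciprocal_rank_fusion([bm25_ranking, dense_ranking]))
```

RRF uses only ranks, so it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales; a cross-encoder re-ranker can then be applied to the fused list.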

Phase 8 — Evaluation & Safety

Concepts: Benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MT-Bench), perplexity vs downstream eval, LLM-as-judge bias, RAGAS, red-teaming, jailbreak taxonomy, safety classifiers.
Difficulty: ⭐⭐⭐⭐☆ | Time: 1.5 weeks
Deliverables: Forked lm-evaluation-harness task; LLM-as-judge harness with bias analysis; red-team report.
Roles supported: Model Evaluation Engineer, Safety roles.
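
HumanEval-style code benchmarks are usually reported as pass@k. A minimal sketch of the standard unbiased estimator follows, assuming `n` samples were generated per problem and `c` of them passed the unit tests; the function name and the per-problem counts are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("need 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("need 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical results: (n, c) for three problems with 10 samples each.
per_problem = [(10, 3), (10, 0), (10, 10)]
for k in (1, 5):
    score = sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
    print(f"pass@{k} = {score:.3f}")
```

The benchmark score is the mean of the per-problem estimates, which is why the example averages over problems rather than pooling all samples.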

Phase 9 — Inference Optimization & Serving

Concepts: KV-cache mechanics + memory math, paged attention (vLLM), continuous batching, INT8/INT4 quantization (GPTQ, AWQ, bitsandbytes), speculative decoding, prefix caching, FlashAttention-2/3, CUDA graphs, TensorRT-LLM, streaming via SSE.
Difficulty: ⭐⭐⭐⭐⭐ | Time: 2.5 weeks
Deliverables: Custom inference server with KV-cache + continuous batching + INT4 quantization; benchmark report (TTFT/TPOT/throughput).
Roles supported: LLM Inference Engineer, ML Systems Engineer. Highest-leverage phase for infrastructure roles.
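
The "KV-cache memory math" above reduces to one product: two tensors (K and V) per layer, each of size KV-heads × head-dim, per token, per sequence. A minimal sketch, with a hypothetical function name and an illustrative configuration (32 layers, 8 KV heads, head dimension 128, 16-bit cache) that is assumed for the example rather than taken from any specific model card:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for `batch_size` sequences of `seq_len` tokens."""
    # The leading 2 accounts for one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Illustrative config (assumed): 32 layers, 8 KV heads (grouped-query attention),
# head_dim 128, FP16/BF16 cache (2 bytes per element).
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
per_sequence = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(f"{per_token / 1024:.0f} KiB per token")
print(f"{per_sequence / 2**30:.2f} GiB for one 8192-token sequence")
print(f"{kv_cache_bytes(32, 8, 128, 8192, batch_size=32) / 2**30:.0f} GiB at batch size 32")
```

Because the cache grows linearly in both sequence length and batch size, it often limits concurrency before the weights do, which is the motivation for paged attention and KV-cache quantization.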

Phase 10 — Distributed Training & Pretraining Data

Concepts: DDP, FSDP, ZeRO-1/2/3, tensor/pipeline parallelism (conceptual), mixed precision strategies, NCCL, gradient checkpointing (activation recomputation), MinHash dedup, quality filtering (perplexity, FastText, heuristics), tokenization at scale, Common Crawl pipeline.
Difficulty: ⭐⭐⭐⭐⭐ | Time: 2 weeks
Deliverables: 2-GPU FSDP training run (rentable for ~$5); pretraining data pipeline processing 10 GB → deduped + tokenized shards.
Roles supported: Pretraining Data Engineer, ML Infrastructure Engineer, Research Engineer Pretraining.
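
MinHash dedup rests on one fact: the probability that two sets share the same minimum under a random hash equals their Jaccard similarity. The sketch below shows that idea with the standard library only, assuming word 3-gram shingles and salted BLAKE2b as the hash family; the function names and sample texts are hypothetical, and a real pipeline would typically add LSH banding (for example via datasketch) to avoid comparing all pairs.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Set of word n-grams; short texts fall back to a single shingle."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum per salted hash function; the salt plays the role of a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of signature slots that agree approximates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank"
doc_b = "the quick brown fox jumps over the lazy dog near the river shore"
doc_c = "completely unrelated sentence about training large language models on gpus"
sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print("near-duplicate:", estimate_jaccard(sig_a, sig_b))
print("unrelated:     ", estimate_jaccard(sig_a, sig_c))
```

The estimate's standard error shrinks roughly as one over the square root of `num_perm`, so the signature length is a direct accuracy-versus-cost knob.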

Phase 11 — Capstone Projects

Four portfolio-grade systems. Pick at least 2 to ship publicly.

Mini-GPT pretrained on a custom corpus (your dataset, full pipeline, model card)
Production RAG with eval (hybrid retrieval, RAGAS, A/B harness)
LLM inference gateway (KV-cache, batching, quantization, streaming, observability)
Domain-assistant fine-tune (SFT + DPO + eval comparison vs base)

The Top 10 Projects to Prioritize (Resume-Critical)

These are the projects that, when present on a portfolio, change interview outcomes:

#	Project	Phase	Why It Matters
1	BPE tokenizer matching GPT-2	5	Proves you understand pretraining stack from byte 0
2	Attention from scratch + visualizations	4	The single most-asked LLM interview topic
3	nanoGPT trained on TinyStories	5	End-to-end training credibility
4	QLoRA fine-tune of a 7B model	6	Demonstrates GPU-efficient post-training
5	DPO run with reward analysis	6	Modern preference-optimization fluency
6	Production RAG with RAGAS eval	7	The most common "applied AI" interview project
7	Inference gateway (KV-cache + batching + INT4)	9	Direct fit for LLM Inference Engineer roles
8	Eval harness (base vs fine-tune vs RAG)	8	Shows scientific rigor
9	Pretraining data pipeline (dedup + filter + tokenize)	10	Direct fit for Pretraining Data Engineer roles
10	FSDP training run with profiling	10	Distributed-training credibility

A Recommended Learning Order

1 → 2 → 3 → 4  (theory + scratch builds — sequential, no skipping)
        ↓
        5  (training mechanics — sequential)
        ↓
        ├── 6  (fine-tuning) ──┐
        ├── 7  (RAG)        ──┼──> 8 (evaluation ties everything together)
        └── 9  (inference)  ──┘
                ↓
                10 (distributed) → 11 (capstones)

You can swap the order of 6 / 7 / 9 based on the role you're targeting.

Job Titles to Search For

Use these exact strings on LinkedIn / Greenhouse / Ashby / company career pages:

"Research Engineer, Pretraining"
"Research Engineer, Post-Training"
"Research Engineer, Applied AI"
"Foundation Model Engineer"
"LLM Infrastructure Engineer"
"ML Systems Engineer (LLM)"
"LLM Inference Engineer"
"ML Performance Engineer"
"Machine Learning Engineer, Generative AI"
"Model Evaluation Engineer"
"AI Safety Engineer"
"Pretraining Data Engineer"
"Member of Technical Staff" (used by Anthropic, OpenAI, Mistral)

Skill Checklist — "Am I Ready to Apply?"

Apply when you can honestly check ✅ on at least 80% of these:

Theory

Derive attention end-to-end on a whiteboard
Implement multi-head attention from scratch in <50 lines
Explain RoPE rotation math
Compare LayerNorm vs RMSNorm and justify modern choice
Explain KV-cache memory math
Derive DPO loss from RLHF objective
Explain LoRA's rank decomposition and why it works
Compute the parameter count of a transformer given d_model, n_layers, n_heads, vocab_size

Engineering

Train a transformer from scratch end-to-end
Fine-tune a 7B+ model on a single 24GB GPU using QLoRA
Run a multi-GPU FSDP training job
Build a RAG system with hybrid retrieval and re-ranking
Quantize a model to INT4 and measure quality regression
Implement continuous batching for an inference server
Build a pretraining data pipeline with MinHash dedup

Portfolio

8+ public GitHub repos with READMEs, benchmarks, diagrams
At least 1 project with reproducible training run + W&B logs
At least 1 project with profiling output (Nsight, PyTorch profiler)
A blog post or technical writeup of one capstone
A resume with quantified, LLM-specific bullets

6-Month Plan (Aggressive, ~15 hr/week)

Month	Phases	Outcome
1	1–3	TF-IDF search, Word2Vec, char-RNN — all from scratch
2	4–5	Transformer + nanoGPT trained on TinyStories
3	5–6	Sampling strategies; QLoRA fine-tune of 7B
4	6–7	DPO + production RAG with eval
5	8–9	Eval harness; inference gateway with KV-cache + INT4
6	10–11	FSDP run + pretraining data pipeline + 2 capstones

12-Month Plan (Deeper, ~10 hr/week — recommended for career switchers)

Same phases as the 6-month plan at a lower weekly load; the additional months go to:

Months 7–8: CUDA fundamentals + Triton kernels (write a fused softmax)
Months 9–10: One frontier-paper reimplementation (FlashAttention, Mixture-of-Experts, Mamba)
Months 11–12: Capstone polish, blog posts, open-source contributions to vLLM / TGI / Transformers / lm-eval-harness

GitHub Portfolio Structure (Recommended)

your-github/
├── llm-from-scratch/                   ← Phases 1–4 in one repo (educational)
│   ├── 01-tokenization/
│   ├── 02-word2vec/
│   ├── 03-rnn-lstm/
│   └── 04-transformer/
├── nanogpt-tinystories/                ← Phase 5 capstone (single repo, polished)
├── qlora-domain-assistant/             ← Phase 6 capstone with eval
├── rag-production/                     ← Phase 7 capstone, full README + diagrams
├── llm-inference-gateway/              ← Phase 9 capstone (the hire-magnet)
├── lm-eval-harness-extension/          ← Phase 8 — contribute to upstream
├── pretraining-data-pipeline/          ← Phase 10
└── blog/                               ← MDX or plain markdown — link from each repo

Each Repo's README Should Have

One-sentence pitch above the fold
Architecture diagram (Excalidraw, Mermaid, or draw.io PNG)
Benchmarks table (numbers > prose)
Reproduction steps (make train, make eval)
Tradeoffs section — why you chose X over Y
Limitations — shows engineering maturity
What I'd do next — shows extensibility thinking

Resume Bullet Patterns

Use the action → system → quantified outcome → technical depth pattern:

"Built an LLM inference gateway supporting continuous batching, paged KV-cache, and INT4 GPTQ quantization, achieving 3.2× throughput improvement (412 → 1,317 tok/s) and 41% lower P99 TTFT on Llama-3-8B at 32 concurrent requests."

"Implemented a MinHash-LSH deduplication and FastText quality-filtering pipeline processing 180 GB of CommonCrawl WET shards into 41 GB of training-ready tokens, with reproducible Snakemake DAG and per-shard quality histograms."

"Pre-trained a 42M-parameter decoder-only transformer from scratch on TinyStories using a custom BPE tokenizer matching GPT-2, mixed precision, gradient accumulation, and cosine LR schedule on a single A100; achieved train loss 1.42 / val 1.51 in 4.2 GPU-hours."

Tools & Technologies Covered

Languages:        Python 3.10+, shell, basic CUDA/Triton overview
Core ML:          PyTorch 2.x, NumPy
Models / Libs:    Hugging Face transformers, datasets, accelerate, peft, trl
Tokenizers:       tiktoken, sentencepiece, hf-tokenizers
Training:         Lightning / pure PyTorch, FSDP, DeepSpeed (overview), bitsandbytes
Fine-tuning:      LoRA, QLoRA, DPO/IPO/KTO via trl
Retrieval:        FAISS, Qdrant, pgvector, sentence-transformers, BM25 (rank_bm25)
Eval:             lm-evaluation-harness, RAGAS, MT-Bench, HELM concepts
Inference:        vLLM, TGI, llama.cpp, TensorRT-LLM (overview), ONNX Runtime, AWQ/GPTQ quantization
Serving:          FastAPI, Uvicorn, Triton Inference Server (overview)
Observability:    OpenTelemetry, Prometheus, Grafana, Langfuse, W&B
Data:             pyspark / dask / polars, datasketch (MinHash), fasttext
Hardware:         CUDA, NCCL, BF16/FP16, A100/H100/L4/T4

Quick Start

# 1. Navigate to the curriculum root
cd /path/to/llm-inference-engineer

# 2. Create a virtual environment
python -m venv .venv && source .venv/bin/activate

# 3. Install Phase 1 deps and start
pip install -r phase-01-foundations-text/lab-01-tokenization-from-scratch/requirements.txt
code phase-01-foundations-text/README.md

Mindset: You are not learning LLMs as an end in themselves. You are learning them well enough to build, debug, and ship the systems that frontier labs hire for. Every lab in this curriculum was designed by working backward from a real interview loop or a real production system. Do the work, ship the repos, and apply.