LLM / Foundation-Model Engineer — Complete Learning Curriculum
Target Roles:
- Research Engineer, Pretraining (Anthropic, OpenAI, DeepMind, Meta, Mistral, xAI)
- LLM Infrastructure Engineer / ML Systems Engineer
- Foundation Model Engineer
- Post-training / Fine-tuning Engineer (RLHF, DPO, SFT)
- LLM Inference Engineer (vLLM/TGI/TensorRT-LLM class work)
- Model Evaluation Engineer
- Pretraining Data Engineer
- Applied AI / Production AI Engineer
Duration: 24 weeks core (6 months) — extendable to 12 months for deep specialization
Goal: Reach interview-ready expertise with a portfolio competitive for senior LLM/foundation-model roles at frontier labs.
Why This Curriculum Exists
The hiring bar at frontier labs (Anthropic, OpenAI, DeepMind, Meta AI, Mistral, xAI, Cohere) is not "have you used ChatGPT" — it is "can you implement attention from scratch, debug a 64-GPU training run, profile a CUDA kernel, design a 100k-QPS inference gateway, and explain why DPO converges differently than PPO".
This curriculum is built backward from real job postings (referenced below) and is structured so that every lab maps to a real interview question or production system you would build on the job.
Reference Job Targets
- Anthropic — Research Engineer, Pretraining (JD) → Phases 4, 5, 10, Capstone 1
- Anthropic — Research Engineer, Production Model Post-Training → Phases 6, 8, Capstone 4
- OpenAI — Research Engineer, Applied AI (JD) → Phases 7, 9, Capstones 2 & 3
- Google DeepMind — Research Engineer, Gemini Latent Thinking → Phases 4, 5, 6, 8
- Meta AI — Research / Production roles (Careers) → Phases 5, 9, 10
What You Will Build
By the end of this curriculum you will have shipped:
- A working BPE tokenizer that matches GPT-2 output byte-for-byte
- Word2Vec, attention, and a transformer block — all from scratch in NumPy and PyTorch
- A nanoGPT-style model trained on a custom corpus (TinyStories or your own)
- A LoRA / QLoRA fine-tuning pipeline on an open 7B model
- A DPO preference-optimization run with reward analysis
- A production-grade RAG system with hybrid retrieval, re-ranking, and an eval harness
- An inference gateway with continuous batching, KV-cache, streaming, quantization, observability
- A pretraining data pipeline with deduplication (MinHash), quality filtering (FastText/heuristics), and tokenization at scale
- A multi-GPU training experiment using FSDP / DDP with mixed precision and gradient accumulation
- An evaluation harness comparing base, fine-tuned, and RAG-augmented models on MMLU/HellaSwag/HumanEval-style tasks
- A complete portfolio of 10+ GitHub repos with READMEs, benchmarks, diagrams, and ablations
Folder Structure
llm-inference-engineer/
├── README.md ← You are here (master roadmap)
├── phase-01-foundations-text/ ← Tokenization, BoW, TF-IDF, similarity, PyTorch
├── phase-02-classical-nlp-embeddings/ ← Word2Vec, GloVe, FastText, embedding eval
├── phase-03-rnns-language-modeling/ ← RNN/LSTM/GRU, char-LM, seq2seq, Bahdanau attention
├── phase-04-attention-transformers/ ← Self-attention, MHA, positional encodings, full transformer
├── phase-05-training-small-llms/ ← Mini-GPT, BPE, training loop, mixed precision, sampling
├── phase-06-finetuning-instruction/ ← SFT, LoRA/QLoRA, instruction data, RLHF/DPO
├── phase-07-rag-retrieval/ ← Vector DBs, hybrid search, re-ranking, agents/tool use
├── phase-08-evaluation-safety/ ← Eval harness, LLM-as-judge, red-teaming, benchmarks
├── phase-09-inference-optimization/ ← KV-cache, quantization, batching, vLLM/TGI, spec decoding
├── phase-10-distributed-production/ ← DDP/FSDP, pretraining data pipeline, observability
├── phase-11-capstone/ ← 4 portfolio-grade end-to-end systems
├── system-design/ ← LLM-specific system design walkthroughs
└── interview-prep/ ← Concepts, coding, ML systems, behavioral
24-Week Schedule
| Week | Phase | Focus |
|---|---|---|
| 1 | 1 | Python/PyTorch refresh, tokenization (regex → BPE intuition) |
| 2 | 1 | BoW, TF-IDF from scratch, cosine-similarity search |
| 3 | 2 | Word2Vec skip-gram from scratch (NumPy + PyTorch) |
| 4 | 2 | GloVe, FastText, embedding evaluation (analogies, WordSim) |
| 5 | 3 | RNN forward/backward by hand, char-level language model |
| 6 | 3 | LSTM/GRU, gradient flow, seq2seq with Bahdanau attention |
| 7 | 4 | Scaled dot-product attention from scratch + masking |
| 8 | 4 | Multi-head attention, positional encodings (sinusoidal, RoPE, ALiBi) |
| 9 | 4 | Full transformer block, encoder/decoder/decoder-only variants |
| 10 | 5 | BPE tokenizer matching GPT-2; nanoGPT architecture |
| 11 | 5 | Training loop, mixed precision, grad accumulation, checkpointing |
| 12 | 5 | Sampling: greedy, top-k, top-p, temperature, beam, contrastive |
| 13 | 6 | Supervised fine-tuning (SFT) on instruction data |
| 14 | 6 | LoRA + QLoRA on a 7B open model |
| 15 | 6 | Reward modeling, DPO/IPO/KTO preference optimization |
| 16 | 7 | Embedding pipelines, vector DBs (FAISS, pgvector, Qdrant) |
| 17 | 7 | Hybrid retrieval (BM25 + dense), re-ranking, RAG eval |
| 18 | 7 | Agents, tool use, structured outputs, function calling |
| 19 | 8 | Eval harness (lm-eval-harness style), MMLU/HellaSwag scoring |
| 20 | 8 | LLM-as-judge, RAGAS, red-teaming, safety filters |
| 21 | 9 | KV-cache deep dive, paged attention, continuous batching |
| 22 | 9 | Quantization (INT8, INT4, AWQ, GPTQ), speculative decoding |
| 23 | 10 | DDP/FSDP, ZeRO, pretraining data pipeline (dedup, filter, tokenize) |
| 24 | 11 | Capstone integration + interview prep review |
Each Lab Structure
Every lab folder contains:
| File | Purpose |
|---|---|
| README.md | Theory, math derivations, design rationale, interview Q&A, talking points |
| lab.py | Guided exercise with # TODO markers — you fill in the blanks |
| solution.py | Reference solution with inline commentary |
| requirements.txt | Pinned pip dependencies |
| DATASETS.md | Where applicable — download links and expected layout |
Project Specification Template
Every non-trivial project in this curriculum is described with the same template, so you can lift any lab into a portfolio-ready repo:
| Field | What it Captures |
|---|---|
| Project Title | Short, resume-friendly name |
| Goal | One sentence: what problem does this solve? |
| Concepts Learned | The 3–7 core ideas you internalize |
| Implementation Steps | Ordered checklist of what you build |
| Suggested Tech Stack | Libraries, frameworks, hardware tier |
| Dataset Suggestions | Specific datasets with sizes |
| Expected Output | Concrete artifact (model, plot, metric, server) |
| How to Test | Unit tests, sanity benchmarks, ablations |
| Interview Talking Points | Tradeoffs and design decisions to discuss |
| Resume Bullet Examples | Quantified achievement statements |
| Extensions | How to make the project portfolio-grade |
The phase READMEs (phase-XX/README.md) instantiate this template for every lab.
Prerequisites
- Python 3.10+
- Comfort with backend / distributed systems (you have this)
- Basic linear algebra (matrix multiply, eigenvectors) — Phase 1 has a refresher
- A Hugging Face account (free) for model + dataset access
- Optional: Weights & Biases / Comet ML account for experiment tracking
Hardware Recommendations
| Tier | Setup | Best For |
|---|---|---|
| Minimal | CPU laptop (16 GB RAM) | Phases 1–4, tiny models, NumPy from-scratch work |
| Mid | 1× consumer GPU (RTX 3090/4090, 24 GB) | Phases 5–9, fine-tuning ≤7B with QLoRA |
| Recommended | 1× A100 40 GB or 2× 4090 | Phase 5 nanoGPT training, BF16 LoRA SFT on 7B (full-parameter SFT on 7B needs sharding or offloading) |
| Cloud (cheap) | RunPod / Lambda / Vast.ai spot A100 — $1–2/hr | Phases 6, 9, 10 — pay only when training |
| Free tier | Google Colab T4, Kaggle P100 | Almost all labs in scaled-down form |
You do NOT need a GPU cluster. Every lab in this curriculum has a "small-model mode" that runs on Colab free tier. Capstones can be completed for under $50 of cloud GPU time.
System Design Philosophy
Every production-oriented lab (Phases 7, 9, 10) is evaluated on the same five axes that frontier-lab interviewers care about:
- Throughput — tokens/sec at the system level (not just the model)
- Latency — TTFT (time-to-first-token) and TPOT (time-per-output-token), P50/P99
- Memory efficiency — KV-cache size, activation memory, parameter offloading
- Cost — $/million-tokens served, $/training-run, GPU-hour utilization
- Observability — request tracing, token-level metrics, drift detection, eval-in-production
Each capstone explicitly reports numbers on these axes.
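The latency and throughput axes above reduce to simple arithmetic over per-token timestamps. The following is a minimal sketch, assuming you log the request arrival time and the arrival time of every streamed token; the function names (`ttft_and_tpot`, `percentile`, `throughput`) are hypothetical helpers, not part of any serving library, and the percentile uses the nearest-rank definition as one reasonable choice.

```python
import math


def ttft_and_tpot(request_start, token_times):
    """Return (TTFT, TPOT) in seconds for one streamed response.

    request_start: time the request was received.
    token_times: arrival time of each output token, in order.
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_start
    # TPOT averages the gaps between consecutive tokens after the first one.
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return ttft, tpot


def percentile(values, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def throughput(total_tokens, wall_seconds):
    """System-level tokens/sec: all tokens from all requests over wall time."""
    return total_tokens / wall_seconds
```

Collect one TTFT and one TPOT value per request, then report `percentile(values, 50)` and `percentile(values, 99)` for each.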
Phase-by-Phase Overview
Each phase has its own README.md with full lab specs, concept list, deliverables, and interview questions. Below is the index — click into the phase folder for depth.
Phase 1 — Foundations: Text, Math, PyTorch
Concepts: Tokenization (whitespace → regex → byte-level), bag-of-words, TF-IDF, cosine similarity, PyTorch tensors/autograd, broadcasting, CPU/GPU dispatch.
Difficulty: ⭐⭐☆☆☆ | Time: 1–2 weeks
Deliverables: From-scratch TF-IDF search engine over a Wikipedia subset; PyTorch tensor playground notebook.
Roles supported: All — this is non-negotiable foundation.
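To make the TF-IDF and cosine-similarity concepts concrete, here is a minimal standard-library sketch. It assumes documents are already tokenized into lists of strings and uses a smoothed IDF, log((1 + N) / (1 + df)) + 1, which is one common variant rather than the only definition; `tfidf_vectors` and `cosine` are hypothetical names chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenized document to a sparse {term: weight} dict."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (assumed variant): log((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

A search engine built on this would vectorize the query with the same IDF table and rank documents by `cosine` against it.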
Phase 2 — Classical NLP & Static Embeddings
Concepts: Word2Vec (CBOW + skip-gram), negative sampling, GloVe, FastText subword, embedding evaluation (analogy, WordSim353), dimensionality reduction.
Difficulty: ⭐⭐⭐☆☆ | Time: 1.5 weeks
Deliverables: Skip-gram trained from scratch on text8; embedding visualization (t-SNE/UMAP); analogy benchmark report.
Roles supported: Pretraining Data Engineer, Research Engineer.
Phase 3 — RNNs & Language Modeling
Concepts: Vanilla RNN forward/backward, vanishing gradients, LSTM gates, GRU, sequence-to-sequence, Bahdanau additive attention, teacher forcing, perplexity.
Difficulty: ⭐⭐⭐☆☆ | Time: 1.5 weeks
Deliverables: Char-RNN trained on Shakespeare; LSTM seq2seq translator (toy).
Roles supported: Foundation Model Engineer (historical context); strong "explain attention" interview answer.
Phase 4 — Attention & Transformers (From Scratch)
Concepts: Scaled dot-product attention, masking (causal/padding), multi-head, sinusoidal/RoPE/ALiBi positional encodings, layer norm vs RMSNorm, residual streams, encoder/decoder/decoder-only.
Difficulty: ⭐⭐⭐⭐☆ | Time: 2 weeks
Deliverables: 200-line transformer that passes attention shape tests; visualized attention maps; ablation report (pre-norm vs post-norm).
Roles supported: All research-engineer roles. The most-asked interview topic.
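As a reference point for scaled dot-product attention with a causal mask, the sketch below implements a single head with no batch dimension. It is written in plain Python to stay dependency-free, whereas the labs themselves use NumPy and PyTorch; `causal_attention` is a hypothetical name, and the mask is applied by scoring each query only against positions up to its own, which is equivalent to setting later scores to negative infinity before the softmax.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(Q, K, V):
    """Single-head causal attention.

    Q, K, V: lists of T vectors (lists of floats).
    Returns (outputs, weights), where weights is a T x T matrix.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)  # dividing by sqrt(d_k) keeps score variance near 1
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Score only positions 0..i; future positions get zero weight.
        scores = [sum(a * b for a, b in zip(q, K[j])) / scale
                  for j in range(i + 1)]
        w = softmax(scores) + [0.0] * (len(K) - i - 1)
        weights.append(w)
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights
```

The first output row always equals the first value vector, because position 0 can attend only to itself; this makes a convenient sanity test.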
Phase 5 — Training Small LLMs
Concepts: BPE tokenization (matching GPT-2), nanoGPT architecture, AdamW, cosine LR schedule, mixed precision (BF16/FP16), gradient accumulation, gradient clipping, checkpointing, sampling (greedy/top-k/top-p/temperature/beam).
Difficulty: ⭐⭐⭐⭐☆ | Time: 2.5 weeks
Deliverables: BPE tokenizer matching tiktoken on test corpus; nanoGPT trained on TinyStories with W&B logs and loss curves.
Roles supported: Research Engineer Pretraining, Foundation Model Engineer.
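The core of BPE training is a loop that counts adjacent symbol pairs and merges the most frequent one. The sketch below shows that loop at the character level, as a simplification: GPT-2's tokenizer additionally works on bytes and applies a regex pre-tokenization step, both omitted here. The function names (`train_bpe`, `merge_pair`, `encode`) are hypothetical.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words, num_merges):
    """Learn an ordered list of merge rules from a list of words."""
    vocab = Counter(tuple(w) for w in words)  # word (as symbols) -> frequency
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Tokenize a word by applying the learned merges in training order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Concatenating the tokens returned by `encode` always reproduces the input word, which is a useful invariant to test before comparing against tiktoken.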
Phase 6 — Fine-tuning, Instruction Tuning, Preference Optimization
Concepts: SFT, chat templates, LoRA / QLoRA (NF4), reward modeling, RLHF (PPO conceptual), DPO / IPO / KTO, RLAIF, constitutional AI.
Difficulty: ⭐⭐⭐⭐☆ | Time: 2.5 weeks
Deliverables: QLoRA fine-tune of Llama-3-8B or Qwen2-7B on a domain dataset; DPO run with preference dataset; before/after eval table.
Roles supported: Post-training Engineer, Production Model Post-Training (Anthropic-style).
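For the DPO deliverable, it helps to see the per-example loss written out. The sketch below assumes you already have sequence-level log-probabilities (summed over response tokens) for the chosen and rejected responses under both the policy and the frozen reference model. The name `dpo_loss` and the default `beta=0.1` are illustrative assumptions; in practice the `trl` library computes this for you.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss plus the two implicit rewards.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) = log(1 + exp(-m)), computed in a stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, chosen_reward, rejected_reward
```

When the policy equals the reference, the margin is zero and the loss is log 2; tracking the two implicit rewards over training is one simple form of the reward analysis mentioned in the deliverables.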
Phase 7 — RAG, Retrieval, Agents
Concepts: Embedding models (sentence-transformers, E5, BGE), ANN index types (flat vs HNSW vs IVF, e.g. in FAISS), hybrid retrieval (BM25 + dense), re-ranking (cross-encoder, ColBERT), chunking strategies, query rewriting, agent loops, tool use, structured output (JSON schema, constrained decoding).
Difficulty: ⭐⭐⭐⭐☆ | Time: 2 weeks
Deliverables: Production-style RAG over a real corpus with eval (RAGAS); agent that uses 3+ tools.
Roles supported: Applied AI Engineer (OpenAI-style), LLM Inference Engineer.
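Hybrid retrieval needs some way to combine a BM25 ranking with a dense ranking. This curriculum does not prescribe a fusion method; the sketch below uses reciprocal rank fusion (RRF) as one common option, since it needs only ranks and not comparable scores. The function name and the constant `k=60` are assumptions chosen for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    rankings: list of lists, each ordered best-first.
    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


bm25_hits = ["d1", "d2", "d3"]    # hypothetical lexical results
dense_hits = ["d3", "d1", "d4"]   # hypothetical embedding results
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

Documents that appear in both lists rise to the top, and the fused list would then be passed to the re-ranker.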
Phase 8 — Evaluation & Safety
Concepts: Benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MT-Bench), perplexity vs downstream eval, LLM-as-judge bias, RAGAS, red-teaming, jailbreak taxonomy, safety classifiers.
Difficulty: ⭐⭐⭐⭐☆ | Time: 1.5 weeks
Deliverables: Forked lm-evaluation-harness task; LLM-as-judge harness with bias analysis; red-team report.
Roles supported: Model Evaluation Engineer, Safety roles.
Phase 9 — Inference Optimization & Serving
Concepts: KV-cache mechanics + memory math, paged attention (vLLM), continuous batching, INT8/INT4 quantization (GPTQ, AWQ, bitsandbytes), speculative decoding, prefix caching, FlashAttention-2/3, CUDA graphs, TensorRT-LLM, streaming via SSE.
Difficulty: ⭐⭐⭐⭐⭐ | Time: 2.5 weeks
Deliverables: Custom inference server with KV-cache + continuous batching + INT4 quantization; benchmark report (TTFT/TPOT/throughput).
Roles supported: LLM Inference Engineer, ML Systems Engineer. Highest-leverage phase for infrastructure roles.
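The KV-cache memory math is a single product: two tensors (K and V) per layer, each holding one vector of size head_dim per KV head per token. The sketch below assumes a dense cache with no paging overhead; `kv_cache_bytes` is a hypothetical helper, and the example configuration (32 layers, 8 KV heads, head_dim 128, FP16) is an illustrative grouped-query setup rather than a claim about any specific model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_elem=2):
    """Size of a dense KV cache in bytes.

    The leading 2 accounts for storing both keys and values.
    bytes_per_elem is 2 for FP16/BF16, 1 for INT8.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Illustrative config: 128 KiB per token, 1 GiB at 8,192 tokens.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)
one_request = kv_cache_bytes(32, 8, 128, seq_len=8192)
print(per_token / 1024, "KiB per token")
print(one_request / 2**30, "GiB for one 8,192-token sequence")
```

Because the size grows linearly in both sequence length and batch size, the cache, not the weights, often limits how many concurrent requests fit on a GPU.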
Phase 10 — Distributed Training & Pretraining Data
Concepts: DDP, FSDP, ZeRO-1/2/3, tensor/pipeline parallelism (conceptual), mixed precision strategies, NCCL, gradient checkpointing, activation recomputation, MinHash dedup, quality filtering (perplexity, FastText, heuristics), tokenization at scale, Common Crawl pipeline.
Difficulty: ⭐⭐⭐⭐⭐ | Time: 2 weeks
Deliverables: 2-GPU FSDP training run (rentable for ~$5); pretraining data pipeline processing 10 GB → deduped + tokenized shards.
Roles supported: Pretraining Data Engineer, ML Infrastructure Engineer, Research Engineer Pretraining.
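MinHash dedup rests on one fact: the probability that two sets share the same minimum hash equals their Jaccard similarity. The sketch below demonstrates this with the standard library only, using salted BLAKE2b as a stand-in for independent hash functions and word 3-grams as shingles. At scale you would more likely use `datasketch` with LSH banding; the names `shingles`, `minhash_signature`, and `estimated_jaccard` are hypothetical.

```python
import hashlib


def shingles(text, n=3):
    """Set of word n-grams; short texts yield a single shingle."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n])
            for i in range(max(1, len(tokens) - n + 1))}


def minhash_signature(items, num_perm=128):
    """One minimum hash value per salted hash function."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # distinct salt per "permutation"
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"),
                                digest_size=8, salt=salt).digest(),
                "big")
            for s in items))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

A dedup pass would compute one signature per document and drop documents whose estimated similarity to an already-kept document exceeds a chosen threshold.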
Phase 11 — Capstone Projects
Four portfolio-grade systems. Pick at least 2 to ship publicly.
- Mini-GPT pretrained on a custom corpus (your dataset, full pipeline, model card)
- Production RAG with eval (hybrid retrieval, RAGAS, A/B harness)
- LLM inference gateway (KV-cache, batching, quantization, streaming, observability)
- Domain-assistant fine-tune (SFT + DPO + eval comparison vs base)
The Top 10 Projects to Prioritize (Resume-Critical)
These are the projects that, when present on a portfolio, change interview outcomes:
| # | Project | Phase | Why It Matters |
|---|---|---|---|
| 1 | BPE tokenizer matching GPT-2 | 5 | Proves you understand pretraining stack from byte 0 |
| 2 | Attention from scratch + visualizations | 4 | The single most-asked LLM interview topic |
| 3 | nanoGPT trained on TinyStories | 5 | End-to-end training credibility |
| 4 | QLoRA fine-tune of a 7B model | 6 | Demonstrates GPU-efficient post-training |
| 5 | DPO run with reward analysis | 6 | Modern preference-optimization fluency |
| 6 | Production RAG with RAGAS eval | 7 | The most common "applied AI" interview project |
| 7 | Inference gateway (KV-cache + batching + INT4) | 9 | Direct fit for LLM Inference Engineer roles |
| 8 | Eval harness (base vs fine-tune vs RAG) | 8 | Shows scientific rigor |
| 9 | Pretraining data pipeline (dedup + filter + tokenize) | 10 | Direct fit for Pretraining Data Engineer roles |
| 10 | FSDP training run with profiling | 10 | Distributed-training credibility |
Top 20 Interview Questions (Curated)
Full answers in interview-prep/01-concepts-cheatsheet.md.
- Derive scaled dot-product attention. Why divide by √d_k?
- Explain causal masking. Implement it in 5 lines.
- Compare sinusoidal, learned, RoPE, and ALiBi positional encodings.
- Why does pre-norm train more stably than post-norm?
- Walk through one forward + backward pass of a transformer block.
- Explain KV-cache. What is its memory footprint? When does it become the bottleneck?
- Compare LoRA, QLoRA, full fine-tuning. When would you use each?
- Explain DPO derivation. Why does it not need a separate reward model?
- Compare PPO and DPO. Pros and cons.
- Explain ZeRO-1/2/3 and FSDP. What does each shard?
- What is continuous batching? How does paged attention enable it?
- Compare INT8 and INT4 quantization (GPTQ vs AWQ vs bitsandbytes NF4).
- Speculative decoding — explain the algorithm and the speedup math.
- Compare BM25, dense retrieval, ColBERT, and a cross-encoder re-ranker.
- How would you build an LLM eval pipeline that catches regressions in prod?
- Design a RAG system for 100M documents at 1k QPS.
- Design an LLM inference gateway for 100k QPS with multi-model routing.
- Walk through a pretraining data pipeline: filtering, dedup, tokenization, sharding.
- Why is BPE the dominant tokenizer? What are its failure modes?
- Explain mixed precision (BF16 vs FP16) and loss scaling.
A Recommended Learning Order
1 → 2 → 3 → 4 (theory + scratch builds — sequential, no skipping)
↓
5 (training mechanics — sequential)
↓
├── 6 (fine-tuning) ──┐
├── 7 (RAG) ──┼──> 8 (evaluation ties everything together)
└── 9 (inference) ──┘
↓
10 (distributed) → 11 (capstones)
You can swap the order of 6 / 7 / 9 based on the role you're targeting.
Job Titles to Search For
Use these exact strings on LinkedIn / Greenhouse / Ashby / company career pages:
- "Research Engineer, Pretraining"
- "Research Engineer, Post-Training"
- "Research Engineer, Applied AI"
- "Foundation Model Engineer"
- "LLM Infrastructure Engineer"
- "ML Systems Engineer (LLM)"
- "LLM Inference Engineer"
- "ML Performance Engineer"
- "Machine Learning Engineer, Generative AI"
- "Model Evaluation Engineer"
- "AI Safety Engineer"
- "Pretraining Data Engineer"
- "Member of Technical Staff" (used by Anthropic, OpenAI, Mistral)
Skill Checklist — "Am I Ready to Apply?"
Apply when you can honestly check ✅ on at least 80% of these:
Theory
- Derive attention end-to-end on a whiteboard
- Implement multi-head attention from scratch in <50 lines
- Explain RoPE rotation math
- Compare LayerNorm vs RMSNorm and justify modern choice
- Explain KV-cache memory math
- Derive DPO loss from RLHF objective
- Explain LoRA's rank decomposition and why it works
- Compute the parameter count of a transformer given d_model, n_layers, n_heads, vocab_size
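The parameter-count item can be checked with a few lines of arithmetic. The sketch below assumes a GPT-2-style decoder: learned positional embeddings, biases on every linear layer, a 4× MLP expansion, two LayerNorms per block plus a final one, and an output head tied to the token embedding. Other architectures (RMSNorm, no biases, gated MLPs, grouped-query attention) change the constants. `transformer_param_count` and `context_len` are names introduced here for illustration.

```python
def transformer_param_count(d_model, n_layers, n_heads, vocab_size,
                            context_len=0):
    """Parameter count for a GPT-2-style decoder-only transformer.

    n_heads does not change the total under standard multi-head attention;
    it only has to divide d_model evenly.
    """
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d = d_model
    embeddings = vocab_size * d + context_len * d
    attention = 4 * d * d + 4 * d       # Q, K, V, output projections + biases
    mlp = 8 * d * d + 5 * d             # d -> 4d -> d, with biases
    layer_norms = 2 * 2 * d             # two LayerNorms (scale + shift) per block
    per_layer = attention + mlp + layer_norms
    final_norm = 2 * d
    return embeddings + n_layers * per_layer + final_norm


# GPT-2-small-shaped config: d_model=768, 12 layers, 12 heads,
# vocab 50,257, context 1,024.
total = transformer_param_count(768, 12, 12, 50257, context_len=1024)
print(f"{total:,}")
```

A useful rule of thumb falls out of this: ignoring embeddings, each block contributes roughly 12 × d_model² parameters.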
Engineering
- Train a transformer from scratch end-to-end
- Fine-tune a 7B+ model on a single 24GB GPU using QLoRA
- Run a multi-GPU FSDP training job
- Build a RAG system with hybrid retrieval and re-ranking
- Quantize a model to INT4 and measure quality regression
- Implement continuous batching for an inference server
- Build a pretraining data pipeline with MinHash dedup
Portfolio
- 8+ public GitHub repos with READMEs, benchmarks, diagrams
- At least 1 project with reproducible training run + W&B logs
- At least 1 project with profiling output (Nsight, PyTorch profiler)
- A blog post or technical writeup of one capstone
- A resume with quantified, LLM-specific bullets
6-Month Plan (Aggressive, ~15 hr/week)
| Month | Phases | Outcome |
|---|---|---|
| 1 | 1–3 | TF-IDF search, Word2Vec, char-RNN — all from scratch |
| 2 | 4–5 | Transformer + nanoGPT trained on TinyStories |
| 3 | 5–6 | Sampling strategies; QLoRA fine-tune of 7B |
| 4 | 6–7 | DPO + production RAG with eval |
| 5 | 8–9 | Eval harness; inference gateway with KV-cache + INT4 |
| 6 | 10–11 | FSDP run + pretraining data pipeline + 2 capstones |
12-Month Plan (Deeper, ~10 hr/week — recommended for career switchers)
Same as above but each month covers half the content; the extra months go to:
- Months 7–8: CUDA fundamentals + Triton kernels (write a fused softmax)
- Months 9–10: One frontier-paper reimplementation (FlashAttention, Mixture-of-Experts, Mamba)
- Months 11–12: Capstone polish, blog posts, open-source contributions to vLLM / TGI / Transformers / lm-eval-harness
GitHub Portfolio Structure (Recommended)
your-github/
├── llm-from-scratch/ ← Phases 1–4 in one repo (educational)
│ ├── 01-tokenization/
│ ├── 02-word2vec/
│ ├── 03-rnn-lstm/
│ └── 04-transformer/
├── nanogpt-tinystories/ ← Phase 5 capstone (single repo, polished)
├── qlora-domain-assistant/ ← Phase 6 capstone with eval
├── rag-production/ ← Phase 7 capstone, full README + diagrams
├── llm-inference-gateway/ ← Phase 9 capstone (the hire-magnet)
├── lm-eval-harness-extension/ ← Phase 8 — contribute to upstream
├── pretraining-data-pipeline/ ← Phase 10
└── blog/ ← MDX or plain markdown — link from each repo
Each Repo's README Should Have
- One-sentence pitch above the fold
- Architecture diagram (Excalidraw, Mermaid, or draw.io PNG)
- Benchmarks table (numbers > prose)
- Reproduction steps (make train, make eval)
- Tradeoffs section — why you chose X over Y
- Limitations — shows engineering maturity
- What I'd do next — shows extensibility thinking
Resume Bullet Patterns
Use the action → system → quantified outcome → technical depth pattern:
"Built an LLM inference gateway supporting continuous batching, paged KV-cache, and INT4 GPTQ quantization, achieving 3.2× throughput improvement (412 → 1,317 tok/s) and 41% lower P99 TTFT on Llama-3-8B at 32 concurrent requests."
"Implemented a MinHash-LSH deduplication and FastText quality-filtering pipeline processing 180 GB of CommonCrawl WET shards into 41 GB of training-ready tokens, with reproducible Snakemake DAG and per-shard quality histograms."
"Pre-trained a 42M-parameter decoder-only transformer from scratch on TinyStories using a custom BPE tokenizer matching GPT-2, mixed precision, gradient accumulation, and cosine LR schedule on a single A100; achieved train loss 1.42 / val 1.51 in 4.2 GPU-hours."
Tools & Technologies Covered
Languages: Python 3.11+, shell, basic CUDA/Triton overview
Core ML: PyTorch 2.x, NumPy
Models / Libs: Hugging Face transformers, datasets, accelerate, peft, trl
Tokenizers: tiktoken, sentencepiece, hf-tokenizers
Training: Lightning / pure PyTorch, FSDP, DeepSpeed (overview), bitsandbytes
Fine-tuning: LoRA, QLoRA, DPO/IPO/KTO via trl
Retrieval: FAISS, Qdrant, pgvector, sentence-transformers, BM25 (rank_bm25)
Eval: lm-evaluation-harness, RAGAS, MT-Bench, HELM concepts
Inference: vLLM, TGI, llama.cpp, TensorRT-LLM (overview), ONNX Runtime
Serving: FastAPI, Uvicorn, Triton Inference Server (overview)
Observability: OpenTelemetry, Prometheus, Grafana, Langfuse, W&B
Data: pyspark / dask / polars, datasketch (MinHash), fasttext
Hardware / numerics: CUDA, NCCL, BF16/FP16, A100/H100/L4/T4, AWQ/GPTQ
Quick Start
# 1. Navigate to the curriculum root
cd /path/to/llm-inference-engineer
# 2. Create a virtual environment
python -m venv .venv && source .venv/bin/activate
# 3. Install Phase 1 deps and start
pip install -r phase-01-foundations-text/lab-01-tokenization-from-scratch/requirements.txt
code phase-01-foundations-text/README.md
Mindset: You are not learning LLMs as an end. You are learning them well enough to build, debug, and ship the systems that frontier labs hire for. Every lab in this curriculum was designed by working backward from a real interview loop or a real production system. Do the work, ship the repos, and apply.