LLM / Foundation-Model Engineer — Complete Learning Curriculum

Target Roles:

  • Research Engineer, Pretraining (Anthropic, OpenAI, DeepMind, Meta, Mistral, xAI)
  • LLM Infrastructure Engineer / ML Systems Engineer
  • Foundation Model Engineer
  • Post-training / Fine-tuning Engineer (RLHF, DPO, SFT)
  • LLM Inference Engineer (vLLM/TGI/TensorRT-LLM class work)
  • Model Evaluation Engineer
  • Pretraining Data Engineer
  • Applied AI / Production AI Engineer

Duration: 24 weeks core (6 months), extendable to 12 months for deep specialization

Goal: Reach interview-ready expertise with a portfolio competitive for senior LLM/foundation-model roles at frontier labs.


Why This Curriculum Exists

The hiring bar at frontier labs (Anthropic, OpenAI, DeepMind, Meta AI, Mistral, xAI, Cohere) is not "have you used ChatGPT" — it is "can you implement attention from scratch, debug a 64-GPU training run, profile a CUDA kernel, design a 100k-QPS inference gateway, and explain why DPO converges differently than PPO".

This curriculum is built backward from real job postings (referenced below) and is structured so that every lab maps to a real interview question or production system you would build on the job.

Reference Job Targets

  • Anthropic — Research Engineer, Pretraining (JD) → Phases 4, 5, 10, Capstone 1
  • Anthropic — Research Engineer, Production Model Post-Training → Phases 6, 8, Capstone 4
  • OpenAI — Research Engineer, Applied AI (JD) → Phases 7, 9, Capstones 2 & 3
  • Google DeepMind — Research Engineer, Gemini Latent Thinking → Phases 4, 5, 6, 8
  • Meta AI — Research / Production roles (Careers) → Phases 5, 9, 10

What You Will Build

By the end of this curriculum you will have shipped:

  • A working BPE tokenizer that matches GPT-2 output byte-for-byte
  • Word2Vec, attention, and a transformer block — all from scratch in NumPy and PyTorch
  • A nanoGPT-style model trained on a custom corpus (TinyStories or your own)
  • A LoRA / QLoRA fine-tuning pipeline on an open 7B model
  • A DPO preference-optimization run with reward analysis
  • A production-grade RAG system with hybrid retrieval, re-ranking, and an eval harness
  • An inference gateway with continuous batching, KV-cache, streaming, quantization, observability
  • A pretraining data pipeline with deduplication (MinHash), quality filtering (FastText/heuristics), and tokenization at scale
  • A multi-GPU training experiment using FSDP / DDP with mixed precision and gradient accumulation
  • An evaluation harness comparing base, fine-tuned, and RAG-augmented models on MMLU/HellaSwag/HumanEval-style tasks
  • A complete portfolio of 10+ GitHub repos with READMEs, benchmarks, diagrams, and ablations

Folder Structure

llm-inference-engineer/
├── README.md                              ← You are here (master roadmap)
├── phase-01-foundations-text/             ← Tokenization, BoW, TF-IDF, similarity, PyTorch
├── phase-02-classical-nlp-embeddings/     ← Word2Vec, GloVe, FastText, embedding eval
├── phase-03-rnns-language-modeling/       ← RNN/LSTM/GRU, char-LM, seq2seq, Bahdanau attention
├── phase-04-attention-transformers/       ← Self-attention, MHA, positional encodings, full transformer
├── phase-05-training-small-llms/          ← Mini-GPT, BPE, training loop, mixed precision, sampling
├── phase-06-finetuning-instruction/       ← SFT, LoRA/QLoRA, instruction data, RLHF/DPO
├── phase-07-rag-retrieval/                ← Vector DBs, hybrid search, re-ranking, agents/tool use
├── phase-08-evaluation-safety/            ← Eval harness, LLM-as-judge, red-teaming, benchmarks
├── phase-09-inference-optimization/       ← KV-cache, quantization, batching, vLLM/TGI, spec decoding
├── phase-10-distributed-production/       ← DDP/FSDP, pretraining data pipeline, observability
├── phase-11-capstone/                     ← 4 portfolio-grade end-to-end systems
├── system-design/                         ← LLM-specific system design walkthroughs
└── interview-prep/                        ← Concepts, coding, ML systems, behavioral

24-Week Schedule

| Week | Phase | Focus |
|------|-------|-------|
| 1 | 1 | Python/PyTorch refresh, tokenization (regex → BPE intuition) |
| 2 | 1 | BoW, TF-IDF from scratch, cosine-similarity search |
| 3 | 2 | Word2Vec skip-gram from scratch (NumPy + PyTorch) |
| 4 | 2 | GloVe, FastText, embedding evaluation (analogies, WordSim) |
| 5 | 3 | RNN forward/backward by hand, char-level language model |
| 6 | 3 | LSTM/GRU, gradient flow, seq2seq with Bahdanau attention |
| 7 | 4 | Scaled dot-product attention from scratch + masking |
| 8 | 4 | Multi-head attention, positional encodings (sinusoidal, RoPE, ALiBi) |
| 9 | 4 | Full transformer block, encoder/decoder/decoder-only variants |
| 10 | 5 | BPE tokenizer matching GPT-2; nanoGPT architecture |
| 11 | 5 | Training loop, mixed precision, grad accumulation, checkpointing |
| 12 | 5 | Sampling: greedy, top-k, top-p, temperature, beam, contrastive |
| 13 | 6 | Supervised fine-tuning (SFT) on instruction data |
| 14 | 6 | LoRA + QLoRA on a 7B open model |
| 15 | 6 | Reward modeling, DPO/IPO/KTO preference optimization |
| 16 | 7 | Embedding pipelines, vector DBs (FAISS, pgvector, Qdrant) |
| 17 | 7 | Hybrid retrieval (BM25 + dense), re-ranking, RAG eval |
| 18 | 7 | Agents, tool use, structured outputs, function calling |
| 19 | 8 | Eval harness (lm-eval-harness style), MMLU/HellaSwag scoring |
| 20 | 8 | LLM-as-judge, RAGAS, red-teaming, safety filters |
| 21 | 9 | KV-cache deep dive, paged attention, continuous batching |
| 22 | 9 | Quantization (INT8, INT4, AWQ, GPTQ), speculative decoding |
| 23 | 10 | DDP/FSDP, ZeRO, pretraining data pipeline (dedup, filter, tokenize) |
| 24 | 11 | Capstone integration + interview prep review |

Each Lab Structure

Every lab folder contains:

| File | Purpose |
|------|---------|
| README.md | Theory, math derivations, design rationale, interview Q&A, talking points |
| lab.py | Guided exercise with # TODO markers; you fill in the blanks |
| solution.py | Reference solution with inline commentary |
| requirements.txt | Pinned pip dependencies |
| DATASETS.md | Where applicable: download links and expected layout |

Project Specification Template

Every non-trivial project in this curriculum is described with the same template, so you can lift any lab into a portfolio-ready repo:

| Field | What It Captures |
|-------|------------------|
| Project Title | Short, resume-friendly name |
| Goal | One sentence: what problem does this solve? |
| Concepts Learned | The 3–7 core ideas you internalize |
| Implementation Steps | Ordered checklist of what you build |
| Suggested Tech Stack | Libraries, frameworks, hardware tier |
| Dataset Suggestions | Specific datasets with sizes |
| Expected Output | Concrete artifact (model, plot, metric, server) |
| How to Test | Unit tests, sanity benchmarks, ablations |
| Interview Talking Points | Tradeoffs and design decisions to discuss |
| Resume Bullet Examples | Quantified achievement statements |
| Extensions | How to make the project portfolio-grade |

The phase READMEs (phase-XX/README.md) instantiate this template for every lab.


Prerequisites

  • Python 3.10+
  • Comfort with backend / distributed systems (you have this)
  • Basic linear algebra (matrix multiply, eigenvectors) — Phase 1 has a refresher
  • A Hugging Face account (free) for model + dataset access
  • Optional: Weights & Biases / Comet ML account for experiment tracking

Hardware Recommendations

| Tier | Setup | Best For |
|------|-------|----------|
| Minimal | CPU laptop (16 GB RAM) | Phases 1–4, tiny models, NumPy from-scratch work |
| Mid | 1× consumer GPU (RTX 3090/4090, 24 GB) | Phases 5–9, fine-tuning ≤7B with QLoRA |
| Recommended | 1× A100 40 GB or 2× 4090 | Phase 5 nanoGPT training, full SFT on 7B |
| Cloud (cheap) | RunPod / Lambda / Vast.ai spot A100, $1–2/hr | Phases 6, 9, 10 (pay only when training) |
| Free tier | Google Colab T4, Kaggle P100 | Almost all labs in scaled-down form |

You do NOT need a GPU cluster. Every lab in this curriculum has a "small-model mode" that runs on Colab free tier. Capstones can be completed for under $50 of cloud GPU time.


System Design Philosophy

Every production-oriented lab (Phases 7, 9, 10) is evaluated on the same five axes that frontier-lab interviewers care about:

  1. Throughput — tokens/sec at the system level (not just the model)
  2. Latency — TTFT (time-to-first-token) and TPOT (time-per-output-token), P50/P99
  3. Memory efficiency — KV-cache size, activation memory, parameter offloading
  4. Cost — $/million-tokens served, $/training-run, GPU-hour utilization
  5. Observability — request tracing, token-level metrics, drift detection, eval-in-production

Each capstone explicitly reports numbers on these axes.
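
As an illustration of the latency axis, here is a minimal sketch of how TTFT and TPOT could be computed from per-token arrival timestamps. The function names (`latency_metrics`, `percentile`) and the nearest-rank percentile definition are assumptions made for this example, not something the labs prescribe.

```python
import math


def latency_metrics(request_start: float, token_times: list[float]) -> dict[str, float]:
    """Return TTFT and TPOT (seconds) for one request.

    request_start: wall-clock time the request was received.
    token_times:   wall-clock arrival time of each generated token, in order.
    """
    if not token_times:
        raise ValueError("need at least one generated token")
    ttft = token_times[0] - request_start
    # TPOT averages the gaps between consecutive tokens after the first one.
    n_gaps = len(token_times) - 1
    tpot = (token_times[-1] - token_times[0]) / n_gaps if n_gaps else 0.0
    return {"ttft": ttft, "tpot": tpot}


def percentile(values: list[float], p: float) -> float:
    """Nearest-rank percentile for p in (0, 100] (one of several common definitions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]
```

Collecting `latency_metrics(...)["ttft"]` over many requests and passing the list to `percentile(values, 50)` and `percentile(values, 99)` gives the P50/P99 figures a benchmark report would list.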


Phase-by-Phase Overview

Each phase has its own README.md with full lab specs, concept list, deliverables, and interview questions. Below is the index — click into the phase folder for depth.

Phase 1 — Foundations: Text, Math, PyTorch

Concepts: Tokenization (whitespace → regex → byte-level), bag-of-words, TF-IDF, cosine similarity, PyTorch tensors/autograd, broadcasting, CPU/GPU dispatch.
Difficulty: ⭐⭐☆☆☆ | Time: 1–2 weeks
Deliverables: From-scratch TF-IDF search engine over a Wikipedia subset; PyTorch tensor playground notebook.
Roles supported: All. This foundation is non-negotiable.
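
To make the TF-IDF deliverable concrete, the sketch below shows one way a from-scratch TF-IDF search with cosine similarity might look. It assumes whitespace tokenization and a smoothed IDF formula; the helper names (`tfidf_vectors`, `cosine`, `search`) are hypothetical and the actual lab may define things differently.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list[str]):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF (an assumption for this sketch): always positive.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: (n / len(tokens)) * idf[t] for t, n in tf.items()})
    return vectors, idf


def cosine(a: dict, b: dict) -> float:
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, docs: list[str]):
    """Rank documents against the query; returns (score, doc_index) pairs, best first."""
    vectors, idf = tfidf_vectors(docs)
    q_tf = Counter(query.lower().split())
    # Query terms never seen in the corpus are dropped.
    q_vec = {t: n * idf[t] for t, n in q_tf.items() if t in idf}
    return sorted(((cosine(q_vec, v), i) for i, v in enumerate(vectors)), reverse=True)
```

Recomputing the corpus vectors on every query keeps the sketch short; a real search engine would build the index once and reuse it.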

Phase 2 — Classical NLP & Static Embeddings

Concepts: Word2Vec (CBOW + skip-gram), negative sampling, GloVe, FastText subword, embedding evaluation (analogy, WordSim353), dimensionality reduction.
Difficulty: ⭐⭐⭐☆☆ | Time: 1.5 weeks
Deliverables: Skip-gram trained from scratch on text8; embedding visualization (t-SNE/UMAP); analogy benchmark report.
Roles supported: Pretraining Data Engineer, Research Engineer.

Phase 3 — RNNs & Language Modeling

Concepts: Vanilla RNN forward/backward, vanishing gradients, LSTM gates, GRU, sequence-to-sequence, Bahdanau additive attention, teacher forcing, perplexity.
Difficulty: ⭐⭐⭐☆☆ | Time: 1.5 weeks
Deliverables: Char-RNN trained on Shakespeare; LSTM seq2seq translator (toy).
Roles supported: Foundation Model Engineer (historical context); strong "explain attention" interview answer.

Phase 4 — Attention & Transformers (From Scratch)

Concepts: Scaled dot-product attention, masking (causal/padding), multi-head, sinusoidal/RoPE/ALiBi positional encodings, layer norm vs RMSNorm, residual streams, encoder/decoder/decoder-only.
Difficulty: ⭐⭐⭐⭐☆ | Time: 2 weeks
Deliverables: 200-line transformer that passes attention shape tests; visualized attention maps; ablation report (pre-norm vs post-norm).
Roles supported: All research-engineer roles. The most-asked interview topic.
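
For orientation, a single-head version of scaled dot-product attention with a causal mask can be sketched in NumPy as follows. This is a minimal illustration under simplifying assumptions (no batch dimension, no padding mask, no learned projections); the function name `causal_attention` is made up for the example.

```python
import numpy as np


def causal_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v: arrays of shape (seq_len, d_k).
    Returns (output, weights) with shapes (seq_len, d_k) and (seq_len, seq_len).
    """
    seq_len, d_k = q.shape
    # Scale by sqrt(d_k) so the score variance does not grow with the head size.
    scores = q @ k.T / np.sqrt(d_k)
    # Position i may only attend to positions <= i: mask the strict upper triangle.
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights
```

Because the first position can only see itself, the first output row equals `v[0]`, which makes a handy shape-and-mask sanity test.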

Phase 5 — Training Small LLMs

Concepts: BPE tokenization (matching GPT-2), nanoGPT architecture, AdamW, cosine LR schedule, mixed precision (BF16/FP16), gradient accumulation, gradient clipping, checkpointing, sampling (greedy/top-k/top-p/temperature/beam).
Difficulty: ⭐⭐⭐⭐☆ | Time: 2.5 weeks
Deliverables: BPE tokenizer matching tiktoken on test corpus; nanoGPT trained on TinyStories with W&B logs and loss curves.
Roles supported: Research Engineer Pretraining, Foundation Model Engineer.
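
The sampling strategies listed above (temperature, top-k, top-p) can be combined in one small function. The NumPy sketch below is one plausible arrangement, assuming a 1-D logits vector and a strictly positive temperature; `sample_next_token` and its parameters are illustrative names rather than the lab's API.

```python
import numpy as np


def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample one token id from logits using temperature, top-k and/or top-p filtering."""
    if rng is None:
        rng = np.random.default_rng()
    # Temperature must be > 0; values below 1 sharpen the distribution.
    logits = np.asarray(logits, dtype=np.float64) / temperature
    if top_k is not None:
        # Keep the k highest logits (ties at the threshold are kept as well).
        k = min(top_k, logits.size)
        threshold = np.sort(logits)[-k]
        logits = np.where(logits < threshold, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p is not None:
        order = np.argsort(-probs)
        cumulative = np.cumsum(probs[order])
        # Smallest prefix whose probability mass reaches top_p (at least one token).
        cutoff = int(np.searchsorted(cumulative, top_p)) + 1
        keep = order[:cutoff]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered / filtered.sum()
    return int(rng.choice(probs.size, p=probs))
```

With `top_k=1` this reduces to greedy decoding, which is a convenient way to check the filter logic deterministically.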

Phase 6 — Fine-tuning, Instruction Tuning, Preference Optimization

Concepts: SFT, chat templates, LoRA / QLoRA (NF4), reward modeling, RLHF (PPO conceptual), DPO / IPO / KTO, RLAIF, constitutional AI.
Difficulty: ⭐⭐⭐⭐☆ | Time: 2.5 weeks
Deliverables: QLoRA fine-tune of Llama-3-8B or Qwen2-7B on a domain dataset; DPO run with preference dataset; before/after eval table.
Roles supported: Post-training Engineer, Production Model Post-Training (Anthropic-style).
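
As a reference point for the DPO deliverable, the standard DPO objective can be written in a few lines of NumPy. The sketch assumes each input is an array of summed sequence log-probabilities (one entry per preference pair) under the trainable policy and the frozen reference model; the function name and the default `beta=0.1` are assumptions for illustration.

```python
import numpy as np


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Mean DPO loss plus the implicit per-example rewards.

    Each argument is an array of sequence log-probabilities, one per preference pair.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (np.asarray(policy_chosen_logp) - np.asarray(ref_chosen_logp))
    rejected_reward = beta * (np.asarray(policy_rejected_logp) - np.asarray(ref_rejected_logp))
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably.
    loss = np.logaddexp(0.0, -margin)
    return float(loss.mean()), chosen_reward, rejected_reward
```

When the policy equals the reference, every margin is zero and the loss is log 2; tracking how the chosen and rejected rewards separate during training is one simple form of reward analysis.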

Phase 7 — RAG, Retrieval, Agents

Concepts: Embedding models (sentence-transformers, E5, BGE), ANN index types (flat vs IVF vs HNSW, e.g. in FAISS), hybrid retrieval (BM25 + dense), re-ranking (cross-encoder, ColBERT), chunking strategies, query rewriting, agent loops, tool use, structured output (JSON schema, constrained decoding).
Difficulty: ⭐⭐⭐⭐☆ | Time: 2 weeks
Deliverables: Production-style RAG over a real corpus with eval (RAGAS); agent that uses 3+ tools.
Roles supported: Applied AI Engineer (OpenAI-style), LLM Inference Engineer.
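
Hybrid retrieval needs some way to merge the BM25 ranking with the dense ranking. The curriculum does not say which fusion method the labs use; the sketch below uses reciprocal rank fusion (RRF) purely as one common, score-free option, with the conventional constant `k=60`. The function name and the string document ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document ids into one list, best first.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the top hit of each list.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending fused score; break ties by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, fusing a BM25 list `["a", "b", "c"]` with a dense list `["b", "c", "a"]` puts `"b"` first, because it ranks near the top of both lists. RRF only needs ranks, so it sidesteps the problem of BM25 and cosine scores living on different scales.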

Phase 8 — Evaluation & Safety

Concepts: Benchmarks (MMLU, HellaSwag, GSM8K, HumanEval, IFEval, MT-Bench), perplexity vs downstream eval, LLM-as-judge bias, RAGAS, red-teaming, jailbreak taxonomy, safety classifiers.
Difficulty: ⭐⭐⭐⭐☆ | Time: 1.5 weeks
Deliverables: Forked lm-evaluation-harness task; LLM-as-judge harness with bias analysis; red-team report.
Roles supported: Model Evaluation Engineer, Safety roles.

Phase 9 — Inference Optimization & Serving

Concepts: KV-cache mechanics + memory math, paged attention (vLLM), continuous batching, INT8/INT4 quantization (GPTQ, AWQ, bitsandbytes), speculative decoding, prefix caching, FlashAttention-2/3, CUDA graphs, TensorRT-LLM, streaming via SSE.
Difficulty: ⭐⭐⭐⭐⭐ | Time: 2.5 weeks
Deliverables: Custom inference server with KV-cache + continuous batching + INT4 quantization; benchmark report (TTFT/TPOT/throughput).
Roles supported: LLM Inference Engineer, ML Systems Engineer. Highest-leverage phase for infrastructure roles.
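
The KV-cache memory math comes down to one product. The sketch below states it as a function; the example configuration (32 layers, 8 KV heads, head dimension 128, 16-bit cache entries) is an assumption chosen to resemble a Llama-3-8B-class model with grouped-query attention, so check it against the model card before relying on the numbers.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch_size: int = 1, bytes_per_elem: int = 2) -> int:
    """Bytes needed to cache keys and values for every layer, head and position."""
    # Factor 2 = one tensor for keys plus one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


# Assumed Llama-3-8B-like config: 32 layers, 8 KV heads (GQA), head_dim 128, FP16/BF16 cache.
per_token = kv_cache_bytes(32, 8, 128, seq_len=1)        # 131072 bytes = 128 KiB per token
one_request = kv_cache_bytes(32, 8, 128, seq_len=8192)   # 2**30 bytes = 1 GiB at 8k context
batch_of_32 = kv_cache_bytes(32, 8, 128, seq_len=8192, batch_size=32)  # 32 GiB
```

Under these assumptions, 32 concurrent 8k-token requests need about twice the memory of the 16-bit weights of an 8B model, which is why the cache, not the parameters, tends to limit batch size at long context.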

Phase 10 — Distributed Training & Pretraining Data

Concepts: DDP, FSDP, ZeRO-1/2/3, tensor/pipeline parallelism (conceptual), mixed precision strategies, NCCL, gradient checkpointing (activation recomputation), MinHash dedup, quality filtering (perplexity, FastText, heuristics), tokenization at scale, Common Crawl pipeline.
Difficulty: ⭐⭐⭐⭐⭐ | Time: 2 weeks
Deliverables: 2-GPU FSDP training run (rentable for ~$5); pretraining data pipeline processing 10 GB → deduped + tokenized shards.
Roles supported: Pretraining Data Engineer, ML Infrastructure Engineer, Research Engineer Pretraining.
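
To show what MinHash dedup estimates, here is a small standard-library sketch. It assumes word-trigram shingles and 128 salted BLAKE2b hashes standing in for random permutations; a pipeline at scale would more likely use a library such as datasketch with LSH banding, and the function names here are illustrative.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set[str]:
    """Set of word n-grams used as the document's features."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """For each of num_perm salted hash functions, keep the minimum hash over all items."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")  # a different salt acts as a different hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots; an unbiased estimate of Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)
```

Two documents whose estimated similarity exceeds a chosen threshold (0.8 is a frequent choice, but that is a tuning decision) would be treated as near-duplicates and collapsed to one copy.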

Phase 11 — Capstone Projects

Four portfolio-grade systems. Pick at least 2 to ship publicly.

  1. Mini-GPT pretrained on a custom corpus (your dataset, full pipeline, model card)
  2. Production RAG with eval (hybrid retrieval, RAGAS, A/B harness)
  3. LLM inference gateway (KV-cache, batching, quantization, streaming, observability)
  4. Domain-assistant fine-tune (SFT + DPO + eval comparison vs base)

The Top 10 Projects to Prioritize (Resume-Critical)

These are the projects that, when present on a portfolio, change interview outcomes:

| # | Project | Phase | Why It Matters |
|---|---------|-------|----------------|
| 1 | BPE tokenizer matching GPT-2 | 5 | Proves you understand the pretraining stack from byte 0 |
| 2 | Attention from scratch + visualizations | 4 | The single most-asked LLM interview topic |
| 3 | nanoGPT trained on TinyStories | 5 | End-to-end training credibility |
| 4 | QLoRA fine-tune of a 7B model | 6 | Demonstrates GPU-efficient post-training |
| 5 | DPO run with reward analysis | 6 | Modern preference-optimization fluency |
| 6 | Production RAG with RAGAS eval | 7 | The most common "applied AI" interview project |
| 7 | Inference gateway (KV-cache + batching + INT4) | 9 | Direct fit for LLM Inference Engineer roles |
| 8 | Eval harness (base vs fine-tune vs RAG) | 8 | Shows scientific rigor |
| 9 | Pretraining data pipeline (dedup + filter + tokenize) | 10 | Direct fit for Pretraining Data Engineer roles |
| 10 | FSDP training run with profiling | 10 | Distributed-training credibility |

Top 20 Interview Questions (Curated)

Full answers in interview-prep/01-concepts-cheatsheet.md.

  1. Derive scaled dot-product attention. Why divide by √d_k?
  2. Explain causal masking. Implement it in 5 lines.
  3. Compare sinusoidal, learned, RoPE, and ALiBi positional encodings.
  4. Why does pre-norm train more stably than post-norm?
  5. Walk through one forward + backward pass of a transformer block.
  6. Explain KV-cache. What is its memory footprint? When does it become the bottleneck?
  7. Compare LoRA, QLoRA, full fine-tuning. When would you use each?
  8. Explain DPO derivation. Why does it not need a separate reward model?
  9. Compare PPO and DPO. Pros and cons.
  10. Explain ZeRO-1/2/3 and FSDP. What does each shard?
  11. What is continuous batching? How does paged attention enable it?
  12. Compare INT8 and INT4 quantization (GPTQ vs AWQ vs bitsandbytes NF4).
  13. Speculative decoding — explain the algorithm and the speedup math.
  14. Compare BM25, dense retrieval, ColBERT, and a cross-encoder re-ranker.
  15. How would you build an LLM eval pipeline that catches regressions in prod?
  16. Design a RAG system for 100M documents at 1k QPS.
  17. Design an LLM inference gateway for 100k QPS with multi-model routing.
  18. Walk through a pretraining data pipeline: filtering, dedup, tokenization, sharding.
  19. Why is BPE the dominant tokenizer? What are its failure modes?
  20. Explain mixed precision (BF16 vs FP16) and loss scaling.

1 → 2 → 3 → 4  (theory + scratch builds — sequential, no skipping)
        ↓
        5  (training mechanics — sequential)
        ↓
        ├── 6  (fine-tuning) ──┐
        ├── 7  (RAG)        ──┼──> 8 (evaluation ties everything together)
        └── 9  (inference)  ──┘
                ↓
                10 (distributed) → 11 (capstones)

You can swap the order of 6 / 7 / 9 based on the role you're targeting.


Job Titles to Search For

Use these exact strings on LinkedIn / Greenhouse / Ashby / company career pages:

  • "Research Engineer, Pretraining"
  • "Research Engineer, Post-Training"
  • "Research Engineer, Applied AI"
  • "Foundation Model Engineer"
  • "LLM Infrastructure Engineer"
  • "ML Systems Engineer (LLM)"
  • "LLM Inference Engineer"
  • "ML Performance Engineer"
  • "Machine Learning Engineer, Generative AI"
  • "Model Evaluation Engineer"
  • "AI Safety Engineer"
  • "Pretraining Data Engineer"
  • "Member of Technical Staff" (used by Anthropic, OpenAI, Mistral)

Skill Checklist — "Am I Ready to Apply?"

Apply when you can honestly check ✅ on at least 80% of these:

Theory

  • Derive attention end-to-end on a whiteboard
  • Implement multi-head attention from scratch in <50 lines
  • Explain RoPE rotation math
  • Compare LayerNorm vs RMSNorm and justify modern choice
  • Explain KV-cache memory math
  • Derive DPO loss from RLHF objective
  • Explain LoRA's rank decomposition and why it works
  • Compute the parameter count of a transformer given d_model, n_layers, n_heads, vocab_size
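
For the parameter-count item, a worked example helps. The sketch below assumes a GPT-2-style decoder (learned position embeddings, biases on all linear layers, a 4× MLP, LayerNorm with scale and bias, tied input/output embeddings); other architectures such as Llama-style models with RMSNorm, gated MLPs and no biases need a different formula. The context length `n_ctx` is an extra input the checklist line does not mention, and `n_heads` does not appear because, in standard multi-head attention, splitting `d_model` across heads leaves the parameter count unchanged.

```python
def gpt2_style_param_count(d_model: int, n_layers: int, vocab_size: int,
                           n_ctx: int, tie_embeddings: bool = True) -> int:
    """Parameter count for a GPT-2-style decoder-only transformer."""
    embeddings = vocab_size * d_model + n_ctx * d_model   # token + learned position
    attention = 4 * d_model * d_model + 4 * d_model       # Q, K, V, output proj (+ biases)
    mlp = 8 * d_model * d_model + 5 * d_model              # d -> 4d -> d (+ biases)
    layer_norms = 4 * d_model                              # two LayerNorms per block
    per_layer = attention + mlp + layer_norms              # = 12 d^2 + 13 d
    final_norm = 2 * d_model
    lm_head = 0 if tie_embeddings else vocab_size * d_model
    return embeddings + n_layers * per_layer + final_norm + lm_head


# GPT-2 small: d_model=768, 12 layers, vocab 50257, context 1024.
gpt2_small = gpt2_style_param_count(768, 12, 50257, 1024)  # 124,439,808
```

The 12·d² term per layer dominates at scale, which is why a quick mental estimate of 12 × n_layers × d_model² plus the embedding table is usually close enough on a whiteboard.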

Engineering

  • Train a transformer from scratch end-to-end
  • Fine-tune a 7B+ model on a single 24GB GPU using QLoRA
  • Run a multi-GPU FSDP training job
  • Build a RAG system with hybrid retrieval and re-ranking
  • Quantize a model to INT4 and measure quality regression
  • Implement continuous batching for an inference server
  • Build a pretraining data pipeline with MinHash dedup

Portfolio

  • 8+ public GitHub repos with READMEs, benchmarks, diagrams
  • At least 1 project with reproducible training run + W&B logs
  • At least 1 project with profiling output (Nsight, PyTorch profiler)
  • A blog post or technical writeup of one capstone
  • A resume with quantified, LLM-specific bullets

6-Month Plan (Aggressive, ~15 hr/week)

| Month | Phases | Outcome |
|-------|--------|---------|
| 1 | 1–3 | TF-IDF search, Word2Vec, char-RNN, all from scratch |
| 2 | 4–5 | Transformer + nanoGPT trained on TinyStories |
| 3 | 5–6 | Sampling strategies; QLoRA fine-tune of 7B |
| 4 | 6–7 | DPO + production RAG with eval |
| 5 | 8–9 | Eval harness; inference gateway with KV-cache + INT4 |
| 6 | 10–11 | FSDP run + pretraining data pipeline + 2 capstones |

For the 12-month extension, follow the same plan as above with each month covering half the content; the extra months go to:

  • Months 7–8: CUDA fundamentals + Triton kernels (write a fused softmax)
  • Months 9–10: One frontier-paper reimplementation (FlashAttention, Mixture-of-Experts, Mamba)
  • Months 11–12: Capstone polish, blog posts, open-source contributions to vLLM / TGI / Transformers / lm-eval-harness

your-github/
├── llm-from-scratch/                   ← Phases 1–4 in one repo (educational)
│   ├── 01-tokenization/
│   ├── 02-word2vec/
│   ├── 03-rnn-lstm/
│   └── 04-transformer/
├── nanogpt-tinystories/                ← Phase 5 capstone (single repo, polished)
├── qlora-domain-assistant/             ← Phase 6 capstone with eval
├── rag-production/                     ← Phase 7 capstone, full README + diagrams
├── llm-inference-gateway/              ← Phase 9 capstone (the hire-magnet)
├── lm-eval-harness-extension/          ← Phase 8 — contribute to upstream
├── pretraining-data-pipeline/          ← Phase 10
└── blog/                               ← MDX or plain markdown — link from each repo

Each Repo's README Should Have

  1. One-sentence pitch above the fold
  2. Architecture diagram (Excalidraw, Mermaid, or draw.io PNG)
  3. Benchmarks table (numbers > prose)
  4. Reproduction steps (make train, make eval)
  5. Tradeoffs section — why you chose X over Y
  6. Limitations — shows engineering maturity
  7. What I'd do next — shows extensibility thinking

Resume Bullet Patterns

Use the action → system → quantified outcome → technical depth pattern:

"Built an LLM inference gateway supporting continuous batching, paged KV-cache, and INT4 GPTQ quantization, achieving 3.2× throughput improvement (412 → 1,317 tok/s) and 41% lower P99 TTFT on Llama-3-8B at 32 concurrent requests."

"Implemented a MinHash-LSH deduplication and FastText quality-filtering pipeline processing 180 GB of CommonCrawl WET shards into 41 GB of training-ready tokens, with reproducible Snakemake DAG and per-shard quality histograms."

"Pre-trained a 42M-parameter decoder-only transformer from scratch on TinyStories using a custom BPE tokenizer matching GPT-2, mixed precision, gradient accumulation, and cosine LR schedule on a single A100; achieved train loss 1.42 / val 1.51 in 4.2 GPU-hours."


Tools & Technologies Covered

Languages:        Python 3.11+, shell, basic CUDA/Triton overview
Core ML:          PyTorch 2.x, NumPy
Models / Libs:    Hugging Face transformers, datasets, accelerate, peft, trl
Tokenizers:       tiktoken, sentencepiece, hf-tokenizers
Training:         Lightning / pure PyTorch, FSDP, DeepSpeed (overview), bitsandbytes
Fine-tuning:      LoRA, QLoRA, DPO/IPO/KTO via trl
Retrieval:        FAISS, Qdrant, pgvector, sentence-transformers, BM25 (rank_bm25)
Eval:             lm-evaluation-harness, RAGAS, MT-Bench, HELM concepts
Inference:        vLLM, TGI, llama.cpp, TensorRT-LLM (overview), ONNX Runtime, AWQ/GPTQ
Serving:          FastAPI, Uvicorn, Triton Inference Server (overview)
Observability:    OpenTelemetry, Prometheus, Grafana, Langfuse, W&B
Data:             pyspark / dask / polars, datasketch (MinHash), fasttext
Hardware:         CUDA, NCCL, BF16/FP16, A100/H100/L4/T4

Quick Start

# 1. Navigate to the curriculum root
cd /path/to/llm-inference-engineer

# 2. Create a virtual environment
python -m venv .venv && source .venv/bin/activate

# 3. Install Phase 1 deps and start
pip install -r phase-01-foundations-text/lab-01-tokenization-from-scratch/requirements.txt
# Opens the Phase 1 README in VS Code; any editor works
code phase-01-foundations-text/README.md

Mindset: You are not learning LLMs as an end. You are learning them well enough to build, debug, and ship the systems that frontier labs hire for. Every lab in this curriculum was designed by working backward from a real interview loop or a real production system. Do the work, ship the repos, and apply.