Phase 7 — RAG, Retrieval, Agents

Difficulty: ⭐⭐⭐⭐☆ | Estimated Time: 2 weeks Roles supported: Applied AI Engineer (OpenAI-style), LLM Inference Engineer, ML Systems Engineer.

Why This Phase Exists

RAG is the most-deployed LLM pattern in industry. The OpenAI Applied AI Engineering JD is essentially "build production RAG and agentic systems". The interview bar is no longer "did you call a vector DB" — it is "did you compare BM25 vs dense vs hybrid vs ColBERT, did you re-rank, did you measure with RAGAS, did you handle long-context tradeoffs, did you build observability".

Concepts

Embedding models for retrieval: sentence-transformers, E5, BGE, Cohere embed, OpenAI text-embedding-3
Vector index types: flat, IVF, HNSW, PQ, IVF-PQ tradeoffs
Vector DBs: FAISS (library), Qdrant, Weaviate, pgvector, Milvus
Sparse retrieval: BM25, TF-IDF
Hybrid retrieval: RRF (reciprocal rank fusion), weighted sum
Re-ranking: cross-encoders (BGE-reranker), ColBERT (late interaction)
Chunking: fixed-size, sentence, recursive, semantic, late-chunking
Query rewriting / HyDE / multi-query
RAG evaluation: RAGAS (faithfulness, answer relevance, context precision/recall)
Agents: ReAct loop, tool use, function calling
Structured outputs: JSON schema, constrained decoding (Outlines, lm-format-enforcer, OpenAI structured outputs)
Long-context vs RAG tradeoff

Labs

Lab 01 — Embeddings & Vector Search Fundamentals

Field	Value
Goal	Build a FAISS-backed semantic search pipeline; compare 3 embedding models.
Concepts	Embedding choice tradeoffs (dim, latency, quality), FAISS index types, normalization.
Steps	1) Embed a 50k-document corpus with `bge-small`, `bge-large`, `text-embedding-3-small`. 2) Build flat + HNSW indices in FAISS. 3) Run query benchmarks — recall vs latency. 4) Plot tradeoffs.
Stack	FAISS, sentence-transformers, OpenAI API (optional), `datasets`
Datasets	`BeIR/scifact` (5k docs) or `ms_marco` (100k passages slice)
Output	Recall@10 vs query-latency curves for 3 models × 2 index types.
How to Test	Use BeIR's labeled qrels; compute NDCG@10.
Talking Points	Why HNSW dominates production. PQ for memory-bound deployments. The dim-vs-quality curve.
Resume Bullet	"Benchmarked 3 embedding models × 2 FAISS index types on BeIR/SciFact (NDCG@10), producing reproducible recall-vs-latency tradeoff curves."
Extensions	Add Qdrant (production-style); add Matryoshka embeddings.

Lab 02 — Production RAG Pipeline (End-to-End)

Field	Value
Goal	Build a RAG system over a real corpus with proper chunking, retrieval, prompting, and citations.
Concepts	Chunking strategy, prompt engineering for grounded answers, citation extraction, hallucination mitigation.
Steps	1) Pick a corpus (your company docs, PubMed abstracts, EU AI Act). 2) Recursive chunking with overlap. 3) Embed + index (Qdrant). 4) Retrieval → context formatting → answer generation with citations. 5) Streaming response via SSE. 6) Wrap in FastAPI.
Stack	Qdrant, sentence-transformers / OpenAI embeddings, FastAPI, SSE, Llama-3-8B (local) or hosted
Datasets	EU AI Act PDFs, PubMed open subset, your own
Output	A working `/query` endpoint that returns answers with chunk-level citations.
How to Test	30 hand-crafted Q&A pairs; faithfulness evaluated manually + with RAGAS in Lab 4.
Talking Points	Chunking-strategy tradeoffs. Why citations matter (auditability). Streaming vs full response.
Resume Bullet	"Built a production RAG service over a 12k-document corpus with recursive chunking, Qdrant HNSW retrieval, streaming generation, and chunk-level citations exposed via FastAPI + SSE."
Extensions	Add per-user namespaces; add document-update reindexing.

Lab 03 — Hybrid Retrieval + Re-Ranking

Field	Value
Goal	Beat dense-only retrieval by combining BM25 + dense + a cross-encoder re-ranker.
Concepts	RRF, weighted fusion, cross-encoder re-ranking math, latency budget.
Steps	1) Add BM25 (rank_bm25 or Pyserini) to Lab 2's pipeline. 2) Implement RRF fusion. 3) Add `BAAI/bge-reranker-base` cross-encoder over top 100 → top 10. 4) Measure NDCG@10 across (dense / BM25 / hybrid / hybrid+rerank).
Stack	rank_bm25, sentence-transformers (CrossEncoder), Qdrant
Datasets	Same as Lab 1/2
Output	A retrieval-quality table; updated production pipeline.
How to Test	NDCG@10 hybrid+rerank > dense-only by ≥ 5 points.
Talking Points	Why BM25 is still the best baseline (lexical match for proper nouns). Why re-rankers are slow (full cross-attention) — only over top-K. ColBERT as a middle ground.
Resume Bullet	"Augmented dense retrieval with BM25 + RRF fusion + BGE cross-encoder re-ranking, lifting NDCG@10 from 0.41 to 0.58 on BeIR/SciFact at 38ms additional P99 latency."
Extensions	Implement ColBERT late-interaction; add query expansion (HyDE).

Lab 04 — Agents, Tool Use, Structured Output

Field	Value
Goal	Build an agent that uses 3+ tools (RAG, calculator, web search) with reliable structured output.
Concepts	ReAct loop, function calling, JSON-schema constrained decoding, tool registry, max-iterations safety.
Steps	1) Define 3 tools: `search_docs(query)`, `calculator(expr)`, `fetch_url(url)`. 2) Implement ReAct loop manually (no LangChain magic). 3) Use OpenAI function-calling format OR Outlines for constrained output. 4) Add iteration cap + tool-error handling. 5) Trace every tool call to a JSON log.
Stack	OpenAI / Anthropic / local model with function calling; Outlines or lm-format-enforcer
Output	A CLI agent that can answer "What's the GDP per capita of France divided by the population of Paris?" using tools.
How to Test	10 multi-step tasks; success rate measured.
Talking Points	Why constrained decoding > regex parsing JSON. Why agents fail (compounding errors, infinite loops). When NOT to use an agent.
Resume Bullet	"Implemented a ReAct-style tool-using agent (RAG + calculator + web fetch) with JSON-schema constrained decoding, full per-call tracing, and bounded iteration; 8/10 success on multi-hop reasoning evals."
Extensions	Add memory (per-session conversation store); add planning step (decompose-then-execute).

Deliverables Checklist

FAISS embedding-model benchmark
Production RAG service with citations + streaming
Hybrid retrieval + re-ranking with quality lift report
Tool-using agent with constrained outputs

Interview Relevance

"Design a RAG system for 100M docs at 1k QPS" (system design — see system-design/)
"How do you evaluate RAG quality?"
"Compare BM25, dense, hybrid"
"How would you build an agent reliably?"