Phase 7 — RAG, Retrieval, Agents

Difficulty: ⭐⭐⭐⭐☆ | Estimated Time: 2 weeks Roles supported: Applied AI Engineer (OpenAI-style), LLM Inference Engineer, ML Systems Engineer.


Why This Phase Exists

RAG is the most-deployed LLM pattern in industry. The OpenAI Applied AI Engineering JD is essentially "build production RAG and agentic systems". The interview bar is no longer "did you call a vector DB" — it is "did you compare BM25 vs dense vs hybrid vs ColBERT, did you re-rank, did you measure with RAGAS, did you handle long-context tradeoffs, did you build observability".


Concepts

  • Embedding models for retrieval: sentence-transformers, E5, BGE, Cohere embed, OpenAI text-embedding-3
  • Vector index types: flat, IVF, HNSW, PQ, IVF-PQ tradeoffs
  • Vector DBs: FAISS (library), Qdrant, Weaviate, pgvector, Milvus
  • Sparse retrieval: BM25, TF-IDF
  • Hybrid retrieval: RRF (reciprocal rank fusion), weighted sum
  • Re-ranking: cross-encoders (BGE-reranker), ColBERT (late interaction)
  • Chunking: fixed-size, sentence, recursive, semantic, late-chunking
  • Query rewriting / HyDE / multi-query
  • RAG evaluation: RAGAS (faithfulness, answer relevance, context precision/recall)
  • Agents: ReAct loop, tool use, function calling
  • Structured outputs: JSON schema, constrained decoding (Outlines, lm-format-enforcer, OpenAI structured outputs)
  • Long-context vs RAG tradeoff

Labs

Lab 01 — Embeddings & Vector Search Fundamentals

FieldValue
GoalBuild a FAISS-backed semantic search pipeline; compare 3 embedding models.
ConceptsEmbedding choice tradeoffs (dim, latency, quality), FAISS index types, normalization.
Steps1) Embed a 50k-document corpus with bge-small, bge-large, text-embedding-3-small. 2) Build flat + HNSW indices in FAISS. 3) Run query benchmarks — recall vs latency. 4) Plot tradeoffs.
StackFAISS, sentence-transformers, OpenAI API (optional), datasets
DatasetsBeIR/scifact (5k docs) or ms_marco (100k passages slice)
OutputRecall@10 vs query-latency curves for 3 models × 2 index types.
How to TestUse BeIR's labeled qrels; compute NDCG@10.
Talking PointsWhy HNSW dominates production. PQ for memory-bound deployments. The dim-vs-quality curve.
Resume Bullet"Benchmarked 3 embedding models × 2 FAISS index types on BeIR/SciFact (NDCG@10), producing reproducible recall-vs-latency tradeoff curves."
ExtensionsAdd Qdrant (production-style); add Matryoshka embeddings.

Lab 02 — Production RAG Pipeline (End-to-End)

FieldValue
GoalBuild a RAG system over a real corpus with proper chunking, retrieval, prompting, and citations.
ConceptsChunking strategy, prompt engineering for grounded answers, citation extraction, hallucination mitigation.
Steps1) Pick a corpus (your company docs, PubMed abstracts, EU AI Act). 2) Recursive chunking with overlap. 3) Embed + index (Qdrant). 4) Retrieval → context formatting → answer generation with citations. 5) Streaming response via SSE. 6) Wrap in FastAPI.
StackQdrant, sentence-transformers / OpenAI embeddings, FastAPI, SSE, Llama-3-8B (local) or hosted
DatasetsEU AI Act PDFs, PubMed open subset, your own
OutputA working /query endpoint that returns answers with chunk-level citations.
How to Test30 hand-crafted Q&A pairs; faithfulness evaluated manually + with RAGAS in Lab 4.
Talking PointsChunking-strategy tradeoffs. Why citations matter (auditability). Streaming vs full response.
Resume Bullet"Built a production RAG service over a 12k-document corpus with recursive chunking, Qdrant HNSW retrieval, streaming generation, and chunk-level citations exposed via FastAPI + SSE."
ExtensionsAdd per-user namespaces; add document-update reindexing.

Lab 03 — Hybrid Retrieval + Re-Ranking

FieldValue
GoalBeat dense-only retrieval by combining BM25 + dense + a cross-encoder re-ranker.
ConceptsRRF, weighted fusion, cross-encoder re-ranking math, latency budget.
Steps1) Add BM25 (rank_bm25 or Pyserini) to Lab 2's pipeline. 2) Implement RRF fusion. 3) Add BAAI/bge-reranker-base cross-encoder over top 100 → top 10. 4) Measure NDCG@10 across (dense / BM25 / hybrid / hybrid+rerank).
Stackrank_bm25, sentence-transformers (CrossEncoder), Qdrant
DatasetsSame as Lab 1/2
OutputA retrieval-quality table; updated production pipeline.
How to TestNDCG@10 hybrid+rerank > dense-only by ≥ 5 points.
Talking PointsWhy BM25 is still the best baseline (lexical match for proper nouns). Why re-rankers are slow (full cross-attention) — only over top-K. ColBERT as a middle ground.
Resume Bullet"Augmented dense retrieval with BM25 + RRF fusion + BGE cross-encoder re-ranking, lifting NDCG@10 from 0.41 to 0.58 on BeIR/SciFact at 38ms additional P99 latency."
ExtensionsImplement ColBERT late-interaction; add query expansion (HyDE).

Lab 04 — Agents, Tool Use, Structured Output

FieldValue
GoalBuild an agent that uses 3+ tools (RAG, calculator, web search) with reliable structured output.
ConceptsReAct loop, function calling, JSON-schema constrained decoding, tool registry, max-iterations safety.
Steps1) Define 3 tools: search_docs(query), calculator(expr), fetch_url(url). 2) Implement ReAct loop manually (no LangChain magic). 3) Use OpenAI function-calling format OR Outlines for constrained output. 4) Add iteration cap + tool-error handling. 5) Trace every tool call to a JSON log.
StackOpenAI / Anthropic / local model with function calling; Outlines or lm-format-enforcer
OutputA CLI agent that can answer "What's the GDP per capita of France divided by the population of Paris?" using tools.
How to Test10 multi-step tasks; success rate measured.
Talking PointsWhy constrained decoding > regex parsing JSON. Why agents fail (compounding errors, infinite loops). When NOT to use an agent.
Resume Bullet"Implemented a ReAct-style tool-using agent (RAG + calculator + web fetch) with JSON-schema constrained decoding, full per-call tracing, and bounded iteration; 8/10 success on multi-hop reasoning evals."
ExtensionsAdd memory (per-session conversation store); add planning step (decompose-then-execute).

Deliverables Checklist

  • FAISS embedding-model benchmark
  • Production RAG service with citations + streaming
  • Hybrid retrieval + re-ranking with quality lift report
  • Tool-using agent with constrained outputs

Interview Relevance

  • "Design a RAG system for 100M docs at 1k QPS" (system design — see system-design/)
  • "How do you evaluate RAG quality?"
  • "Compare BM25, dense, hybrid"
  • "How would you build an agent reliably?"