RAG is the most-deployed LLM pattern in industry. The OpenAI Applied AI Engineering JD is essentially "build production RAG and agentic systems". The interview bar is no longer "did you call a vector DB" — it is "did you compare BM25 vs dense vs hybrid vs ColBERT, did you re-rank, did you measure with RAGAS, did you handle long-context tradeoffs, did you build observability".
Build a FAISS-backed semantic search pipeline; compare 3 embedding models.
Concepts
Embedding choice tradeoffs (dim, latency, quality), FAISS index types, normalization.
Steps
1) Embed a 50k-document corpus with bge-small, bge-large, text-embedding-3-small. 2) Build flat + HNSW indices in FAISS. 3) Run query benchmarks — recall vs latency. 4) Plot tradeoffs.
Stack
FAISS, sentence-transformers, OpenAI API (optional), datasets
Datasets
BeIR/scifact (5k docs) or ms_marco (100k passages slice)
Output
Recall@10 vs query-latency curves for 3 models × 2 index types.
How to Test
Use BeIR's labeled qrels; compute NDCG@10.
Talking Points
Why HNSW dominates production. PQ for memory-bound deployments. The dim-vs-quality curve.
Resume Bullet
"Benchmarked 3 embedding models × 2 FAISS index types on BeIR/SciFact (NDCG@10), producing reproducible recall-vs-latency tradeoff curves."
1) Pick a corpus (your company docs, PubMed abstracts, EU AI Act). 2) Recursive chunking with overlap. 3) Embed + index (Qdrant). 4) Retrieval → context formatting → answer generation with citations. 5) Streaming response via SSE. 6) Wrap in FastAPI.
A working /query endpoint that returns answers with chunk-level citations.
How to Test
30 hand-crafted Q&A pairs; faithfulness evaluated manually + with RAGAS in Lab 4.
Talking Points
Chunking-strategy tradeoffs. Why citations matter (auditability). Streaming vs full response.
Resume Bullet
"Built a production RAG service over a 12k-document corpus with recursive chunking, Qdrant HNSW retrieval, streaming generation, and chunk-level citations exposed via FastAPI + SSE."
A retrieval-quality table; updated production pipeline.
How to Test
NDCG@10 hybrid+rerank > dense-only by ≥ 5 points.
Talking Points
Why BM25 is still the best baseline (lexical match for proper nouns). Why re-rankers are slow (full cross-attention) — only over top-K. ColBERT as a middle ground.
Resume Bullet
"Augmented dense retrieval with BM25 + RRF fusion + BGE cross-encoder re-ranking, lifting NDCG@10 from 0.41 to 0.58 on BeIR/SciFact at 38ms additional P99 latency."
1) Define 3 tools: search_docs(query), calculator(expr), fetch_url(url). 2) Implement ReAct loop manually (no LangChain magic). 3) Use OpenAI function-calling format OR Outlines for constrained output. 4) Add iteration cap + tool-error handling. 5) Trace every tool call to a JSON log.
Stack
OpenAI / Anthropic / local model with function calling; Outlines or lm-format-enforcer
Output
A CLI agent that can answer "What's the GDP per capita of France divided by the population of Paris?" using tools.
How to Test
10 multi-step tasks; success rate measured.
Talking Points
Why constrained decoding > regex parsing JSON. Why agents fail (compounding errors, infinite loops). When NOT to use an agent.
Resume Bullet
"Implemented a ReAct-style tool-using agent (RAG + calculator + web fetch) with JSON-schema constrained decoding, full per-call tracing, and bounded iteration; 8/10 success on multi-hop reasoning evals."