Capstone 02 — Production RAG Service
Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐☆ | Time: 1–2 weeks
Demonstrates you can ship a real, deployable RAG system — not a notebook demo. Includes hybrid search, reranking, evals, observability, and a UI.
Goals
- Index a real corpus of 5k–50k documents (e.g., arXiv ML papers, your company's docs, a Wikipedia dump).
- Ship a FastAPI service with streaming SSE responses and inline citations.
- Use hybrid retrieval (dense + BM25, reciprocal rank fusion) and a cross-encoder reranker.
- Evaluate with RAGAS and report faithfulness, context-precision, answer-relevancy.
- Provide a Streamlit UI for human evaluation and demo.
- Containerize with Docker Compose: API + Qdrant + UI.
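For the streaming goal, here is a minimal sketch of SSE framing using only the standard library. The event names (`token`, `citations`, `done`) and the payload fields are assumptions made for illustration, not something this brief prescribes. In FastAPI, a generator like this would typically be wrapped in a `StreamingResponse` with media type `text/event-stream`.

```python
import json
from typing import Dict, Iterable, Iterator, List


def sse_event(event: str, data: Dict) -> str:
    """Format one Server-Sent Event: event name, one JSON data line, blank line."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


def stream_answer(tokens: Iterable[str], citations: List[Dict]) -> Iterator[str]:
    """Yield one event per token, then the citation list, then a done marker."""
    for tok in tokens:
        yield sse_event("token", {"text": tok})
    yield sse_event("citations", {"sources": citations})
    yield sse_event("done", {})


if __name__ == "__main__":
    # Hypothetical tokens and citation payload, standing in for real LLM output.
    demo = stream_answer(["RRF ", "fuses ", "ranks [1]."], [{"id": 1, "doc": "paper-123"}])
    print("".join(demo), end="")
```

Sending citations as a separate event after the tokens lets the UI render the citation panel without parsing the answer text; that ordering is a design choice of this sketch, not a requirement.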
Architecture
┌──────────────┐   ┌────────────────────┐   ┌────────────────┐
│ Streamlit UI │──▶│ FastAPI gateway    │──▶│ vLLM / OpenAI  │
└──────────────┘   │ - SSE streaming    │   │ (LLM backend)  │
                   │ - hybrid retrieval │   └────────────────┘
                   │ - reranker         │
                   └────────┬───────────┘
                            │
             ┌──────────────┼──────────────┐
             ▼              ▼              ▼
        ┌──────────┐   ┌──────────┐   ┌──────────────┐
        │ Qdrant   │   │ BM25     │   │ bge-reranker │
        │ (dense)  │   │ (sparse) │   │ (cross-enc)  │
        └──────────┘   └──────────┘   └──────────────┘
             │
             ▼
┌──────────────────────────────┐
│ Ingestion pipeline (Phase 7) │
│ - chunk → embed → upsert     │
└──────────────────────────────┘
Observability: OpenTelemetry → console / Jaeger
Eval: RAGAS over 100 (question, ground-truth) pairs
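To make the observability line concrete, the sketch below records per-stage wall-clock latency and summarizes it as p50/p95/p99. It is a standard-library stand-in, not OpenTelemetry: `StageTimer` is a hypothetical helper, and in the real service each `with` block would correspond to a span exported to the console or Jaeger.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles
from typing import Dict


class StageTimer:
    """Collects wall-clock durations in milliseconds per pipeline stage."""

    def __init__(self) -> None:
        self.samples = defaultdict(list)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.samples[name].append((time.perf_counter() - start) * 1000.0)

    def percentiles(self, name: str) -> Dict[str, float]:
        """Return p50/p95/p99 for one stage; needs at least two samples."""
        cuts = quantiles(self.samples[name], n=100, method="inclusive")
        return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


if __name__ == "__main__":
    timer = StageTimer()
    for _ in range(20):
        with timer.stage("retrieve"):
            time.sleep(0.001)  # placeholder for the real retrieval call
    print(timer.percentiles("retrieve"))
```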
Suggested Stack
| Component | Choice |
|---|---|
| Embeddings | BAAI/bge-small-en-v1.5 (384d, normalized) |
| Vector DB | Qdrant (HNSW + cosine) |
| Sparse retrieval | rank_bm25 |
| Reranker | BAAI/bge-reranker-base (cross-encoder) |
| LLM | local vLLM (Llama-3-8B) or OpenAI-compatible |
| API | FastAPI + SSE |
| UI | Streamlit |
| Eval | RAGAS (faithfulness, context-precision, answer-relevancy) |
| Observability | OpenTelemetry traces |
| Deploy | Docker Compose (API + Qdrant + UI) |
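The table pairs normalized embeddings with a cosine index. As a small illustration of why that pairing is convenient, the sketch below checks that cosine similarity on raw vectors equals the plain dot product once both vectors are L2-normalized. The two-dimensional vectors are toy values, not real 384-dimensional embeddings.

```python
import math
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


if __name__ == "__main__":
    a, b = [3.0, 4.0], [4.0, 3.0]
    # Both values agree up to floating-point rounding.
    print(cosine(a, b), dot(l2_normalize(a), l2_normalize(b)))
```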
Deliverables Checklist
- ingest.py: chunk + embed + index pipeline (token-aware chunks, 400 tokens, 80 overlap)
- retrieve.py: hybrid dense + BM25, RRF fusion, then cross-encoder rerank to top-5
- serve.py: FastAPI with /chat (SSE), /health, /metrics
- ui/app.py: Streamlit demo with citation panel
- eval/ragas_eval.py: runs RAGAS on a curated 100-question eval set
- evalset.jsonl: 100 (question, ground-truth-answer, ground-truth-source) triples
- EVAL_REPORT.md: table of RAGAS scores; ablation: dense-only vs hybrid vs hybrid+rerank
- docker-compose.yml: one-command bring-up
- ARCHITECTURE.md: component diagram + sequence diagram for a query
- WRITEUP.md: choices, trade-offs, what failed first
- Live demo (Loom or screencast)
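The ingest.py item calls for token-aware chunks of 400 tokens with 80 tokens of overlap. A minimal sketch of that sliding window follows. It operates on an already-tokenized sequence; the whitespace split in the demo is only a placeholder, and a real pipeline would presumably use the embedding model's own tokenizer so that chunk lengths match what the model sees.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence[str], size: int = 400, overlap: int = 80) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


if __name__ == "__main__":
    text = "word " * 1000  # placeholder document
    pieces = chunk_tokens(text.split())
    print([len(p) for p in pieces])
```

The early `break` avoids emitting a trailing chunk that is entirely contained in the previous one.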
Resume Bullet Pattern
Built and shipped a production RAG service over 25k arXiv ML papers achieving 0.84 faithfulness on RAGAS via hybrid (dense + BM25) retrieval, cross-encoder reranking, and SSE-streamed citations; containerized with Docker Compose; <300ms median TTFT. [demo + repo]
Interview Talking Points
- Chunking strategy: token-aware, overlap, structural awareness. When you'd use parent-document retrieval.
- Hybrid retrieval & RRF: how reciprocal rank fusion combines incomparable scores; tunable weighting.
- Reranker tradeoffs: cross-encoder latency vs precision; when to skip reranking.
- Hallucination mitigation: system prompt design, refusal clauses, citation grounding.
- Eval methodology: why RAGAS, what each metric captures, where its scores can mislead.
- Streaming SSE vs WebSockets: why SSE for LLM streaming.
- Observability: latency p50/p95/p99 per stage (retrieval, rerank, LLM).
- What you'd add at 10× scale: query rewriting (HyDE), multi-hop, semantic caching, learning-to-rank.
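For the hybrid retrieval talking point, it helps to have the fusion rule in hand: RRF uses only rank positions, so BM25 scores and cosine similarities never need to share a scale. The sketch below scores each document as the sum over rankers of w / (k + rank), with ranks starting at 1. The per-ranker `weights` argument is one assumed way to expose "tunable weighting", and `rrf_fuse` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60,
             weights: Optional[Sequence[float]] = None) -> List[str]:
    """Fuse ranked doc-ID lists by weighted reciprocal rank; best document first."""
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("need exactly one weight per ranking")
    scores: Dict[str, float] = defaultdict(float)
    for ranking, w in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += w / (k + rank)
    # Highest fused score first; ties broken by doc ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


if __name__ == "__main__":
    dense = ["a", "b", "c"]  # hypothetical dense-retrieval order
    bm25 = ["b", "d", "a"]   # hypothetical BM25 order
    print(rrf_fuse([dense, bm25]))
```

A document that appears in both lists, even at modest ranks, tends to beat one that appears high in only a single list, which is the behavior hybrid search relies on.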
Getting Started
- Pick your corpus. The arXiv ML papers dataset on Hugging Face is the easy default; your own docs are higher signal.
- Run Phase-7 lab-02 first end-to-end. Convince yourself the basic pipeline works.
- Add BM25 alongside Qdrant; combine the two rankings with RRF (k=60 is the commonly used default).
- Add the reranker as a post-processing step on top-20 → top-5.
- Build the eval set: 100 questions you (or a colleague) can ground-truth. Mix factual, multi-hop, "not in corpus".
- Run RAGAS for each retrieval variant (dense, hybrid, hybrid+rerank); record numbers.
- Add OpenTelemetry traces for each request: trace ID propagated through retrieve → rerank → LLM.
- Write the Streamlit UI last — it's mostly glue.
- Compose it all in Docker. Verify cold-start works on a fresh machine.
- Record a demo. Most hiring managers will not run your code; they will watch the video.
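The rerank step above (top-20 → top-5) can be sketched as a thin wrapper around any pairwise scorer. Here `score_fn` is a pluggable assumption: in the real pipeline it would be the cross-encoder's batch scoring call over (query, passage) pairs, while the toy `overlap_score` below exists only so the example runs without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(query: str, candidates: Sequence[str],
           score_fn: Callable[[List[Pair]], List[float]],
           pool: int = 20, top_k: int = 5) -> List[str]:
    """Score the first `pool` fused candidates against the query and keep the best `top_k`."""
    shortlist = list(candidates[:pool])
    scores = score_fn([(query, passage) for passage in shortlist])
    # Stable sort: passages with equal scores keep their fused order.
    order = sorted(range(len(shortlist)), key=lambda i: scores[i], reverse=True)
    return [shortlist[i] for i in order[:top_k]]


def overlap_score(pairs: List[Pair]) -> List[float]:
    """Toy stand-in for a cross-encoder: number of query words found in the passage."""
    out = []
    for query, passage in pairs:
        words = set(passage.lower().split())
        out.append(float(sum(w in words for w in query.lower().split())))
    return out


if __name__ == "__main__":
    fused = ["cats sleep", "rank fusion basics", "reciprocal rank fusion explained", "dogs bark"]
    print(rerank("reciprocal rank fusion", fused, overlap_score, top_k=2))
```

Capping the pool at 20 bounds reranker latency, since a cross-encoder runs one forward pass per (query, passage) pair.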