Capstone 02 — Production RAG Service

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐☆ | Time: 1–2 weeks

This capstone demonstrates that you can ship a real, deployable RAG system rather than a notebook demo. It includes hybrid search, reranking, evals, observability, and a UI.

Goals

Index a real corpus of 5–50k documents (e.g., arXiv ML papers, your company's docs, a Wikipedia dump).
Ship a FastAPI service with streaming SSE responses and inline citations.
Use hybrid retrieval (dense + BM25, reciprocal rank fusion) and a cross-encoder reranker.
Evaluate with RAGAS and report faithfulness, context-precision, answer-relevancy.
Provide a Streamlit UI for human evaluation and demo.
Containerize with Docker Compose: API + Qdrant + UI.
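
The streaming goal above depends on the Server-Sent Events wire format: each event is an optional `event:` line, a `data:` line, and a terminating blank line. The following is a minimal, stdlib-only sketch of that framing. The event names (`token`, `citations`, `done`) and the payload shapes are assumptions made for illustration, not part of the capstone spec.

```python
import json
from typing import Dict, Iterable, Iterator, List, Optional


def sse_event(data: dict, event: Optional[str] = None) -> str:
    """Serialize one SSE frame: optional 'event:' line, one 'data:' line, blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on a single data: line.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


def stream_answer(tokens: Iterable[str], citations: List[Dict]) -> Iterator[str]:
    """Yield one frame per generated token, then the citations, then an end marker.

    The event names used here are hypothetical; pick whatever the UI expects.
    """
    for tok in tokens:
        yield sse_event({"text": tok}, event="token")
    yield sse_event({"citations": citations}, event="citations")
    yield sse_event({}, event="done")
```

In a FastAPI handler, a generator like `stream_answer` could be wrapped in `StreamingResponse(..., media_type="text/event-stream")`. Sending citations as a separate event after the tokens is one possible design; interleaving citation markers in the token stream is another.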

Architecture

   ┌──────────────┐   ┌────────────────────┐   ┌────────────────┐
   │ Streamlit UI │──▶│  FastAPI gateway   │──▶│ vLLM / OpenAI  │
   └──────────────┘   │  - SSE streaming   │   │ (LLM backend)  │
                      │  - hybrid retrieval│   └────────────────┘
                      │  - reranker        │
                      └────────┬───────────┘
                               │
                ┌──────────────┼──────────────┐
                ▼              ▼              ▼
         ┌──────────┐   ┌──────────┐   ┌──────────────┐
         │ Qdrant   │   │ BM25     │   │ bge-reranker │
         │ (dense)  │   │ (sparse) │   │ (cross-enc)  │
         └──────────┘   └──────────┘   └──────────────┘
                               │
                               ▼
                ┌──────────────────────────────┐
                │ Ingestion pipeline (Phase 7) │
                │  - chunk → embed → upsert    │
                └──────────────────────────────┘

  Observability: OpenTelemetry → console / Jaeger
  Eval: RAGAS over 100 (question, ground-truth) pairs

Suggested Stack

Component	Choice
Embeddings	`BAAI/bge-small-en-v1.5` (384d, normalized)
Vector DB	Qdrant (HNSW + cosine)
Sparse retrieval	`rank_bm25`
Reranker	`BAAI/bge-reranker-base` (cross-encoder)
LLM	local vLLM (Llama-3-8B) or OpenAI-compatible
API	FastAPI + SSE
UI	Streamlit
Eval	RAGAS (faithfulness, context-precision, answer-relevancy)
Observability	OpenTelemetry traces
Deploy	Docker Compose (API + Qdrant + UI)

Deliverables Checklist

ingest.py — chunk + embed + index pipeline (token-aware chunks, 400 tokens, 80 overlap)
retrieve.py — hybrid dense + BM25, RRF fusion, then cross-encoder rerank to top-5
serve.py — FastAPI with /chat (SSE), /health, /metrics
ui/app.py — Streamlit demo with citation panel
eval/ragas_eval.py — runs RAGAS on a curated 100-question eval set
evalset.jsonl — 100 (question, ground-truth-answer, ground-truth-source) triples
EVAL_REPORT.md — table of RAGAS scores; ablation: dense-only vs hybrid vs hybrid+rerank
docker-compose.yml — one-command bring-up
ARCHITECTURE.md — component diagram + sequence diagram for a query
WRITEUP.md — choices, trade-offs, what failed first
Live demo (Loom or screencast)
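
The chunking parameters listed for ingest.py (400 tokens, 80 overlap) describe a sliding window with a stride of 320 tokens. A minimal sketch of that windowing follows. It operates on an already-tokenized sequence; which tokenizer produces the tokens is left open here, and the function name `chunk_tokens` is hypothetical.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, size: int = 400, overlap: int = 80) -> List[list]:
    """Split a token sequence into windows of `size` tokens.

    Consecutive windows share `overlap` tokens, so the stride is size - overlap.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(list(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

For a quick experiment, `chunk_tokens(text.split())` uses whitespace words as a stand-in for tokens. A real pipeline would more likely count tokens with the embedding model's own tokenizer so that chunks stay within the model's input limit.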

Resume Bullet Pattern

Built and shipped a production RAG service over 25k arXiv ML papers achieving 0.84 faithfulness on RAGAS via hybrid (dense + BM25) retrieval, cross-encoder reranking, and SSE-streamed citations; containerized with Docker Compose; <300ms median TTFT. [demo + repo]

Interview Talking Points

Chunking strategy: token-aware, overlap, structural awareness. When you'd use parent-document retrieval.
Hybrid retrieval & RRF: how reciprocal rank fusion combines incomparable scores; tunable weighting.
Reranker tradeoffs: cross-encoder latency vs precision; when to skip reranking.
Hallucination mitigation: system prompt design, refusal clauses, citation grounding.
Eval methodology: why RAGAS, what each metric captures, where its scores can mislead.
Streaming SSE vs WebSockets: why SSE for LLM streaming.
Observability: latency p50/p95/p99 per stage (retrieval, rerank, LLM).
What you'd add at 10× scale: query rewriting (HyDE), multi-hop, semantic caching, learning-to-rank.
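
For the observability talking point, it helps to be able to show how per-stage p50/p95/p99 figures are derived from raw timings. The sketch below uses only the standard library. The stage names and the idea of collecting durations into plain lists are assumptions; in the actual service the durations would more likely come from trace spans.

```python
import statistics
from typing import Dict, List


def latency_percentiles(samples_ms: List[float]) -> Dict[str, float]:
    """Return p50/p95/p99 for one stage from a list of durations in milliseconds."""
    # n=100 yields 99 cut points; cuts[i - 1] is the i-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def summarize(stage_samples: Dict[str, List[float]]) -> Dict[str, Dict[str, float]]:
    """Compute the percentile summary for every stage (e.g. retrieval, rerank, llm)."""
    return {stage: latency_percentiles(s) for stage, s in stage_samples.items()}
```

Reporting the tail (p95/p99) per stage rather than a single end-to-end mean makes it easier to see whether retrieval, reranking, or generation dominates slow requests.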

Getting Started

Pick your corpus. arXiv ML papers (HuggingFace dataset) is the easy default; your own docs are higher signal.
Run Phase-7 lab-02 first end-to-end. Convince yourself the basic pipeline works.
Add BM25 alongside Qdrant; combine with RRF (k=60 is the standard constant).
Add the reranker as a post-processing step on top-20 → top-5.
Build the eval set: 100 questions you (or a colleague) can ground-truth. Mix factual, multi-hop, "not in corpus".
Run RAGAS for each retrieval variant (dense, hybrid, hybrid+rerank); record numbers.
Add OpenTelemetry traces for each request: trace ID propagated through retrieve → rerank → LLM.
Write the Streamlit UI last — it's mostly glue.
Compose it all in Docker. Verify cold-start works on a fresh machine.
Record a demo. Most hiring managers will not run your code; they will watch the video.
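
The fusion and reranking steps above can be sketched in a few lines. Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over the rankings it appears in, so only rank positions matter and the dense and BM25 scores never need to be on the same scale. In this sketch, `score_fn` is a hypothetical stand-in for the cross-encoder, and the document IDs are placeholders.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of doc ids with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(
    query: str,
    candidates: Sequence[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[str]:
    """Re-order candidates by a pairwise relevance score and keep the best top_n.

    score_fn(query, doc) stands in for a cross-encoder; it is an assumption here.
    """
    scored = sorted(candidates, key=lambda d: score_fn(query, d), reverse=True)
    return scored[:top_n]
```

A typical call chain under these assumptions would be `rerank(query, rrf_fuse([dense_ids, bm25_ids])[:20], score_fn, top_n=5)`, which mirrors the top-20 to top-5 cut described in the steps.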
