Capstone 02 — Production RAG Service

Phase: 11 — Capstone | Difficulty: ⭐⭐⭐⭐☆ | Time: 1–2 weeks

This capstone demonstrates that you can ship a real, deployable RAG system rather than a notebook demo. It includes hybrid search, reranking, evals, observability, and a UI.


Goals

  1. Index a real corpus of 5–50k documents (e.g., arXiv ML papers, your company's docs, a Wikipedia dump).
  2. Ship a FastAPI service with streaming SSE responses and inline citations.
  3. Use hybrid retrieval (dense + BM25, reciprocal rank fusion) and a cross-encoder reranker.
  4. Evaluate with RAGAS and report faithfulness, context-precision, answer-relevancy.
  5. Provide a Streamlit UI for human evaluation and demo.
  6. Containerize with Docker Compose: API + Qdrant + UI.
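
As a rough illustration of goal 2, the sketch below shows how Server-Sent Events could be framed for a streamed answer followed by its citations. The event names (`token`, `done`) and the payload keys are assumptions made for this example, not a fixed contract; only the `event:` / `data:` / blank-line framing comes from the SSE format itself.

```python
import json
from typing import Iterable, Iterator, List, Optional


def sse_event(data: dict, event: Optional[str] = None) -> str:
    """Frame one Server-Sent Event: optional 'event:' line, one 'data:' line, blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on a single data: line.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


def stream_answer(tokens: Iterable[str], citations: List[dict]) -> Iterator[str]:
    """Yield one event per generated token, then a final event carrying the citations."""
    for tok in tokens:
        yield sse_event({"text": tok}, event="token")  # hypothetical event name
    yield sse_event({"citations": citations}, event="done")  # hypothetical event name
```

In a FastAPI handler, a generator like this could be wrapped in `StreamingResponse(..., media_type="text/event-stream")`; the UI then appends `token` events as they arrive and renders the citation panel on `done`.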

Architecture

   ┌──────────────┐   ┌────────────────────┐   ┌────────────────┐
   │ Streamlit UI │──▶│  FastAPI gateway   │──▶│ vLLM / OpenAI  │
   └──────────────┘   │  - SSE streaming   │   │ (LLM backend)  │
                      │  - hybrid retrieval│   └────────────────┘
                      │  - reranker        │
                      └────────┬───────────┘
                               │
                ┌──────────────┼──────────────┐
                ▼              ▼              ▼
         ┌──────────┐   ┌──────────┐   ┌──────────────┐
         │ Qdrant   │   │ BM25     │   │ bge-reranker │
         │ (dense)  │   │ (sparse) │   │ (cross-enc)  │
         └──────────┘   └──────────┘   └──────────────┘
                               │
                               ▼
                ┌──────────────────────────────┐
                │ Ingestion pipeline (Phase 7) │
                │  - chunk → embed → upsert    │
                └──────────────────────────────┘

  Observability: OpenTelemetry → console / Jaeger
  Eval: RAGAS over 100 (question, ground-truth) pairs

Suggested Stack

  Component          Choice
  -----------------  ----------------------------------------------------
  Embeddings         BAAI/bge-small-en-v1.5 (384d, normalized)
  Vector DB          Qdrant (HNSW + cosine)
  Sparse retrieval   rank_bm25
  Reranker           BAAI/bge-reranker-base (cross-encoder)
  LLM                local vLLM (Llama-3-8B) or OpenAI-compatible
  API                FastAPI + SSE
  UI                 Streamlit
  Eval               RAGAS (faithfulness, context-recall, answer-relevancy)
  Observability      OpenTelemetry traces
  Deploy             Docker Compose (API + Qdrant + UI)

Deliverables Checklist

  • ingest.py — chunk + embed + index pipeline (token-aware chunks, 400 tokens, 80 overlap)
  • retrieve.py — hybrid dense + BM25, RRF fusion, then cross-encoder rerank to top-5
  • serve.py — FastAPI with /chat (SSE), /health, /metrics
  • ui/app.py — Streamlit demo with citation panel
  • eval/ragas_eval.py — runs RAGAS on a curated 100-question eval set
  • evalset.jsonl — 100 (question, ground-truth-answer, ground-truth-source) triples
  • EVAL_REPORT.md — table of RAGAS scores; ablation: dense-only vs hybrid vs hybrid+rerank
  • docker-compose.yml — one-command bring-up
  • ARCHITECTURE.md — component diagram + sequence diagram for a query
  • WRITEUP.md — choices, trade-offs, what failed first
  • Live demo (Loom or other screencast)
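
The chunking parameters listed for ingest.py (400 tokens, 80 overlap) amount to a sliding window with a stride of 320 tokens. A minimal sketch follows, assuming the text has already been tokenized. The function name `chunk_tokens` is hypothetical, and in a real pipeline the tokens would come from the embedding model's own tokenizer rather than a whitespace split.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 400, overlap: int = 80) -> List[List[str]]:
    """Split a token list into windows of `size` tokens sharing `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap  # 320 with the defaults
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a window reaches the end, so no chunk is a pure subset of the last one.
        if start + size >= len(tokens):
            break
    return chunks
```

For a quick smoke test, `chunk_tokens(text.split())` works with whitespace tokens as a stand-in; token counts will differ from what the embedding model actually sees, so treat that only as a placeholder.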

Resume Bullet Pattern

Built and shipped a production RAG service over 25k arXiv ML papers achieving 0.84 faithfulness on RAGAS via hybrid (dense + BM25) retrieval, cross-encoder reranking, and SSE-streamed citations; containerized with Docker Compose; <300ms median TTFT. [demo + repo]


Interview Talking Points

  • Chunking strategy: token-aware, overlap, structural awareness. When you'd use parent-document retrieval.
  • Hybrid retrieval & RRF: how reciprocal rank fusion combines incomparable scores; tunable weighting.
  • Reranker tradeoffs: cross-encoder latency vs precision; when to skip reranking.
  • Hallucination mitigation: system prompt design, refusal clauses, citation grounding.
  • Eval methodology: why RAGAS, what each metric captures, where it lies.
  • Streaming SSE vs WebSockets: why SSE for LLM streaming.
  • Observability: latency p50/p95/p99 per stage (retrieval, rerank, LLM).
  • What you'd add at 10× scale: query rewriting (HyDE), multi-hop, semantic caching, learning-to-rank.
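
To back the observability talking point with numbers, per-stage latencies can be collected in-process and summarized as p50/p95/p99. The sketch below uses only the standard library; the `StageTimer` class and the stage names are hypothetical, and in the full service the same measurements would more likely be attached to OpenTelemetry spans.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from statistics import quantiles
from typing import Dict, Iterator, List


class StageTimer:
    """Collects wall-clock latencies (in milliseconds) per named pipeline stage."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[float]] = defaultdict(list)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the sample even if the stage raised.
            self.samples[name].append((time.perf_counter() - start) * 1000.0)

    def percentiles(self, name: str) -> Dict[str, float]:
        # quantiles(n=100) returns 99 cut points; index i - 1 is the i-th percentile.
        # It needs at least two samples on older Python versions.
        cuts = quantiles(self.samples[name], n=100, method="inclusive")
        return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Usage would look like `with timer.stage("rerank"): ...` around each step of a request, followed by `timer.percentiles("rerank")` once enough requests have been served.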

Getting Started

  1. Pick your corpus. arXiv ML papers (HuggingFace dataset) is the easy default; your own docs are higher signal.
  2. Run Phase-7 lab-02 first end-to-end. Convince yourself the basic pipeline works.
  3. Add BM25 alongside Qdrant; combine with RRF (k=60 is the commonly used default constant).
  4. Add the reranker as a post-processing step on top-20 → top-5.
  5. Build the eval set: 100 questions you (or a colleague) can ground-truth. Mix factual, multi-hop, and "not in corpus" questions.
  6. Run RAGAS for each retrieval variant (dense, hybrid, hybrid+rerank); record numbers.
  7. Add OpenTelemetry traces for each request: trace ID propagated through retrieve → rerank → LLM.
  8. Write the Streamlit UI last — it's mostly glue.
  9. Compose it all in Docker. Verify cold-start works on a fresh machine.
  10. Record a demo. Most hiring managers will not run your code; they will watch the video.
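
Steps 3 and 4 above can be sketched in a few lines. Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over the ranked lists it appears in, so only ranks matter and the dense and BM25 scores never need to be put on a common scale. The function names and the `score_fn` callable are hypothetical; `score_fn` stands in for whatever cross-encoder is used (for example, a wrapper around a bge-reranker model).

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of doc ids: score(d) = sum of 1 / (k + rank), with rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float],  # hypothetical: (query, doc_id) -> relevance
    fetch_k: int = 20,
    top_k: int = 5,
) -> List[str]:
    """Score only the first `fetch_k` fused candidates and keep the best `top_k`."""
    pool = candidates[:fetch_k]
    return sorted(pool, key=lambda d: score_fn(query, d), reverse=True)[:top_k]
```

A query would then flow as `rerank(query, rrf_fuse([dense_ids, bm25_ids]), score_fn)`. Capping the reranker input at 20 candidates keeps cross-encoder latency bounded, since it runs one forward pass per (query, passage) pair.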