03 — RAG at Scale (100M docs, 1k QPS)

Roles: Applied AI Engineer · Search/RAG Engineer · LLM Infrastructure

1. Clarifying Questions

  • Corpus size & growth rate? Update frequency (hourly/daily/static)?
  • Query latency SLO? (Typical: e2e p95 < 1.5s, retrieval p95 < 100ms.)
  • Multi-tenant (per-tenant indices)? Permission filters?
  • Quality target? (Faithfulness, answer relevance via RAGAS.)

2. Capacity Estimation

  • 100M docs × ~5 chunks/doc = 500M chunks
  • Embedding dim 768 × 4 bytes = 3 KB/vector → 1.5 TB raw vectors
  • HNSW index (M=32): graph links alone add only ~0.3 KB/vector, but budget ≈ 2× raw for payload, deletes, and headroom → ~3 TB → shard across nodes
  • 1k QPS × top-50 retrieval × HNSW (~10ms cold) ≈ 10 searches in flight; ~3 TB of index at ~300 GB RAM/node → ~10 search nodes minimum (before replicas)
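
The arithmetic above can be reproduced as a short script. This is a back-of-envelope sketch only: the constants are the figures from this section, the node count is the estimate above, and the in-flight number uses Little's law while ignoring shard fan-out and replicas (an assumption, not a benchmark).

```python
# Back-of-envelope sizing with the figures from this section.
DOCS = 100_000_000
CHUNKS_PER_DOC = 5
DIM = 768
BYTES_PER_FLOAT = 4        # float32
INDEX_OVERHEAD = 2.0       # index size as a multiple of raw vectors (budget figure)
QPS = 1_000
SEARCH_LATENCY_S = 0.010   # ~10 ms per HNSW search
NODES = 10                 # node count from the estimate above

chunks = DOCS * CHUNKS_PER_DOC
bytes_per_vector = DIM * BYTES_PER_FLOAT
raw_tb = chunks * bytes_per_vector / 1e12
index_tb = raw_tb * INDEX_OVERHEAD
ram_per_node_gb = index_tb * 1e3 / NODES

# Little's law: concurrency = arrival rate x time in system.
in_flight = QPS * SEARCH_LATENCY_S

print(f"chunks:            {chunks:,}")
print(f"bytes per vector:  {bytes_per_vector}")
print(f"raw vectors:       {raw_tb:.2f} TB")
print(f"index budget:      {index_tb:.2f} TB")
print(f"RAM per node:      {ram_per_node_gb:.0f} GB across {NODES} nodes")
print(f"searches in flight: {in_flight:.0f}")
```

At these numbers memory, not query concurrency, is what forces the node count, which is why quantization or on-disk indexes are the usual next question.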

3. Architecture

Query ─► [API] ─► [Hybrid Retriever]
                       ├── BM25 (Elastic/OpenSearch)
                       └── Vector (Qdrant/Vespa/Milvus, sharded HNSW)
                  └─► [Reranker (cross-encoder)]
                  └─► [LLM (vLLM)]   ─► streamed answer + citations

4. Deep Dives

4.1 Indexing Pipeline

  • Ingest events on Kafka → workers chunk (token-aware, 200-400 tok with overlap)
  • Embed in batched workers (GPU pool, batch_size 64)
  • Upsert to vector store with metadata (tenant_id, doc_id, ACL hash)
  • BM25 index updated in parallel
  • Backfill via Spark job for full re-embeds when changing model
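
A minimal sketch of the chunk-then-batch steps above. It assumes the text is already tokenized (a real pipeline would use the embedding model's own tokenizer); `chunk_tokens`, `batches`, and the 300/50 defaults are illustrative names and values, not part of any library.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunk_tokens(tokens: Sequence[str], max_tokens: int = 300,
                 overlap: int = 50) -> Iterator[List[str]]:
    """Yield sliding windows of at most max_tokens, sharing `overlap` tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")
    step = max_tokens - overlap
    for start in range(0, len(tokens), step):
        yield list(tokens[start:start + max_tokens])
        if start + max_tokens >= len(tokens):
            break  # the last window already reached the end of the document


def batches(items: Sequence[T], batch_size: int = 64) -> Iterator[List[T]]:
    """Group items into fixed-size batches for the embedding workers."""
    for i in range(0, len(items), batch_size):
        yield list(items[i:i + batch_size])


if __name__ == "__main__":
    doc = [f"t{i}" for i in range(700)]
    chunks = list(chunk_tokens(doc))
    print([len(c) for c in chunks])
    print([len(b) for b in batches(chunks, batch_size=2)])
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of embedding some tokens twice.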

4.2 Hybrid Retrieval (BM25 + Dense)

  • Run both, take top-50 from each, merge with Reciprocal Rank Fusion
  • BM25 catches exact-match terms (names, IDs); dense catches paraphrase
  • Typically ~10-15% better retrieval quality (e.g. recall@k) than either alone; corpus-dependent, so confirm on the golden set
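
Reciprocal Rank Fusion needs only the ranks, not the raw scores, which is why it works across BM25 and cosine similarity without normalization. A minimal sketch, assuming each retriever returns an ordered list of doc ids; `rrf_merge` is an illustrative name and k = 60 is the commonly used default constant.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 60,
              top_n: int = 50) -> List[str]:
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank of d)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


if __name__ == "__main__":
    bm25_hits = ["a", "b", "c"]
    dense_hits = ["c", "a", "d"]
    print(rrf_merge([bm25_hits, dense_hits]))
```

Documents found by both retrievers accumulate two terms, so they rise above documents found by only one.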

4.3 Reranking

  • Cross-encoder (bge-reranker-large or similar) on top-50 → top-5
  • Adds ~50-100ms, but is usually the biggest single quality lever
  • Run in dedicated GPU pool, batch 32
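
The top-50 → top-5 step can be sketched with a pluggable scorer. Here `rerank` and `overlap_score` are hypothetical names; `overlap_score` is a toy term-overlap stand-in so the example runs without a GPU. In a real deployment `score_fn` would wrap a cross-encoder that scores (query, passage) pairs.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]
ScoreFn = Callable[[List[Pair]], List[float]]


def rerank(query: str, passages: Sequence[str], score_fn: ScoreFn,
           top_k: int = 5, batch_size: int = 32) -> List[Tuple[str, float]]:
    """Score (query, passage) pairs in batches and keep the top_k passages."""
    scores: List[float] = []
    for i in range(0, len(passages), batch_size):
        batch = [(query, p) for p in passages[i:i + batch_size]]
        scores.extend(score_fn(batch))
    # Stable sort: equal scores keep their retrieval order.
    order = sorted(range(len(passages)), key=lambda i: scores[i], reverse=True)
    return [(passages[i], scores[i]) for i in order[:top_k]]


def overlap_score(pairs: List[Pair]) -> List[float]:
    """Toy stand-in for a cross-encoder: fraction of query terms in the passage."""
    out = []
    for query, passage in pairs:
        q_terms = set(query.lower().split())
        p_terms = set(passage.lower().split())
        out.append(len(q_terms & p_terms) / max(len(q_terms), 1))
    return out


if __name__ == "__main__":
    candidates = [
        "reset your password in settings",
        "billing invoices are monthly",
        "password reset email not arriving",
    ]
    for passage, score in rerank("password reset", candidates, overlap_score, top_k=2):
        print(f"{score:.2f}  {passage}")
```

Keeping the scorer behind a function boundary makes it easy to batch requests to the dedicated GPU pool and to swap models without touching the retrieval code.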

4.4 Caching

  • Query → answer cache (Redis, TTL 24h, keyed on query-embedding similarity; scope entries by tenant/ACL so answers never cross permission boundaries)
  • Embedding cache for repeated queries
  • LLM prefix cache via vLLM for shared system prompt
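
A toy in-memory sketch of the semantic query → answer cache. Assumptions, all illustrative: the `SemanticCache` class, the 0.95 cosine threshold, per-tenant scoping, and the linear scan (a production version would keep entries in Redis and look them up through a vector index).

```python
import math
import time
from typing import Callable, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Return a cached answer when a prior query from the same tenant is close enough."""

    def __init__(self, threshold: float = 0.95, ttl_s: float = 24 * 3600,
                 clock: Callable[[], float] = time.time) -> None:
        self.threshold = threshold
        self.ttl_s = ttl_s
        self.clock = clock
        # Each entry: (tenant_id, query embedding, answer, expiry timestamp)
        self._entries: List[Tuple[str, List[float], str, float]] = []

    def put(self, tenant_id: str, embedding: Sequence[float], answer: str) -> None:
        expires_at = self.clock() + self.ttl_s
        self._entries.append((tenant_id, list(embedding), answer, expires_at))

    def get(self, tenant_id: str, embedding: Sequence[float]) -> Optional[str]:
        now = self.clock()
        self._entries = [e for e in self._entries if e[3] > now]  # drop expired
        best, best_sim = None, self.threshold
        for tenant, cached_emb, answer, _ in self._entries:
            if tenant != tenant_id:
                continue  # never serve answers across tenants
            sim = cosine(embedding, cached_emb)
            if sim >= best_sim:
                best, best_sim = answer, sim
        return best


if __name__ == "__main__":
    cache = SemanticCache()
    cache.put("tenant-1", [1.0, 0.0], "cached answer")
    print(cache.get("tenant-1", [0.99, 0.05]))  # near-duplicate query
    print(cache.get("tenant-2", [1.0, 0.0]))    # other tenant
```

The threshold trades hit rate against the risk of serving an answer to a subtly different question, so it is worth tuning against the golden set.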

4.5 Permissions & Multi-tenancy

  • Filter at vector-store query time (WHERE tenant_id = X AND acl_hash IN (...))
  • Never filter post-hoc on retrieved docs (the global top-k may contain few permitted docs, so you'll under-retrieve)
  • For huge ACL sets, use payload-bitmap or per-tenant collections
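
A toy illustration of why post-hoc filtering under-retrieves. The corpus, scores, and function names are made up, and the search is brute force; a real vector store applies the filter during index traversal, but the effect on result counts is the same.

```python
from typing import List, Tuple

Doc = Tuple[str, str, float]  # (doc_id, tenant_id, similarity to the query)

CORPUS: List[Doc] = [
    ("a1", "A", 0.94), ("b1", "B", 0.95), ("b2", "B", 0.93),
    ("a2", "A", 0.70), ("b3", "B", 0.92), ("a3", "A", 0.60),
]


def post_filter(docs: List[Doc], tenant: str, k: int) -> List[str]:
    """Take the global top-k first, then drop docs the tenant cannot see."""
    top = sorted(docs, key=lambda d: d[2], reverse=True)[:k]
    return [doc_id for doc_id, t, _ in top if t == tenant]


def pre_filter(docs: List[Doc], tenant: str, k: int) -> List[str]:
    """Restrict to permitted docs first, then take the top-k among them."""
    allowed = [d for d in docs if d[1] == tenant]
    top = sorted(allowed, key=lambda d: d[2], reverse=True)[:k]
    return [doc_id for doc_id, _, _ in top]


if __name__ == "__main__":
    print("post-filter:", post_filter(CORPUS, "A", k=3))
    print("pre-filter: ", pre_filter(CORPUS, "A", k=3))
```

With k = 3, the post-filter path returns a single document for tenant A because two of the three global slots were spent on tenant B's documents.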

5. Eval (continuous!)

  • Golden set: 500 human-labeled (query, doc, answer) tuples
  • Run nightly: recall@10 on retrieval, RAGAS faithfulness/answer-relevance on generation
  • Block deploys on regression
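
The retrieval half of the nightly run can be sketched as below. The function names and the 0.01 tolerance are assumptions for illustration; the RAGAS generation metrics are not shown here.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the labeled relevant docs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(runs: Iterable[Tuple[Sequence[str], Set[str]]],
                     k: int = 10) -> float:
    """Average recall@k over (retrieved ids, relevant ids) pairs."""
    values: List[float] = [recall_at_k(ret, rel, k) for ret, rel in runs]
    return sum(values) / len(values) if values else 0.0


def passes_gate(current: float, baseline: float, tolerance: float = 0.01) -> bool:
    """Allow the deploy only if the metric has not dropped by more than tolerance."""
    return current >= baseline - tolerance


if __name__ == "__main__":
    golden = [
        (["d1", "d2", "d3"], {"d1", "d9"}),
        (["d4", "d5"], {"d4"}),
    ]
    score = mean_recall_at_k(golden, k=10)
    print(f"recall@10 = {score:.2f}, deploy ok: {passes_gate(score, baseline=0.80)}")
```

A small tolerance avoids blocking deploys on noise from a 500-item set while still catching real regressions.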

6. Tradeoffs

Choice                | Alt                                | When
Qdrant                | Vespa, Milvus, Weaviate, pgvector  | Vespa for hybrid built-in; pgvector for <10M scale
Cross-encoder rerank  | LLM-as-reranker                    | Cross-encoder is 100× cheaper
Per-tenant index      | Shared index + filter              | Shared scales better past ~10k tenants

7. Pitch

"Hybrid BM25 + dense (Qdrant, sharded HNSW) → cross-encoder rerank → vLLM with prefix cache. 500M chunks across ~10 search nodes; ingest via Kafka + GPU embed pool; ACL filter at query time. Continuous eval on a 500-tuple golden set, RAGAS faithfulness as the headline metric. p95 e2e < 1.5s including streamed first token."