03 — RAG at Scale (100M docs, 1k QPS)
Roles: Applied AI Engineer · Search/RAG Engineer · LLM Infrastructure
1. Clarifying Questions
- Corpus size & growth rate? Update frequency (hourly/daily/static)?
- Query latency SLO? (Typical: e2e p95 < 1.5s, retrieval p95 < 100ms.)
- Multi-tenant (per-tenant indices)? Permission filters?
- Quality target? (Faithfulness, answer relevance via RAGAS.)
2. Capacity Estimation
- 100M docs × ~5 chunks/doc = 500M chunks
- Embedding dim 768 × 4 bytes = 3 KB/vector → 1.5 TB raw vectors
- HNSW index (M=32) ≈ 2× raw → ~3 TB → shard across nodes
- 1k QPS × top-50 retrieval × HNSW (~10ms cold) → ~10 search nodes minimum
3. Architecture
Query ─► [API] ─► [Hybrid Retriever]
├── BM25 (Elastic/OpenSearch)
└── Vector (Qdrant/Vespa/Milvus, sharded HNSW)
└─► [Reranker (cross-encoder)]
└─► [LLM (vLLM)] ─► streamed answer + citations
4. Deep Dives
4.1 Indexing Pipeline
- Ingest events on Kafka → workers chunk (token-aware, 200-400 tok with overlap)
- Embed in batched workers (GPU pool, batch_size 64)
- Upsert to vector store with metadata (tenant_id, doc_id, ACL hash)
- BM25 index updated in parallel
- Backfill via Spark job for full re-embeds when changing model
4.2 Hybrid Retrieval (BM25 + Dense)
- Run both, take top-50 from each, merge with Reciprocal Rank Fusion
- BM25 catches exact-match terms (names, IDs); dense catches paraphrase
- ~10-15% improvement over either alone
4.3 Reranking
- Cross-encoder (bge-reranker-large or similar) on top-50 → top-5
- Adds ~50-100ms but biggest single quality lever
- Run in dedicated GPU pool, batch 32
4.4 Caching
- Query → answer cache (Redis, TTL 24h, semantic-similar key)
- Embedding cache for repeated queries
- LLM prefix cache via vLLM for shared system prompt
4.5 Permissions & Multi-tenancy
- Filter at vector-store query time (
WHERE tenant_id = X AND acl_hash IN (...)) - Never filter post-hoc on retrieved docs (you'll under-retrieve)
- For huge ACL sets, use payload-bitmap or per-tenant collections
5. Eval (continuous!)
- Golden set: 500 (query, doc, answer) tuples human-labeled
- Run nightly: recall@10 on retrieval, RAGAS faithfulness/answer-relevance on generation
- Block deploys on regression
6. Tradeoffs
| Choice | Alt | When |
|---|---|---|
| Qdrant | Vespa, Milvus, Weaviate, pgvector | Vespa for hybrid built-in; pgvector for <10M scale |
| Cross-encoder rerank | LLM-as-reranker | Cross-encoder is 100× cheaper |
| Per-tenant index | Shared index + filter | Shared scales better past ~10k tenants |
7. Pitch
"Hybrid BM25 + dense (Qdrant, sharded HNSW) → cross-encoder rerank → vLLM with prefix cache. 500M chunks across 8 search nodes; ingest via Kafka + GPU embed pool; ACL filter at query time. Continuous eval on a 500-tuple golden set, RAGAS faithfulness as the headline metric. p95 e2e < 1.5s including streamed first token."