03 — RAG at Scale (100M docs, 1k QPS)

Roles: Applied AI Engineer · Search/RAG Engineer · LLM Infrastructure

1. Clarifying Questions

Corpus size & growth rate? Update frequency (hourly/daily/static)?
Query latency SLO? (Typical: e2e p95 < 1.5s, retrieval p95 < 100ms.)
Multi-tenant (per-tenant indices)? Permission filters?
Quality target? (Faithfulness, answer relevance via RAGAS.)

2. Capacity Estimation

100M docs × ~5 chunks/doc = 500M chunks
Embedding dim 768 × 4 bytes = 3 KB/vector → 1.5 TB raw vectors
HNSW index (M=32): budget ≈ 2× raw for graph links, payload, and headroom → ~3 TB → shard across nodes
1k QPS × ~10 ms per top-50 HNSW search (cold) ≈ 10 searches in flight at any moment → ~10 search nodes minimum (≈300 GB of index each)
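
The estimate above can be reproduced in a few lines. This is a back-of-envelope sketch only: it uses decimal units (1 TB = 1e12 bytes), float32 vectors, and the 2× index multiplier and 10-node count quoted above as given inputs rather than measured values.

```python
# Back-of-envelope capacity check; all inputs are the figures quoted above.
DOCS = 100_000_000
CHUNKS_PER_DOC = 5
DIM = 768
BYTES_PER_FLOAT = 4        # float32
INDEX_MULTIPLIER = 2.0     # rule-of-thumb budget for HNSW, not a measurement
QPS = 1_000
SEARCH_LATENCY_S = 0.010   # ~10 ms per top-50 search
NODES = 10

chunks = DOCS * CHUNKS_PER_DOC
bytes_per_vector = DIM * BYTES_PER_FLOAT
raw_tb = chunks * bytes_per_vector / 1e12
index_tb = raw_tb * INDEX_MULTIPLIER

# Little's law: requests in flight = arrival rate × time in system.
in_flight = QPS * SEARCH_LATENCY_S
gb_per_node = index_tb * 1000 / NODES

print(f"chunks:           {chunks:,}")
print(f"bytes per vector: {bytes_per_vector}")
print(f"raw vectors:      {raw_tb:.2f} TB")
print(f"index budget:     {index_tb:.2f} TB")
print(f"searches in flight: {in_flight:.0f}")
print(f"index per node:   {gb_per_node:.0f} GB across {NODES} nodes")
```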

3. Architecture

Query ─► [API] ─► [Hybrid Retriever]
                       ├── BM25 (Elastic/OpenSearch)
                       └── Vector (Qdrant/Vespa/Milvus, sharded HNSW)
                  └─► [Reranker (cross-encoder)]
                  └─► [LLM (vLLM)]   ─► streamed answer + citations

4. Deep Dives

4.1 Indexing Pipeline

Ingest events on Kafka → workers chunk (token-aware, 200-400 tok with overlap)
Embed in batched workers (GPU pool, batch_size 64)
Upsert to vector store with metadata (tenant_id, doc_id, ACL hash)
BM25 index updated in parallel
Backfill via Spark job for full re-embeds when changing model
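
The chunking step can be sketched as a sliding window over tokens. This is a minimal illustration, not the pipeline's actual code: the function name chunk_tokens, the 300-token window, and the 50-token overlap are assumptions (the notes only say 200-400 tokens with overlap), and the input is assumed to be already tokenized by the embedding model's tokenizer.

```python
from typing import List, Sequence


def chunk_tokens(tokens: Sequence, chunk_size: int = 300, overlap: int = 50) -> List[list]:
    """Split a token sequence into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window reaches the end, to avoid a tail chunk
        # that is fully contained in the previous one.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Example with integers standing in for token ids.
example = chunk_tokens(list(range(700)))
print([len(c) for c in example])
```

Each chunk would then be embedded in batches and upserted with its tenant_id, doc_id, and ACL hash.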

4.2 Hybrid Retrieval (BM25 + Dense)

Run both, take top-50 from each, merge with Reciprocal Rank Fusion
BM25 catches exact-match terms (names, IDs); dense catches paraphrase
Rule of thumb: ~10-15% better retrieval quality than either alone; confirm on the golden set
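
Reciprocal Rank Fusion scores each document as the sum of 1 / (k + rank) over the lists it appears in, so it needs only ranks, not comparable scores. A minimal sketch follows; the function name rrf_merge is illustrative, and k = 60 is the commonly used default, assumed here rather than taken from the notes.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_merge(rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 50) -> List[str]:
    """Fuse several ranked lists of doc ids with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_hits = ["a", "b", "c"]    # ranked ids from the lexical index
dense_hits = ["b", "d", "a"]   # ranked ids from the vector index
print(rrf_merge([bm25_hits, dense_hits]))
```

Documents found by both retrievers ("a" and "b" here) rise above those found by only one, which is the behavior the hybrid setup relies on.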

4.3 Reranking

Cross-encoder (bge-reranker-large or similar) on top-50 → top-5
Adds ~50-100ms but biggest single quality lever
Run in dedicated GPU pool, batch 32
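
The rerank stage amounts to scoring (query, passage) pairs in batches and keeping the best few. The sketch below keeps the model behind a callable so it runs without a GPU: score_batch is a hypothetical hook that in production would wrap the cross-encoder, and overlap_score is a toy stand-in used only to make the example executable.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]


def rerank(query: str, candidates: Sequence[str],
           score_batch: Callable[[List[Pair]], List[float]],
           top_k: int = 5, batch_size: int = 32) -> List[Tuple[str, float]]:
    """Score every (query, passage) pair in batches; return the top_k by score."""
    scores: List[float] = []
    for i in range(0, len(candidates), batch_size):
        batch = candidates[i:i + batch_size]
        scores.extend(score_batch([(query, text) for text in batch]))
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]


def overlap_score(pairs: List[Pair]) -> List[float]:
    """Toy scorer (shared-word count); a real deployment would call the model here."""
    return [float(len(set(q.lower().split()) & set(t.lower().split())))
            for q, t in pairs]


passages = ["reset your password in settings", "billing faq",
            "password reset link expires"]
print(rerank("how to reset password", passages, overlap_score, top_k=2))
```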

4.4 Caching

Query → answer cache (Redis, TTL 24h, keyed by query-embedding similarity so near-duplicate queries hit)
Embedding cache for repeated queries
LLM prefix cache via vLLM for shared system prompt
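
The query → answer cache can be illustrated with an in-memory stand-in. This is a sketch under stated assumptions: the class SemanticCache, the 0.95 cosine threshold, and the linear scan are all illustrative; a real deployment would use Redis (or a vector index) rather than a Python list.

```python
import math
import time
from typing import Callable, List, Optional, Sequence, Tuple


class SemanticCache:
    """In-memory sketch: return a cached answer when a past query is similar enough."""

    def __init__(self, threshold: float = 0.95, ttl_s: float = 24 * 3600,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.ttl_s = ttl_s
        self.clock = clock
        self.entries: List[Tuple[List[float], str, float]] = []  # (embedding, answer, expiry)

    @staticmethod
    def _cosine(a: Sequence[float], b: Sequence[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, embedding: Sequence[float]) -> Optional[str]:
        now = self.clock()
        self.entries = [e for e in self.entries if e[2] > now]  # drop expired entries
        best = max(self.entries, key=lambda e: self._cosine(embedding, e[0]), default=None)
        if best is not None and self._cosine(embedding, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, embedding: Sequence[float], answer: str) -> None:
        self.entries.append((list(embedding), answer, self.clock() + self.ttl_s))


cache = SemanticCache()
cache.put([1.0, 0.0], "cached answer")
print(cache.get([0.99, 0.05]))  # near-duplicate query embedding
print(cache.get([0.0, 1.0]))    # unrelated query embedding
```

In a multi-tenant deployment the lookup would presumably also be scoped by tenant_id and the ACL hash from 4.5, so that one user's cached answer is never served to another.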

4.5 Permissions & Multi-tenancy

Filter at vector-store query time (WHERE tenant_id = X AND acl_hash IN (...))
Never filter post-hoc on retrieved docs: if most of the top-k fails the ACL check, you return fewer than k results (under-retrieval)
For huge ACL sets, use payload-bitmap or per-tenant collections
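
The under-retrieval problem is easy to reproduce with a toy brute-force search. The data and the search helper below are made up for illustration; a real vector store applies the filter inside its index traversal rather than by scanning a list.

```python
from typing import Dict, List, Optional, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def search(query: Sequence[float], chunks: List[Dict], k: int,
           tenant_id: Optional[str] = None) -> List[Dict]:
    """Brute-force top-k; when tenant_id is given, filter BEFORE ranking."""
    pool = [c for c in chunks if tenant_id is None or c["tenant_id"] == tenant_id]
    return sorted(pool, key=lambda c: dot(query, c["vec"]), reverse=True)[:k]


chunks = [
    {"id": "a1", "tenant_id": "A", "vec": [0.9, 0.1]},
    {"id": "b1", "tenant_id": "B", "vec": [1.0, 0.0]},
    {"id": "b2", "tenant_id": "B", "vec": [0.95, 0.0]},
    {"id": "a2", "tenant_id": "A", "vec": [0.5, 0.5]},
    {"id": "a3", "tenant_id": "A", "vec": [0.2, 0.8]},
]
query = [1.0, 0.0]

# Filter at query time: tenant A still gets a full top-2.
pre = [c["id"] for c in search(query, chunks, k=2, tenant_id="A")]

# Filter after retrieval: the global top-2 belongs to tenant B, so nothing survives.
post = [c["id"] for c in search(query, chunks, k=2) if c["tenant_id"] == "A"]

print("pre-filter: ", pre)
print("post-filter:", post)
```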

5. Eval (continuous!)

Golden set: 500 human-labeled (query, doc, answer) tuples
Run nightly: recall@10 on retrieval, RAGAS faithfulness/answer-relevance on generation
Block deploys on regression
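
The retrieval half of the nightly job can be sketched as below. The function names and the 0.01 regression tolerance are assumptions for illustration; the RAGAS generation metrics are not shown, and retrieve stands in for whatever callable returns ranked doc ids for a query.

```python
from typing import Callable, Iterable, List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 10) -> float:
    """Fraction of the relevant doc ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall_at_k(golden: Iterable[Tuple[str, Set[str]]],
                     retrieve: Callable[[str], List[str]], k: int = 10) -> float:
    """Average recall@k over (query, relevant_doc_ids) pairs from the golden set."""
    rows = list(golden)
    return sum(recall_at_k(retrieve(q), rel, k) for q, rel in rows) / len(rows)


def should_block_deploy(baseline: float, candidate: float, tolerance: float = 0.01) -> bool:
    """Block when the candidate falls more than `tolerance` below the baseline."""
    return candidate < baseline - tolerance


golden = [("q1", {"d1"}), ("q2", {"d5"})]
fake_index = {"q1": ["d1", "d2"], "q2": ["d3"]}
score = mean_recall_at_k(golden, lambda q: fake_index[q])
print(score, should_block_deploy(baseline=0.90, candidate=score))
```

With one labeled doc per query, as in the golden set above, recall@10 reduces to a hit rate.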

6. Tradeoffs

Choice	Alt	When
Qdrant	Vespa, Milvus, Weaviate, pgvector	Vespa for hybrid built-in; pgvector for <10M scale
Cross-encoder rerank	LLM-as-reranker	Cross-encoder is 100× cheaper
Per-tenant index	Shared index + filter	Shared scales better past ~10k tenants

7. Pitch

"Hybrid BM25 + dense (Qdrant, sharded HNSW) → cross-encoder rerank → vLLM with prefix cache. 500M chunks across ~10 search nodes; ingest via Kafka + GPU embed pool; ACL filter at query time. Continuous eval on a 500-tuple golden set, RAGAS faithfulness as the headline metric. p95 e2e < 1.5s including streamed first token."