05 — Eval Platform (Continuous + LLM-Judge)
Roles: Model Evaluation Engineer · Trust & Safety Engineer
1. Requirements
- Run benchmarks on every model checkpoint (continuous eval)
- Mix: classic benchmarks (MMLU, GSM8K, HumanEval), task-specific suites, LLM-judge head-to-head, human eval (sampled), red-team
- Reproducible; comparable across time
- Block bad checkpoints from promotion
2. Architecture
[Checkpoint event] → [Eval orchestrator]
│
├──► [Likelihood-based eval] (lm-eval-harness shape)
├──► [Generation eval] (vLLM batched)
├──► [LLM judge] (head-to-head vs reference)
├──► [Code eval] (sandboxed exec — gVisor/Firecracker)
└──► [Red-team prompts] (jailbreaks, harmful requests)
│
▼
[Results DB] → [Dashboard] → [Promotion gate]
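A minimal sketch of the orchestrator fan-out in the diagram above, assuming each eval track is exposed as a plain callable. The `Runner` signature and the `run_all` name are hypothetical and not part of any real harness; a production orchestrator would more likely dispatch to a job queue than to threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical runner signature: takes a checkpoint id, returns metric -> value.
Runner = Callable[[str], Dict[str, float]]


def run_all(checkpoint_id: str, runners: Dict[str, Runner]) -> Dict[str, Dict[str, float]]:
    """Run every eval track in parallel and collect results keyed by track name."""
    with ThreadPoolExecutor(max_workers=max(1, len(runners))) as pool:
        futures = {name: pool.submit(fn, checkpoint_id) for name, fn in runners.items()}
        # .result() re-raises any exception from a track, so a failing
        # track fails the whole run instead of silently going missing.
        return {name: fut.result() for name, fut in futures.items()}
```

The returned mapping is what would be written to the results DB, one row per (checkpoint, track, metric).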
3. Deep Dives
3.1 Reproducibility
- Pin: model commit, tokenizer, eval-harness version, prompt templates, sampling params (or temp=0)
- Cache predictions keyed on (model_hash, prompt_hash, sampling_hash) → expensive evals run once
- Seed every source of randomness (few-shot sampling, generation, judge-order shuffling)
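The prediction cache can be sketched as follows. This is an illustration under assumptions: `cache_key`, `PredictionCache`, and the in-memory dict are hypothetical names and storage, and the sampling parameters are assumed to be JSON-serializable. The point is that the key is derived only from the pinned inputs, so an unchanged (model, prompt, sampling) triple is never regenerated.

```python
import hashlib
import json
from typing import Callable, Dict


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cache_key(model_hash: str, prompt: str, sampling: dict) -> str:
    """Key on (model_hash, prompt_hash, sampling_hash)."""
    # sort_keys makes the hash independent of dict insertion order.
    sampling_hash = _sha256(json.dumps(sampling, sort_keys=True, separators=(",", ":")))
    prompt_hash = _sha256(prompt)
    return _sha256("|".join([model_hash, prompt_hash, sampling_hash]))


class PredictionCache:
    """In-memory stand-in for a persistent prediction store (assumption)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def get_or_compute(self, model_hash: str, prompt: str, sampling: dict,
                       generate: Callable[[str], str]) -> str:
        key = cache_key(model_hash, prompt, sampling)
        if key not in self._store:
            self._store[key] = generate(prompt)  # the expensive call runs once
        return self._store[key]
```

Caching is only sound when generation is deterministic for the key, which is why the sampling parameters and seed belong in `sampling`.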
3.2 LLM-as-Judge
- Use a different and stronger model as judge
- Pairwise (A vs B), randomized order to defeat positional bias
- Rubric in system prompt; chain-of-thought encouraged
- Validate the judge: on a 100-item human-labeled set, require Cohen's κ > 0.7 against the human labels before trusting it
- Beware: judges have known biases (verbosity, sycophancy, self-preference)
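Two of the steps above lend themselves to a short sketch: randomizing presentation order in a pairwise comparison, and checking judge agreement with humans via Cohen's κ. The function names and the verdict strings (`"first"`, `"second"`, `"tie"`) are assumptions for illustration; the judge call itself is omitted.

```python
import random
from collections import Counter
from typing import Sequence, Tuple


def order_pair(answer_a: str, answer_b: str, rng: random.Random) -> Tuple[str, str, bool]:
    """Return (first, second, swapped) with the order chosen at random."""
    swapped = rng.random() < 0.5
    if swapped:
        return answer_b, answer_a, True
    return answer_a, answer_b, False


def unswap(verdict: str, swapped: bool) -> str:
    """Map a positional verdict ('first'/'second'/'tie') back to 'A'/'B'/'tie'."""
    if verdict == "tie":
        return "tie"
    picked_first = verdict == "first"
    # If the pair was swapped, the first slot held B.
    return "A" if picked_first != swapped else "B"


def cohen_kappa(labels_1: Sequence[str], labels_2: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    n = len(labels_1)
    if n == 0 or n != len(labels_2):
        raise ValueError("label sequences must be non-empty and equal length")
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    c1, c2 = Counter(labels_1), Counter(labels_2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

Passing a seeded `random.Random` keeps the order assignment reproducible, and `cohen_kappa(human_labels, judge_labels) > 0.7` is the acceptance check on the validation set.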
3.3 Code Eval Safety
- Untrusted code in sandbox (gVisor, Firecracker, or Docker w/ seccomp + no-net)
- Time limits (10s/test) + memory limits + syscall denylist
- Never run untrusted generated code on shared infra without isolation
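The time and memory limits can be sketched with the standard library on Linux. This is not a sandbox: it assumes the process is already running inside an isolated environment (gVisor, Firecracker, or a locked-down container) and only adds per-test resource caps. `run_limited` and its default limits are hypothetical.

```python
import resource
import subprocess
import sys


def run_limited(code: str, timeout_s: float = 10.0,
                mem_bytes: int = 512 * 1024 * 1024) -> dict:
    """Run Python source in a child process with wall-clock, CPU, and memory caps."""

    def set_limits() -> None:
        # Runs in the child just before exec (POSIX only).
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))
        cpu_s = int(timeout_s) + 1
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_s, cpu_s))

    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env and user site
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed when this expires
            preexec_fn=set_limits,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "returncode": None}
    status = "ok" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "returncode": proc.returncode}
```

Network and syscall restrictions are deliberately absent here; those belong to the sandbox layer, not to the test runner.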
3.4 Red-Team
- Static suite of jailbreak attempts + harmful requests
- Track: refusal-rate on harmful, over-refusal on benign (the dual)
- Periodically refresh with new jailbreaks from research/Twitter
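The two tracked rates can be computed as below. A sketch only: `RedTeamResult` is a hypothetical record type, and the `refused` flag is assumed to come from some upstream refusal classifier or human label that is not shown.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class RedTeamResult:
    prompt_id: str
    is_harmful: bool  # ground-truth label of the prompt
    refused: bool     # whether the model refused (assumed to be labeled upstream)


def refusal_metrics(results: Iterable[RedTeamResult]) -> Dict[str, float]:
    """Refusal rate on harmful prompts and over-refusal rate on benign ones."""
    results = list(results)
    harmful = [r for r in results if r.is_harmful]
    benign = [r for r in results if not r.is_harmful]
    if not harmful or not benign:
        raise ValueError("suite needs both harmful and benign prompts")
    return {
        "refusal_on_harmful": sum(r.refused for r in harmful) / len(harmful),
        "over_refusal_on_benign": sum(r.refused for r in benign) / len(benign),
    }
```

Reporting both numbers together matters because a model that refuses everything scores perfectly on the first and fails the second.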
3.5 Statistical Rigor
- Bootstrap CIs on accuracy
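- For pairwise: Wilson interval on win-rate. At n = 200 the 95% interval near 50% is about ±7 points, so resolving a 5-point difference takes roughly n ≥ 400 (more for adequate power)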
- McNemar's test for paired comparisons
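The three procedures above fit in a few lines of standard-library Python. This is a sketch under assumptions: the function names are hypothetical, the bootstrap is the plain percentile variant, and McNemar's test is shown in its exact (binomial) form on the discordant pairs.

```python
import math
import random
from typing import Sequence, Tuple


def wilson_interval(wins: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def bootstrap_ci(correct: Sequence[int], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI on accuracy from per-item 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = items model 1 got right and model 2 got wrong; c = the reverse.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, `wilson_interval(100, 200)` spans roughly 43% to 57%, which gives a sense of how much a win-rate can move by chance at that sample size.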
4. Promotion Gate Rules (example)
- No eval may regress by more than 1 percentage point (absolute) vs current production
- LLM-judge win-rate must be ≥ 50%, with the lower bound of its confidence interval no lower than 45%
- Refusal-on-harmful ≥ 99%; over-refusal ≤ 5%
- Manual override requires PR with justification
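The example rules above could be encoded as a single check that returns the list of violated rules. A sketch, assuming all scores are fractions in [0, 1] and higher is better for every benchmark; `gate_failures` and its parameter names are hypothetical.

```python
from typing import Dict, List


def gate_failures(candidate: Dict[str, float], production: Dict[str, float],
                  judge_win_rate: float, judge_ci_lower: float,
                  refusal_on_harmful: float, over_refusal: float,
                  max_regression: float = 0.01) -> List[str]:
    """Return human-readable reasons the checkpoint fails the gate (empty = pass)."""
    failures: List[str] = []
    eps = 1e-9  # tolerance so float rounding does not trip the 1-point threshold
    for name, prod_score in production.items():
        cand_score = candidate.get(name)
        if cand_score is None:
            failures.append(f"{name}: missing from candidate results")
        elif prod_score - cand_score > max_regression + eps:
            failures.append(f"{name}: regressed {prod_score - cand_score:.3f} absolute")
    if judge_win_rate < 0.50:
        failures.append(f"judge win-rate {judge_win_rate:.3f} < 0.50")
    if judge_ci_lower < 0.45:
        failures.append(f"judge CI lower bound {judge_ci_lower:.3f} < 0.45")
    if refusal_on_harmful < 0.99:
        failures.append(f"refusal-on-harmful {refusal_on_harmful:.3f} < 0.99")
    if over_refusal > 0.05:
        failures.append(f"over-refusal {over_refusal:.3f} > 0.05")
    return failures
```

Returning reasons rather than a bare boolean gives the manual-override PR something concrete to justify against.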
5. Cost
- Full eval suite ~$200-500 per checkpoint (LLM-judge dominates)
- Cache aggressively