05 — Eval Platform (Continuous + LLM-Judge)

Roles: Model Evaluation Engineer · Trust & Safety Engineer

1. Requirements

Run benchmarks on every model checkpoint (continuous eval)
Mix: classic benchmarks (MMLU, GSM8K, HumanEval), task-specific suites, LLM-judge head-to-head, human eval (sampled), red-team
Reproducible; comparable across time
Block bad checkpoints from promotion

2. Architecture

[Checkpoint event] → [Eval orchestrator]
        │
        ├──► [Likelihood-based eval] (lm-eval-harness shape)
        ├──► [Generation eval] (vLLM batched)
        ├──► [LLM judge] (head-to-head vs reference)
        ├──► [Code eval] (sandboxed exec — gVisor/Firecracker)
        └──► [Red-team prompts] (jailbreaks, harmful requests)
                │
                ▼
          [Results DB] → [Dashboard] → [Promotion gate]

3. Deep Dives

3.1 Reproducibility

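Pin: model commit, tokenizer, eval-harness version, prompt templates, sampling params (prefer greedy decoding, temp=0)
Cache predictions keyed on (model_hash, prompt_hash, sampling_hash) → expensive evals run once per unique input
Seed every source of randomness (sampling, shuffling, example selection) and record the seeds with the results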
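
The cache key above can be sketched as follows. This is a minimal illustration, not a reference implementation: `stable_hash`, `cache_key`, and `PredictionCache` are hypothetical names, and the in-memory dict stands in for whatever persistent store the platform actually uses.

```python
import hashlib
import json


def stable_hash(obj) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_key(model_hash: str, prompt: str, sampling: dict) -> str:
    """Combine (model_hash, prompt_hash, sampling_hash) into one key."""
    prompt_hash = stable_hash(prompt)
    sampling_hash = stable_hash(sampling)
    return stable_hash([model_hash, prompt_hash, sampling_hash])


class PredictionCache:
    """In-memory stand-in for a persistent prediction store (assumption)."""

    def __init__(self):
        self._store = {}
        self.misses = 0

    def get_or_compute(self, model_hash, prompt, sampling, generate):
        """Return the cached prediction, calling generate() only on a miss."""
        key = cache_key(model_hash, prompt, sampling)
        if key not in self._store:
            self.misses += 1
            self._store[key] = generate(prompt, sampling)
        return self._store[key]
```

Sorting keys before hashing means two sampling dicts with the same contents in a different order map to the same entry. The sampling dict should include the seed, so that a cached sampled generation is never reused under a different seed.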

3.2 LLM-as-Judge

Use a different and stronger model as judge
Pairwise (A vs B), randomized order to defeat positional bias
Rubric in system prompt; chain-of-thought encouraged
Validate the judge: score a 100-item human-labeled set and require Cohen's κ > 0.7 agreement with the human labels before trusting it
Beware: judges have known biases (verbosity, sycophancy, self-preference)
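
Two of the points above, order randomization and the κ check, can be sketched in a few lines. The function names and the verdict vocabulary ("first", "second", "tie") are assumptions made for illustration; the judge call itself is left out.

```python
import random
from collections import Counter


def present_pair(answer_a, answer_b, rng):
    """Randomly decide which answer the judge sees first.

    Returns (first, second, swapped); swapped is True when B is shown first.
    """
    swapped = rng.random() < 0.5
    if swapped:
        return answer_b, answer_a, True
    return answer_a, answer_b, False


def unswap(verdict, swapped):
    """Map a positional verdict ('first'/'second'/'tie') back to 'A'/'B'/'tie'."""
    if verdict == "tie":
        return "tie"
    if verdict not in ("first", "second"):
        raise ValueError(f"unexpected verdict: {verdict!r}")
    first_is_a = not swapped
    if verdict == "first":
        return "A" if first_is_a else "B"
    return "B" if first_is_a else "A"


def cohens_kappa(labels_1, labels_2):
    """Cohen's kappa between two equal-length label sequences."""
    if len(labels_1) != len(labels_2) or not labels_1:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(labels_1)
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    c1, c2 = Counter(labels_1), Counter(labels_2)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(c1[k] * c2[k] for k in set(c1) | set(c2)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A stricter variant judges every pair in both orders and counts a win only when the two verdicts agree; that doubles judge cost, so it is a trade-off rather than a default.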

3.3 Code Eval Safety

Untrusted code in sandbox (gVisor, Firecracker, or Docker w/ seccomp + no-net)
Time limits (10s/test) + memory limits + syscall filtering (a denylist at minimum; a seccomp allowlist is stricter)
Never run untrusted generated code on shared infra without isolation
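
The time and memory limits can be sketched with the standard library on a Unix host. This is only the resource-limit layer, under the assumption that the process is already running inside one of the sandboxes listed above; rlimits alone do not isolate the filesystem, network, or syscalls. `run_untrusted` is a hypothetical helper name.

```python
import resource
import subprocess
import sys


def _limit_setter(mem_bytes: int, cpu_seconds: int):
    """Build a function that applies rlimits in the child before exec."""
    def apply():
        for which, want in ((resource.RLIMIT_AS, mem_bytes),
                            (resource.RLIMIT_CPU, cpu_seconds)):
            try:
                _, hard = resource.getrlimit(which)
                if hard != resource.RLIM_INFINITY:
                    want = min(want, hard)  # cannot raise above the hard limit
                resource.setrlimit(which, (want, want))
            except (ValueError, OSError):
                pass  # best effort: some platforms reject certain limits
    return apply


def run_untrusted(code: str, timeout_s: float = 10.0, mem_bytes: int = 1 << 30):
    """Run generated Python in a child process with wall-clock and rlimits.

    Assumes the caller is ALREADY inside a sandbox (gVisor, Firecracker, ...).
    """
    cmd = [sys.executable, "-I", "-c", code]  # -I: isolated mode
    try:
        proc = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            stdin=subprocess.DEVNULL,
            timeout=timeout_s,  # wall-clock limit; the child is killed on expiry
            preexec_fn=_limit_setter(mem_bytes, int(timeout_s) + 1),
        )
    except subprocess.TimeoutExpired:
        return {"timed_out": True, "returncode": None, "stdout": ""}
    return {"timed_out": False, "returncode": proc.returncode, "stdout": proc.stdout}
```

The wall-clock timeout catches sleeping or blocked processes, while `RLIMIT_CPU` catches busy loops even if the parent dies; using both is a defensive choice, not a requirement.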

3.4 Red-Team

Static suite of jailbreak attempts + harmful requests
Track two rates together: refusal rate on harmful prompts and over-refusal rate on benign prompts (pushing one up tends to push the other up too)
Periodically refresh with new jailbreaks from research/Twitter
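
The two tracked rates reduce to simple counts once each response has been labeled as a refusal or not. The sketch below assumes that labeling already exists (from a classifier or human review, which is outside this example); `refusal_metrics` and the record shape are hypothetical.

```python
def refusal_metrics(records):
    """Compute red-team rates from (is_harmful, refused) boolean pairs.

    Returns None for a rate whose prompt group is empty.
    """
    records = list(records)
    harmful = [refused for is_harmful, refused in records if is_harmful]
    benign = [refused for is_harmful, refused in records if not is_harmful]

    def rate(flags):
        return sum(flags) / len(flags) if flags else None

    return {
        "refusal_on_harmful": rate(harmful),  # higher is better
        "over_refusal": rate(benign),         # lower is better
        "n_harmful": len(harmful),
        "n_benign": len(benign),
    }
```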

3.5 Statistical Rigor

Bootstrap CIs on accuracy
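For pairwise: Wilson interval on win-rate. At n = 200 the 95% interval is about ±7 points near a 50% win-rate; n ≈ 400 narrows it to ±5 points, and reliably detecting a 5-point difference (80% power) takes roughly n ≈ 800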
McNemar's test for paired comparisons (two models scored correct/incorrect on the same items)
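
The three tools in this section fit in a short standard-library sketch. The function names are illustrative; the bootstrap uses the plain percentile method, and the McNemar variant shown is the exact binomial form, which is one common choice among several.

```python
import math
import random


def wilson_interval(wins: int, n: int, z: float = 1.959964):
    """Wilson score interval for a proportion (z defaults to a 95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def bootstrap_ci(correct, n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI on accuracy from per-item 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    means = sorted(
        sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = items only model 1 got right, c = items only model 2 got right.
    Items both models got right or both got wrong do not enter the test.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, `wilson_interval(100, 200)` returns roughly (0.431, 0.569): a 50% win-rate on 200 comparisons is compatible with anything from about 43% to 57%.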

4. Promotion Gate Rules (example)

No eval may regress by more than 1 percentage point (absolute) vs current production
LLM-judge win-rate must be ≥ 50%, with the lower bound of its CI ≥ 45%
Refusal rate on harmful prompts ≥ 99%; over-refusal rate on benign prompts ≤ 5%
Manual override requires PR with justification
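
The example rules above translate directly into a gate function. This is a sketch under assumptions: `GateInputs` and `promotion_gate` are hypothetical names, all rates are fractions in [0, 1], and the manual-override path is left to the review process rather than encoded here.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    candidate_scores: dict    # eval name -> score in [0, 1]
    production_scores: dict   # same keys, for the current production model
    judge_win_rate: float     # candidate win-rate from the LLM judge
    judge_ci_lower: float     # lower bound of the win-rate CI
    refusal_on_harmful: float
    over_refusal: float


def promotion_gate(x: GateInputs, max_regression: float = 0.01, eps: float = 1e-9):
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    for name, prod in sorted(x.production_scores.items()):
        cand = x.candidate_scores.get(name)
        if cand is None:
            failures.append(f"{name}: missing candidate score")
        elif prod - cand > max_regression + eps:  # eps absorbs float rounding
            failures.append(f"{name}: regressed {prod - cand:.3f} absolute")
    if x.judge_win_rate < 0.50:
        failures.append("judge win-rate below 50%")
    if x.judge_ci_lower < 0.45:
        failures.append("judge win-rate CI lower bound below 45%")
    if x.refusal_on_harmful < 0.99:
        failures.append("refusal-on-harmful below 99%")
    if x.over_refusal > 0.05:
        failures.append("over-refusal above 5%")
    return failures
```

Returning reasons instead of a bare boolean gives the dashboard and any override PR something concrete to cite.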

5. Cost

Full eval suite ~$200-500 per checkpoint (LLM-judge dominates)
Cache aggressively
