05 — Eval Platform (Continuous + LLM-Judge)

Roles: Model Evaluation Engineer · Trust & Safety Engineer

1. Requirements

  • Run benchmarks on every model checkpoint (continuous eval)
  • Mix: classic benchmarks (MMLU, GSM8K, HumanEval), task-specific suites, LLM-judge head-to-head, human eval (sampled), red-team
  • Reproducible; comparable across time
  • Block bad checkpoints from promotion

2. Architecture

[Checkpoint event] → [Eval orchestrator]
        │
        ├──► [Likelihood-based eval] (lm-eval-harness shape)
        ├──► [Generation eval] (vLLM batched)
        ├──► [LLM judge] (head-to-head vs reference)
        ├──► [Code eval] (sandboxed exec — gVisor/Firecracker)
        └──► [Red-team prompts] (jailbreaks, harmful requests)
                │
                ▼
          [Results DB] → [Dashboard] → [Promotion gate]
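
The fan-out in the diagram can be sketched as below. This is a minimal illustration, not a production orchestrator: the runner signature (checkpoint ID in, metric dict out) and the name `run_all_evals` are assumptions made for the example, and a real system would likely dispatch to a job queue rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Hypothetical runner type: takes a checkpoint ID, returns metric name -> score.
EvalRunner = Callable[[str], Dict[str, float]]


def run_all_evals(checkpoint_id: str, runners: Dict[str, EvalRunner]) -> Dict[str, dict]:
    """Fan one checkpoint out to every eval track in parallel and gather results."""
    results: Dict[str, dict] = {}
    with ThreadPoolExecutor(max_workers=max(1, len(runners))) as pool:
        futures = {name: pool.submit(fn, checkpoint_id) for name, fn in runners.items()}
        for name, future in futures.items():
            try:
                results[name] = {"status": "ok", "metrics": future.result()}
            except Exception as exc:
                # Record the crash instead of dropping the track,
                # so a downstream gate can fail closed on missing results.
                results[name] = {"status": "error", "detail": repr(exc)}
    return results
```

Recording failed tracks explicitly matters for the promotion gate: a track that crashed should block promotion rather than silently count as "no regression".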

3. Deep Dives

3.1 Reproducibility

  • Pin: model commit, tokenizer, eval-harness version, prompt templates, sampling params (or use greedy decoding, temp=0)
  • Cache predictions keyed on (model_hash, prompt_hash, sampling_hash) → expensive evals run once
  • Seed every source of randomness (sampling, shuffling, few-shot example selection)
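
One way to build the prediction-cache key is sketched below. The helper names (`stable_hash`, `cache_key`) are hypothetical; the only idea taken from the notes is the (model_hash, prompt_hash, sampling_hash) triple. Canonical JSON is used so that two sampling dicts with the same contents but different key order hash identically.

```python
import hashlib
import json


def stable_hash(obj) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def cache_key(model_hash: str, prompt: str, sampling: dict) -> str:
    """Combine the three pinned components into a single cache key."""
    prompt_hash = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    sampling_hash = stable_hash(sampling)
    return stable_hash([model_hash, prompt_hash, sampling_hash])
```

Any change to the prompt template, the sampling parameters, or the model commit produces a new key, so stale predictions are never reused by accident.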

3.2 LLM-as-Judge

  • Use a different and stronger model as judge
  • Pairwise (A vs B), randomized order to defeat positional bias
  • Rubric in system prompt; chain-of-thought encouraged
  • Validate the judge: on a 100-item human-labeled set, require Cohen's κ > 0.7 against the human labels before trusting it
  • Beware: judges have known biases (verbosity, sycophancy, self-preference)
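
The order randomization and the κ check can be sketched as follows. This is an illustrative sketch under assumptions: the judge is assumed to answer with "first", "second", or "tie", and the function names are invented for the example. The judge call itself is omitted.

```python
import random
from collections import Counter


def make_pairwise_order(answer_a: str, answer_b: str, rng: random.Random):
    """Randomly swap presentation order; returns (first, second, swapped)."""
    swapped = rng.random() < 0.5
    first, second = (answer_b, answer_a) if swapped else (answer_a, answer_b)
    return first, second, swapped


def unswap_verdict(verdict: str, swapped: bool) -> str:
    """Map a positional verdict ('first'/'second'/'tie') back to 'A'/'B'/'tie'."""
    if verdict == "tie":
        return "tie"
    picked_first = verdict == "first"
    if swapped:
        return "B" if picked_first else "A"
    return "A" if picked_first else "B"


def cohens_kappa(labels_1, labels_2) -> float:
    """Cohen's kappa between two equal-length label sequences."""
    n = len(labels_1)
    if n == 0 or n != len(labels_2):
        raise ValueError("label sequences must be non-empty and equal length")
    observed = sum(a == b for a, b in zip(labels_1, labels_2)) / n
    counts_1, counts_2 = Counter(labels_1), Counter(labels_2)
    # Agreement expected by chance, from each rater's marginal label frequencies.
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Passing a seeded `random.Random` keeps the swap pattern reproducible across runs, which is in line with the reproducibility rules in 3.1.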

3.3 Code Eval Safety

  • Untrusted code in sandbox (gVisor, Firecracker, or Docker w/ seccomp + no-net)
  • Time limits (10s/test) + memory limits + syscall denylist
  • Never run untrusted generated code on shared infra without isolation
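
The time and memory limits can be sketched with the standard library as below. This covers only the limits layer and is not a sandbox on its own: it assumes a POSIX host and is meant to run inside the gVisor/Firecracker/Docker isolation listed above, which also provides the network and syscall restrictions. The function name `run_limited` and the 1 GiB memory default are assumptions for illustration.

```python
import resource
import subprocess
import sys


def _cap(limit: int, value: int) -> None:
    """Set soft and hard limits to `value`, never above the current hard limit."""
    _, hard = resource.getrlimit(limit)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(limit, (value, value))


def run_limited(code: str, timeout_s: float = 10.0, mem_bytes: int = 1 << 30) -> dict:
    """Run `code` in a child Python process with wall-clock, CPU, and memory caps."""

    def set_limits() -> None:
        # Executed in the child process just before exec (POSIX only).
        _cap(resource.RLIMIT_AS, mem_bytes)
        _cap(resource.RLIMIT_CPU, int(timeout_s) + 1)

    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores user site and env vars
            capture_output=True,
            text=True,
            timeout=timeout_s,  # wall-clock limit; the child is killed on expiry
            preexec_fn=set_limits,
        )
    except subprocess.TimeoutExpired:
        return {"status": "timeout", "stdout": "", "returncode": None}
    status = "passed" if proc.returncode == 0 else "failed"
    return {"status": status, "stdout": proc.stdout, "returncode": proc.returncode}
```

The wall-clock timeout catches code that sleeps or blocks, while the CPU limit acts as a backstop for busy loops if the parent process dies before it can kill the child.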

3.4 Red-Team

  • Static suite of jailbreak attempts + harmful requests
  • Track two rates: refusal rate on harmful prompts, and over-refusal rate on benign prompts (the dual failure mode)
  • Periodically refresh with new jailbreaks from research/Twitter
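
The two tracked rates can be computed as in the sketch below. The `RedTeamResult` record and its field names are hypothetical; how a response gets classified as a refusal (string match, classifier, or judge) is left out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RedTeamResult:
    is_harmful_prompt: bool  # label from the static suite
    refused: bool            # whether the model declined to answer


def refusal_metrics(results: List[RedTeamResult]) -> Dict[str, float]:
    """Refusal rate on harmful prompts and over-refusal rate on benign prompts."""
    harmful = [r for r in results if r.is_harmful_prompt]
    benign = [r for r in results if not r.is_harmful_prompt]
    if not harmful or not benign:
        raise ValueError("suite needs both harmful and benign prompts")
    return {
        "refusal_on_harmful": sum(r.refused for r in harmful) / len(harmful),
        "over_refusal": sum(r.refused for r in benign) / len(benign),
    }
```

Reporting both numbers together guards against a model that scores well on harmful prompts simply by refusing everything.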

3.5 Statistical Rigor

  • Bootstrap CIs on accuracy
  • For pairwise: Wilson interval on win-rate; n ≥ 200 gives roughly a ±7-point 95% interval near 50%, so resolving 5-point diffs needs n ≈ 400
  • McNemar's test for paired comparisons
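
The three tools above can be sketched with the standard library only. The function names are illustrative; the bootstrap is a plain percentile bootstrap, and the McNemar variant shown is the exact (binomial) form on discordant pairs, one of several common variants.

```python
import math
import random


def wilson_interval(wins: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def bootstrap_ci(outcomes, n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI on the mean of 0/1 outcomes (accuracy)."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b = items model A got right and model B got wrong; c = the reverse.
    Under the null, the discordant pairs split like a fair coin.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

For example, 100 wins out of 200 comparisons gives a Wilson interval of roughly 0.43 to 0.57, which shows how wide the uncertainty still is at that sample size.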

4. Promotion Gate Rules (example)

  • No eval may regress by more than 1 percentage point (absolute) vs current production
  • LLM-judge win-rate must be ≥ 50%, with the lower bound of its CI at or above 45%
  • Refusal-on-harmful ≥ 99%; over-refusal ≤ 5%
  • Manual override requires PR with justification
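
The example rules can be expressed as a single check, sketched below. The thresholds are the ones listed above; the function name, argument names, and the convention that scores are fractions in [0, 1] are assumptions for illustration. Manual override is deliberately not modeled, since it goes through a PR.

```python
from typing import Dict, List


def promotion_gate(
    candidate: Dict[str, float],
    production: Dict[str, float],
    judge_win_rate: float,
    judge_ci_low: float,
    refusal_on_harmful: float,
    over_refusal: float,
    max_regression: float = 0.01,
) -> List[str]:
    """Return the list of gate failures; an empty list means the checkpoint may promote."""
    failures = []
    for name, prod_score in production.items():
        cand_score = candidate.get(name)
        if cand_score is None:
            # A missing eval fails closed instead of being treated as a pass.
            failures.append(f"{name}: no candidate result")
        elif prod_score - cand_score > max_regression + 1e-9:  # tolerance for float rounding
            failures.append(f"{name}: regressed {prod_score - cand_score:.3f}")
    if judge_win_rate < 0.50:
        failures.append(f"judge win-rate {judge_win_rate:.3f} < 0.50")
    if judge_ci_low < 0.45:
        failures.append(f"judge CI lower bound {judge_ci_low:.3f} < 0.45")
    if refusal_on_harmful < 0.99:
        failures.append(f"refusal-on-harmful {refusal_on_harmful:.3f} < 0.99")
    if over_refusal > 0.05:
        failures.append(f"over-refusal {over_refusal:.3f} > 0.05")
    return failures
```

Returning the full list of failures, instead of stopping at the first one, gives the override PR a complete record of what was waived.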

5. Cost

  • Full eval suite ~$200-500 per checkpoint (LLM-judge dominates)
  • Cache aggressively