🛸 Hitchhiker's Guide — Phase 8: Evaluation & Safety

Read this if: You can train and serve LLMs but you can't yet defend a number with statistical rigor, design an LLM-as-judge with calibrated confidence, distinguish capability vs alignment evals, or articulate the major safety threat models.


0. The 30-second mental model

Eval is the scientific method applied to LLMs. There is no objective "good model" — only models that score well on tasks you care about, in distributions you care about, with biases you can tolerate, and at costs you can pay. A serious eval program has:

  1. Capability evals: knowledge (MMLU), reasoning (GSM8K, MATH), coding (HumanEval, MBPP, SWE-Bench), language (HellaSwag, BBH), tool use, long-context.
  2. Alignment / safety evals: refusal of harmful requests, over-refusal of benign ones, jailbreak resistance, bias measurement, sycophancy.
  3. Pairwise / preference evals: head-to-head with LLM-judge or humans.
  4. Real-world evals: shadow-traffic in production; user satisfaction; A/B win rates.
  5. Regression suite: every checkpoint runs the full battery; no promotion without passing.

By the end of Phase 8 you should:

  • Implement likelihood-based eval correctly (the lab does this on HellaSwag).
  • Use lm-evaluation-harness as a reference implementation and reproduce its numbers.
  • Design an LLM-as-judge with bias mitigation and human validation.
  • Compute confidence intervals, McNemar's test, and sample-size requirements.
  • Articulate the major contamination risks and detection methods.
  • Discuss threat models: misuse, prompt injection, model theft, alignment failures.

1. The two flavors of eval

1.1 Likelihood-based (no generation)

Used for multiple-choice tasks. For each candidate completion, compute the model's log-probability and pick the argmax. No sampling, no nondeterminism — fully reproducible.

For a HellaSwag example with 4 candidate endings:

$$ \hat{y} = \arg\max_{i \in \{A, B, C, D\}} \frac{1}{|y_i|} \sum_t \log P(y_{i,t} \mid x, y_{i,<t}) $$

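The formula above is the length-normalized variant: the per-token mean log-prob. Dropping the 1/|y_i| factor gives the raw summed log-prob, which is biased toward shorter answers because every extra token adds a negative term. lm-evaluation-harness reports these variants as acc (raw) and acc_norm (normalized).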

This is what Lab 01 implements.
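The selection rule can be sketched in a few lines, assuming the per-token log-probabilities for each candidate ending have already been computed. The function name `pick_choice` and the toy numbers are illustrative, not taken from the lab code.

```python
def pick_choice(token_logprobs, normalize=False):
    """Return the index of the highest-scoring candidate.

    token_logprobs: one list of per-token log-probs per candidate ending.
    normalize=False scores by summed log-prob; normalize=True divides the
    sum by the number of tokens (per-token mean).
    """
    scores = []
    for lp in token_logprobs:
        total = sum(lp)
        scores.append(total / len(lp) if normalize else total)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy example: a short ending and a longer, more confident-per-token ending.
endings = [
    [-1.0, -1.0],              # sum -2.0, mean -1.0
    [-0.6, -0.6, -0.6, -0.6],  # sum -2.4, mean -0.6
]
raw_pick = pick_choice(endings)                   # raw sum prefers the short ending
norm_pick = pick_choice(endings, normalize=True)  # mean prefers the longer one
```

The two variants disagree on this toy item, which is why harnesses report both numbers.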

1.2 Generation-based (sample, then judge)

Used for open-ended tasks (summarization, code generation, chat). Pipeline: generate output, then score it with one of:

  • Exact match / rule-based: GSM8K answer matching, regex extraction, code execution (HumanEval).
  • String-level metrics: BLEU, ROUGE, METEOR — older and brittle. Use only for translation/summarization, never for chat.
  • LLM judge: another (usually stronger) model rates outputs. Rich signal but biased — see §3.
  • Human judge: gold standard, costly and slow.

Generation-based evals introduce sampling variance. Either set temperature=0 (deterministic, but it may understate capability that sampling would surface) or sample N times and report the mean with a confidence interval.


2. The benchmarks you must know

2.1 Knowledge

  • MMLU (Hendrycks et al., 2021) — 57 subjects, 16k questions. The classic capability benchmark. Saturated at the top end (~90% for Claude 4 / GPT-4o); for modern models use MMLU-Pro, which has harder questions and ten answer options instead of four.
  • TriviaQA, NaturalQuestions — open-domain QA.
  • TruthfulQA — common misconceptions; tests whether models repeat falsehoods.

2.2 Reasoning

  • GSM8K — 8.5k grade-school math word problems. Saturated at the top.
  • MATH — high-school competition math. Still hard.
  • BBH (Big-Bench Hard) — 23 hard tasks from BIG-bench.
  • HellaSwag, ARC, PIQA — common-sense reasoning. Older, somewhat saturated.

2.3 Code

  • HumanEval (Chen et al., 2021) — 164 Python problems, judged by unit tests. pass@k metric.
  • MBPP — basic Python problems.
  • SWE-Bench — real GitHub issues; agent must produce a patch that passes tests. Very hard, very realistic.
  • LiveCodeBench, BigCodeBench — newer, less contaminated.
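The pass@k metric mentioned for HumanEval is usually computed with the unbiased estimator from Chen et al. (2021): generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k draws would pass. A minimal sketch follows; the function name and argument checks are illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem.

    n: samples generated, c: samples that passed, k: evaluation budget.
    Computes 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k draws include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 passing.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)
```

The benchmark score is the mean of this quantity over all problems. Using n larger than k reduces the variance of the estimate compared with sampling exactly k times.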

2.4 Language

  • WinoGrande — coreference / common sense.
  • LAMBADA — last-word prediction over long passages.

2.5 Long context

  • Needle in a Haystack — embed a fact in a long document, ask about it. Tests recall.
  • RULER — multi-needle, harder.
  • LongBench — diverse long-context tasks.

2.6 Pairwise / preference

  • MT-Bench (Zheng et al., 2023) — 80 multi-turn questions, graded by an LLM judge either as a single-answer score or head-to-head.
  • AlpacaEval — pairwise win rate vs a baseline (GPT-4-Turbo).
  • Chatbot Arena — human pairwise votes; produces an Elo leaderboard. The de-facto vibes benchmark.

2.7 Safety

  • HarmBench, AdvBench — harmful instructions; measures refusal rate.
  • XSTest — over-refusal of benign requests that look superficially harmful.
  • JailbreakBench — known jailbreaks; measures resistance.
  • BBQ — bias on stereotyped categories.

3. LLM-as-Judge — the most important pattern, with caveats

3.1 The pattern

Use a stronger LLM (or a different one) to compare outputs from two models on the same prompt and pick a winner (or rate a single output). Cheap, scalable; high agreement with humans on many tasks.

3.2 The biases (Zheng et al., 2023, Judging LLM-as-a-Judge)

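  • Position bias: the judge's verdict can flip when the two answers swap places, often favoring the first position; the size of the effect depends on the judge model and prompt. Mitigation: randomize order; or run both orderings and average.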
  • Verbosity bias: judge prefers longer answers. Mitigation: instruct against it; control for length in analysis.
  • Self-preference: a model tends to prefer outputs from itself or its family. Mitigation: use a different model family as judge.
  • Sycophancy / format bias: well-formatted (markdown, headers) wins regardless of content quality.

3.3 Validation: trust, but verify

Before trusting any LLM judge, collect 100–200 human-labeled pairwise judgments on the same data. Compute Cohen's κ between human and LLM judge. Require κ > 0.7 (substantial agreement) before deploying. Re-validate periodically.
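Cohen's κ corrects raw agreement for the agreement expected by chance. A minimal from-scratch sketch is below, assuming the human and judge verdicts are stored as two parallel lists of labels; the function name is illustrative.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always give the same single label
    return (observed - expected) / (1.0 - expected)


human = ["A", "A", "B", "B"]
judge = ["A", "A", "B", "A"]
kappa = cohens_kappa(human, judge)  # observed 0.75, chance 0.5
```

Raw agreement here is 75%, but κ is only 0.5 because half of that agreement would be expected by chance, which is the reason to gate on κ rather than on percent agreement.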

3.4 Pairwise prompt template

You are an impartial judge. Compare two answers to the question below.
Pick A, B, or "tie". Justify briefly.

Question: {q}
Answer A: {a}
Answer B: {b}

Verdict (A | B | tie):
Reasoning:

Run twice with order swapped; if disagreement, report tie.
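The order-swap rule can be sketched as follows. The `judge` callable is hypothetical: it stands in for whatever function sends the template above to a judge model and parses the verdict into "A", "B", or "tie".

```python
def judge_pair(judge, question, ans_1, ans_2):
    """Compare two answers with order swapping.

    judge(question, a, b) is a hypothetical callable returning 'A', 'B',
    or 'tie' for the answers shown in slots A and B.
    Returns 'model_1', 'model_2', or 'tie'.
    """
    first = judge(question, ans_1, ans_2)   # ans_1 shown in slot A
    second = judge(question, ans_2, ans_1)  # ans_1 shown in slot B
    # Translate the second verdict back into the first ordering's frame.
    second_unswapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    if first == second_unswapped:
        return {"A": "model_1", "B": "model_2", "tie": "tie"}[first]
    return "tie"  # the two orderings disagree
```

A judge that always answers "A" regardless of content produces a tie on every item under this rule, so pure position bias cannot manufacture wins.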


4. Statistical rigor

4.1 Confidence intervals on accuracy

For binary correct/wrong, accuracy is a binomial proportion. Use Wilson interval (better than normal approximation, especially near 0 or 1):

from statsmodels.stats.proportion import proportion_confint
ci_low, ci_high = proportion_confint(n_correct, n, alpha=0.05, method='wilson')

For continuous metrics (BLEU, faithfulness scores): bootstrap — resample with replacement N=1000 times; report 2.5% and 97.5% percentiles.
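A percentile bootstrap for the mean of a continuous metric might look like the sketch below. It uses only the standard library; the function name, the fixed seed, and the index rounding are choices made for this example.

```python
import random


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-item scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    # Resample n items with replacement and record the mean each time.
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


scores = [0.0, 1.0] * 50  # toy per-item scores with mean 0.5
ci_low, ci_high = bootstrap_ci(scores)
```

Resample at the level of eval items, not individual tokens or sentences, so that the interval reflects item-to-item variability.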

4.2 Comparing two models — paired McNemar's test

Two models A and B, both evaluated on the same N items. Build a 2×2 contingency table:

| | B correct | B wrong |
|---|---|---|
| A correct | n00 | n01 |
| A wrong | n10 | n11 |

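McNemar's test uses only the discordant cells n01 and n10 (the items where exactly one model is correct). Under the null hypothesis that A and B are equally accurate, each disagreement is equally likely to favor either model, so a lopsided split is evidence of a real difference. For pairwise win-rate from an LLM judge, use a Wilson CI on the win-rate.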
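An exact version of McNemar's test reduces to a two-sided binomial test on the disagreements with success probability 0.5. The sketch below uses only the standard library; the function name is illustrative. statsmodels also provides a `mcnemar` function in `statsmodels.stats.contingency_tables` if you prefer a library call.

```python
from math import comb


def mcnemar_exact(n01, n10):
    """Two-sided exact McNemar p-value from the two disagreement counts."""
    n = n01 + n10
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence
    k = min(n01, n10)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_lopsided = mcnemar_exact(0, 10)  # every disagreement favors one model
p_balanced = mcnemar_exact(5, 5)   # disagreements split evenly
```

Note that the counts of items both models get right or both get wrong never enter the calculation; only the items where they disagree carry information about which model is better.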

4.3 Sample size

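Rule of thumb: at 95% confidence, the interval on a single accuracy has a half-width of about ±5 points at N ≈ 400 and about ±1 point at N ≈ 10,000 (worst case, accuracy near 50%). So resolving a 5% difference takes at least a few hundred items, and resolving 1% takes on the order of ten thousand; paired tests on the same items can do somewhat better when the two models' errors are correlated. Many published benchmarks are smaller than this, so be skeptical of small differences.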
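The 400 and 10,000 figures are close to what the normal-approximation formula for a confidence-interval half-width gives at accuracy 0.5, namely n ≈ z² · p(1 − p) / d². The sketch below assumes that reading of the rule of thumb; it is not a full power analysis, and the function name is illustrative.

```python
from math import ceil


def n_for_half_width(d, p=0.5, z=1.96):
    """Items needed for a CI of half-width d around accuracy p.

    Normal approximation: n = z^2 * p * (1 - p) / d^2.
    z = 1.96 corresponds to a two-sided 95% interval.
    """
    if not (0 < d < 1 and 0 < p < 1):
        raise ValueError("d and p must be in (0, 1)")
    return ceil(z * z * p * (1 - p) / (d * d))


n_5pt = n_for_half_width(0.05)  # a few hundred items
n_1pt = n_for_half_width(0.01)  # roughly ten thousand items
```

Because n scales with 1/d², halving the difference you want to resolve quadruples the number of items required.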

4.4 Reproducibility hygiene

Pin everything:

  • Model weights hash.
  • Tokenizer version.
  • Eval harness version.
  • Prompt template (yes, every character matters).
  • Sampling parameters (or temperature=0).
  • Random seed.

Cache predictions keyed on hash(model_id + prompt_id + sampling_id) so expensive evals run once per checkpoint.
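One way to build such a key is sketched below, assuming the sampling parameters are available as a plain dict. Python's built-in `hash()` is randomized per process for strings, so a content hash such as SHA-256 is used instead; the function and field names are illustrative.

```python
import hashlib
import json


def cache_key(model_id, prompt_id, sampling):
    """Stable cache key for one (model, prompt, sampling config) triple."""
    payload = json.dumps(
        {"model": model_id, "prompt": prompt_id, "sampling": sampling},
        sort_keys=True,            # key order must not change the hash
        separators=(",", ":"),     # fixed whitespace for a canonical form
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key = cache_key("ckpt-0042", "hellaswag/17", {"temperature": 0.0, "top_p": 1.0})
```

Serializing to canonical JSON rather than concatenating strings avoids accidental collisions such as ("ab", "c") versus ("a", "bc").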


5. Eval contamination — the silent killer

5.1 The problem

Web-scale pretraining scoops up much of the public internet, including benchmark questions and answers. Models can ace MMLU partly by memorizing it, which inflates reported scores and makes them hard to interpret as capability.

5.2 Detection

  • N-gram overlap (Llama, GPT-3 papers): scan the training corpus for 13-grams from eval questions; flag matches. Llama-3 reports per-benchmark contamination percentages.
  • Embedding similarity scan for near-duplicates.
  • Loss-based detection: trained models have suspiciously low perplexity on memorized vs. paraphrased items. (Carlini et al. 2022)
  • Canary strings: insert unique nonce strings into the eval; if a model recites them, it saw the eval during training.
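The n-gram overlap check from the first bullet can be sketched on a small in-memory corpus as follows. Whitespace tokenization and lowercasing are simplifying assumptions made here; production pipelines use their own normalization and need an index that scales beyond a Python set.

```python
def ngrams(text, n=13):
    """Set of word-level n-grams in text (lowercased, whitespace-split)."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}


def contaminated_items(eval_items, corpus_docs, n=13):
    """Indices of eval items sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [
        i for i, item in enumerate(eval_items)
        if ngrams(item, n) & corpus_grams
    ]


# Toy example with n=3 so the overlap is visible in short strings.
corpus = ["The quick brown fox jumps over the lazy dog"]
evals = ["a quick brown fox appears", "nothing shared here at all"]
flagged = contaminated_items(evals, corpus, n=3)
```

Items shorter than n tokens produce no n-grams and are never flagged, so short questions need a smaller n or a different check.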

5.3 Prevention and mitigation

  • Strict filtering: dedup eval suites against the training corpus before training. Llama-3 deletes train docs with high overlap.
  • Held-out / private evals: companies maintain internal sets that aren't released.
  • Dynamic benchmarks: LiveBench, LiveCodeBench refresh their items monthly to outpace contamination.
  • Paraphrased variants: rephrase eval questions; if the model still gets them right, capability is real (not memorization).

6. Safety — threat models

6.1 Misuse

The model is asked to help with harmful tasks (weaponization, mass-influence ops, NCII, fraud). Defenses:

  • Refusal training in SFT/RLHF (refuse known categories).
  • Capability evaluations (CBRN, cyberoffense — Anthropic, OpenAI both publish these for frontier models).
  • System prompt + safety classifiers at the gateway.

6.2 Over-refusal

Model refuses benign requests that superficially resemble harmful ones ("how do I kill a process in Linux?"). Measured by XSTest, OR-Bench. The dual of refusal — track both.

6.3 Prompt injection

Untrusted text in context (search result, email, retrieved doc) carries instructions that hijack the model. Major risk for agents. Defenses (no silver bullet):

  1. Privilege separation — instructions from system / user are trusted; instructions from tool outputs are not.
  2. Sandboxed tools — tools run with the user's own permissions and never with privileges the model merely claims to have.
  3. Output filtering — check for exfil patterns, suspicious URLs.
  4. Human-in-the-loop for destructive actions.
  5. Defense in depth — assume injection will succeed at some rate; design the surrounding system to limit blast radius.

Read Simon Willison's prompt-injection blog series.

6.4 Jailbreaks

Adversarial prompts that bypass safety training. Categories:

  • Role-play / persona ("DAN", "you are an unethical AI").
  • Indirect ("write a story where a character explains how to ...").
  • Encoding tricks (base64, leetspeak, foreign languages).
  • Many-shot (Anthropic, 2024) — long context with many fake "examples" of harmful answers in prior turns.
  • Adversarial suffixes (Zou et al. 2023, GCG attack) — gradient-optimized strings that crack open-weights models.

6.5 Bias and fairness

Models reflect training-data biases. Eval frameworks: BBQ, BOLD, RealToxicityPrompts. Mitigations: data filtering, RLHF on counter-stereotype demonstrations, output filters.

6.6 Alignment failures (longer-horizon concerns)

  • Reward hacking — model finds adversarial paths to high reward (e.g., answers with a confident tone and bullet points always score higher → all answers become bullet lists).
  • Sycophancy — agrees with user's stated beliefs even when wrong. Sharma et al. 2023.
  • Specification gaming — pursues the literal objective in unintended ways.
  • Deceptive alignment — speculative; model behaves aligned during training, misaligned in deployment. Active research at Anthropic and ARC Evals (now METR).

6.7 Model theft / extraction

API attackers query a model and use the outputs to train a clone. Mitigations: rate limiting, watermarking outputs (Kirchenbauer et al. 2023), fingerprinting.


7. The lab walkthrough (lab-01-eval-harness)

7.1 What you'll build

A from-scratch likelihood-based evaluator that:

  1. Loads a model (HuggingFace transformers).
  2. Loads HellaSwag (validation split, ~10k items).
  3. For each item, computes per-token log-probabilities of each candidate ending.
  4. Picks the argmax (acc) and the length-normalized argmax (acc_norm).
  5. Reports accuracy + Wilson CI.
  6. Validates against lm-evaluation-harness reference numbers.

7.2 Things to read carefully

  • The score_choice(prompt, choice) function: tokenize the concatenation, run a forward pass, and gather log-probs at the choice positions only (not the prompt). Off-by-one is the most common bug: the logits at position t predict the token at position t+1, so indexes must line up with the shifted-by-one targets used in cross-entropy.
  • Length normalization: divide log-prob by number of choice tokens. Without it, the model favors shorter endings.
  • Batched evaluation: pad to the longest in the batch; use attention mask; gather only valid positions.
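The index alignment described above can be illustrated without loading a real model. In the sketch below, `next_token_logprobs` is a hypothetical stand-in for a forward pass followed by log-softmax, and `toy_model` is a made-up distribution used only to make the example runnable; neither is the lab's actual code.

```python
import math


def score_choice(prompt_ids, choice_ids, next_token_logprobs):
    """Summed and per-token mean log-prob of choice_ids given prompt_ids.

    next_token_logprobs(ids) returns one row per input position, where
    rows[t][v] = log P(token at position t+1 is v | ids[:t+1]).
    """
    if not prompt_ids or not choice_ids:
        raise ValueError("need a non-empty prompt and a non-empty choice")
    ids = list(prompt_ids) + list(choice_ids)
    rows = next_token_logprobs(ids)
    start = len(prompt_ids)
    # The token at position j is predicted by the row at position j - 1.
    lps = [rows[j - 1][ids[j]] for j in range(start, len(ids))]
    total = sum(lps)
    return total, total / len(lps)


def toy_model(ids, vocab=4):
    """Toy stand-in: probability 0.5 on (current token + 1) % vocab."""
    rows = []
    for tok in ids:
        row = [math.log(0.5 / (vocab - 1))] * vocab
        row[(tok + 1) % vocab] = math.log(0.5)
        rows.append(row)
    return rows


total, mean = score_choice([0, 1], [2, 3], toy_model)
```

With the toy model, the ending [2, 3] continues the counting pattern and scores higher than [3, 2]; indexing `rows[j]` instead of `rows[j - 1]` would score the wrong positions. A batched version additionally has to mask padding positions, which this sketch does not show.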

7.3 Reproducibility check

Run lm-evaluation-harness on the same model+task; your accuracy should match within 0.5%. If it doesn't, you have a bug — usually in tokenization or position alignment.


8. References

Required:

  • Liang et al. (2022), Holistic Evaluation of Language Models (HELM) — foundational.
  • Zheng et al. (2023), Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
  • Hendrycks et al. (2021), Measuring Massive Multitask Language Understanding (MMLU).
  • Chen et al. (2021), Evaluating Large Language Models Trained on Code (HumanEval).
  • Carlini et al. (2022), Quantifying Memorization Across Neural Language Models.
  • Anthropic, Responsible Scaling Policy documents.
  • OpenAI, Preparedness Framework.
  • Simon Willison's prompt-injection series.

Important:

  • Bai et al. (2022), Constitutional AI.
  • Sharma et al. (2023), Towards Understanding Sycophancy in Language Models.
  • Zou et al. (2023), Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG).
  • Anil et al. (2024), Many-shot Jailbreaking (Anthropic).
  • The lm-evaluation-harness README and source code.
  • The RAGAS docs.

9. Common interview questions on Phase 8 material

  1. Walk through how you'd evaluate a new chat model end-to-end.
  2. What's the difference between likelihood eval and generation eval?
  3. What biases does an LLM judge have, and how do you mitigate them?
  4. Eval scores improved 0.3% — is that significant?
  5. How would you detect benchmark contamination in your training data?
  6. What's prompt injection and how do you defend against it?
  7. Difference between refusal and over-refusal — how do you track both?
  8. What's pass@k in HumanEval and why is it useful?
  9. Design an eval gate for a fine-tuning pipeline.
  10. How do you compare two models statistically? (McNemar / Wilson.)
  11. Your safety eval shows 99% refusal but users complain it refuses too much. What now?
  12. Compare MT-Bench, AlpacaEval, and Chatbot Arena — what does each measure?

10. From solid → exceptional

  • Reproduce three benchmark numbers from a real model card (e.g., Llama-3 8B's MMLU and GSM8K). Match within 1%.
  • Build a small LLM-judge harness: pairwise comparison with order swap, position-bias mitigation, validated against 100 human labels.
  • Implement n-gram contamination detection on a small corpus vs MMLU. Report % overlap.
  • Run a GCG attack (or a published variant) on a small open model; document refusal-rate before/after.
  • Build a shadow-eval pipeline that scores production traffic continuously and alerts on drift.
  • Read all three Anthropic safety / RSP documents and write a one-page operational summary.
  • Run a red-team session against your own RAG service from Phase 7; document every successful jailbreak.

| Day | Activity |
|---|---|
| Mon | Read HELM + Zheng et al. Judging LLM-as-a-Judge |
| Tue | Read Carlini memorization paper + GCG attack paper |
| Wed | Lab 01: implement HellaSwag eval; reproduce harness numbers |
| Thu | Build a small LLM-judge with 50 manual labels; compute κ |
| Fri | Implement Wilson CI + McNemar's test as utility scripts |
| Sat | Skim Anthropic RSP + OpenAI Preparedness Framework |
| Sun | Mock interview the 12 questions; whiteboard threat models |