🛸 Hitchhiker's Guide — Phase 8: Evaluation & Safety
Read this if: You can train and serve LLMs but you can't yet defend a number with statistical rigor, design an LLM-as-judge with calibrated confidence, distinguish capability vs alignment evals, or articulate the major safety threat models.
0. The 30-second mental model
Eval is the scientific method applied to LLMs. There is no objective "good model" — only models that score well on tasks you care about, in distributions you care about, with biases you can tolerate, and at costs you can pay. A serious eval program has:
- Capability evals: knowledge (MMLU), reasoning (GSM8K, MATH), coding (HumanEval, MBPP, SWE-Bench), language (HellaSwag, BBH), tool use, long-context.
- Alignment / safety evals: refusal of harmful requests, over-refusal of benign ones, jailbreak resistance, bias measurement, sycophancy.
- Pairwise / preference evals: head-to-head with LLM-judge or humans.
- Real-world evals: shadow-traffic in production; user satisfaction; A/B win rates.
- Regression suite: every checkpoint runs the full battery; no promotion without passing.
By the end of Phase 8 you should:
- Implement likelihood-based eval correctly (the lab does this on HellaSwag).
- Use lm-evaluation-harness as a reference implementation and reproduce its numbers.
- Design an LLM-as-judge with bias mitigation and human validation.
- Compute confidence intervals, McNemar's test, and sample-size requirements.
- Articulate the major contamination risks and detection methods.
- Discuss threat models: misuse, prompt injection, model theft, alignment failures.
1. The two flavors of eval
1.1 Likelihood-based (no generation)
Used for multiple-choice tasks. For each candidate completion, compute the model's log-probability and pick the argmax. No sampling, no nondeterminism — fully reproducible.
For a HellaSwag example with 4 candidate endings:
$$ \hat{y} = \arg\max_{i \in \{A, B, C, D\}} \frac{1}{|y_i|} \sum_t \log P(y_{i,t} \mid x, y_{i,<t}) $$
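The 1/|y_i| factor is length normalization (per-token mean log-prob). Without it, the raw summed log-prob is biased toward shorter answers, because every extra token adds a negative term. lm-evaluation-harness reports both variants: acc (unnormalized) and acc_norm (length-normalized).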
This is what Lab 01 implements.
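To make the difference between the two selection rules concrete, here is a minimal sketch. It assumes the per-token log-probabilities of each candidate ending have already been computed; `pick_endings` and the example numbers are hypothetical, not part of the lab code.

```python
from typing import List, Tuple


def pick_endings(choice_token_logprobs: List[List[float]]) -> Tuple[int, int]:
    """Return (index chosen by summed log-prob, index chosen by mean log-prob)."""
    sums = [sum(lps) for lps in choice_token_logprobs]
    means = [s / len(lps) for s, lps in zip(sums, choice_token_logprobs)]
    by_sum = max(range(len(sums)), key=sums.__getitem__)
    by_mean = max(range(len(means)), key=means.__getitem__)
    return by_sum, by_mean


# Hypothetical per-token log-probs for two candidate endings.
example = [
    [-1.0, -1.0],              # short ending: sum -2.0, mean -1.0
    [-0.6, -0.6, -0.6, -0.6],  # long ending:  sum -2.4, mean -0.6
]
print(pick_endings(example))  # (0, 1)
```

The summed score picks the short ending while the per-token mean picks the long one, which is why the two rules can report different accuracies on the same model.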
1.2 Generation-based (sample, then judge)
Used for open-ended tasks (summarization, code generation, chat). Pipeline: generate output, then score it with one of:
- Exact match / rule-based: GSM8K answer matching, regex extraction, code execution (HumanEval).
- String-level metrics: BLEU, ROUGE, METEOR — older and brittle. Use only for translation/summarization, never for chat.
- LLM judge: another (usually stronger) model rates outputs. Rich signal but biased — see §3.
- Human judge: gold standard, costly and slow.
Generation-based evals introduce sampling variance. Either set temperature=0 (deterministic, but maybe under-explores model capability) or sample N times and report mean/CI.
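As an illustration of the rule-based scoring option, the sketch below does GSM8K-style answer matching. It assumes the reference follows the dataset's `#### <answer>` convention and that taking the last number in the model output is an acceptable extraction rule; `last_number` and `gsm8k_correct` are hypothetical helper names, and real harnesses use more careful extraction.

```python
import re
from typing import Optional

# An optional sign, digits with optional thousands separators, optional decimals.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text: str) -> Optional[str]:
    """Return the last number found in text, with commas stripped."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def gsm8k_correct(model_output: str, reference: str) -> bool:
    """Exact match between the model's final number and the gold answer."""
    gold = reference.split("####")[-1].strip().replace(",", "")
    pred = last_number(model_output)
    return pred is not None and float(pred) == float(gold)


print(gsm8k_correct("5 + 13 = 18, so she has 18 apples.", "5 + 13 = 18\n#### 18"))  # True
```

Extraction rules like this are part of the prompt-and-scoring contract: changing the regex can move the reported accuracy, so it should be pinned along with the prompt template.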
2. The benchmarks you must know
2.1 Knowledge
- MMLU (Hendrycks et al., 2021) — 57 subjects, 16k questions. The classic capability benchmark. Saturated at the top end (~90% for Claude 4 / GPT-4o); use MMLU-Pro (more rigorous) for modern models.
- TriviaQA, NaturalQuestions — open-domain QA.
- TruthfulQA — common misconceptions; tests whether models repeat falsehoods.
2.2 Reasoning
- GSM8K — 8.5k grade-school math word problems. Saturated at the top.
- MATH — high-school competition math. Still hard.
- BBH (Big-Bench Hard) — 23 hard tasks from BIG-bench.
- HellaSwag, ARC, PIQA — common-sense reasoning. Older, somewhat saturated.
2.3 Code
- HumanEval (Chen et al., 2021): 164 Python problems, judged by unit tests. Reported with the pass@k metric.
- MBPP: basic Python problems.
- SWE-Bench — real GitHub issues; agent must produce a patch that passes tests. Very hard, very realistic.
- LiveCodeBench, BigCodeBench — newer, less contaminated.
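The pass@k metric used by HumanEval can be sketched as follows. This follows the unbiased estimator described in Chen et al. (2021): draw n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples passes. The function name and argument checks are my own.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    # 1 minus the probability that a random size-k subset is all failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Benchmark-level pass@k is the mean of the per-problem estimates.
per_problem = [pass_at_k(10, 3, 1), pass_at_k(10, 0, 1), pass_at_k(10, 10, 1)]
print(sum(per_problem) / len(per_problem))
```

Sampling n > k and using this estimator gives lower variance than literally drawing k samples and checking whether any passes.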
2.4 Language
- WinoGrande — coreference / common sense.
- LAMBADA — last-word prediction over long passages.
2.5 Long context
- Needle in a Haystack — embed a fact in a long document, ask about it. Tests recall.
- RULER — multi-needle, harder.
- LongBench — diverse long-context tasks.
2.6 Pairwise / preference
- MT-Bench (Zheng et al., 2023) — 80 multi-turn questions; LLM-judge head-to-head.
- AlpacaEval — pairwise win rate vs a baseline (GPT-4-Turbo).
- Chatbot Arena — human pairwise votes; produces an Elo leaderboard. The de-facto vibes benchmark.
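For intuition on how pairwise votes turn into a leaderboard, here is the textbook Elo update. This is a sketch only: the K-factor of 32 and the 400-point scale are conventional defaults that I am assuming, and the rating pipeline actually used by Chatbot Arena may differ.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Update both ratings after one vote; score_a is 1 (A wins), 0.5 (tie), 0 (B wins)."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


print(elo_update(1000.0, 1000.0, 1.0))  # (1016.0, 984.0)
```

Because each update depends on the current ratings, online Elo is sensitive to the order of votes; that is one reason to treat small leaderboard gaps with caution.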
2.7 Safety
- HarmBench, AdvBench — harmful instructions; measures refusal rate.
- XSTest — over-refusal of benign requests that look superficially harmful.
- JailbreakBench — known jailbreaks; measures resistance.
- BBQ — bias on stereotyped categories.
3. LLM-as-Judge — the most important pattern, with caveats
3.1 The pattern
Use a stronger LLM (or a different one) to compare outputs from two models on the same prompt and pick a winner (or rate a single output). Cheap, scalable; high agreement with humans on many tasks.
3.2 The biases (Zheng et al., 2023, Judging LLM-as-a-Judge)
- Position bias: judge prefers the first answer ~30% more often than chance. Mitigation: randomize order; or run both orderings and average.
- Verbosity bias: judge prefers longer answers. Mitigation: instruct against it; control for length in analysis.
- Self-preference: a model tends to prefer outputs from itself or its family. Mitigation: use a different model family as judge.
- Sycophancy / format bias: well-formatted (markdown, headers) wins regardless of content quality.
3.3 Validation: trust, but verify
Before trusting any LLM judge, collect 100–200 human-labeled pairwise judgments on the same data. Compute Cohen's κ between human and LLM judge. Require κ > 0.7 (substantial agreement) before deploying. Re-validate periodically.
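Cohen's κ corrects raw agreement for the agreement two raters would reach by chance given their label frequencies. A dependency-free sketch, with a hypothetical function name and toy labels:

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label sequences")
    n = len(rater_a)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters independently pick the same label.
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a.keys() | counts_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


human = ["A", "A", "B", "B"]
judge = ["A", "B", "A", "B"]
print(cohens_kappa(human, judge))  # 0.0: 50% raw agreement, all of it expected by chance
```

The example shows why raw agreement is not enough: 50% agreement sounds moderate but carries no information when chance alone would produce it.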
3.4 Pairwise prompt template
You are an impartial judge. Compare two answers to the question below.
Pick A, B, or "tie". Justify briefly.
Question: {q}
Answer A: {a}
Answer B: {b}
Verdict (A | B | tie):
Reasoning:
Run twice with order swapped; if disagreement, report tie.
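The order-swap rule can be sketched as below. The `judge` callable is an assumption: any function that takes (question, first answer, second answer) and returns "A", "B", or "tie", for example a wrapper around an API call that fills in the template above. Unparseable verdicts are treated as ties here, which is a design choice rather than a standard.

```python
from typing import Callable

# Hypothetical judge interface: (question, answer shown as A, answer shown as B) -> verdict.
Judge = Callable[[str, str, str], str]


def debiased_verdict(judge: Judge, question: str, answer_1: str, answer_2: str) -> str:
    """Return "1", "2", or "tie" after judging both presentation orders."""
    first = judge(question, answer_1, answer_2)   # answer_1 shown in slot A
    second = judge(question, answer_2, answer_1)  # answer_1 shown in slot B
    v1 = {"A": "1", "B": "2"}.get(first, "tie")
    v2 = {"A": "2", "B": "1"}.get(second, "tie")
    return v1 if v1 == v2 else "tie"


def always_first(question: str, a: str, b: str) -> str:
    """A maximally position-biased stub judge, for demonstration only."""
    return "A"


print(debiased_verdict(always_first, "q", "x", "y"))  # tie
```

A judge that always picks the first slot produces contradictory verdicts across the two orders, so every comparison collapses to a tie instead of inflating one model's win rate.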
4. Statistical rigor
4.1 Confidence intervals on accuracy
For binary correct/wrong, accuracy is a binomial proportion. Use Wilson interval (better than normal approximation, especially near 0 or 1):
from statsmodels.stats.proportion import proportion_confint
ci_low, ci_high = proportion_confint(n_correct, n, alpha=0.05, method='wilson')
For continuous metrics (BLEU, faithfulness scores): bootstrap — resample with replacement N=1000 times; report 2.5% and 97.5% percentiles.
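Both intervals can also be computed with the standard library alone, which is handy for checking a harness. The sketch below implements the Wilson score interval (without continuity correction) and a percentile bootstrap over per-item scores; the function names, the fixed seed, and the percentile indexing are my own choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean
from typing import Sequence, Tuple


def wilson_ci(n_correct: int, n: int, alpha: float = 0.05) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    p = n_correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def bootstrap_ci(
    scores: Sequence[float], n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-item scores."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(scores)
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return means[lo_idx], means[hi_idx]


low, high = wilson_ci(50, 100)
print(round(low, 3), round(high, 3))  # 0.404 0.596
```

Resample whole items, not tokens or judge calls within an item, so that the bootstrap respects the unit at which the eval data was sampled.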
4.2 Comparing two models — paired McNemar's test
Two models A and B, both evaluated on the same N items. Build a 2×2 contingency table:
| | B correct | B wrong |
|---|---|---|
| A correct | n00 | n01 |
| A wrong | n10 | n11 |
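Run McNemar's test on n01 vs n10 (the disagreements). Items that both models get right or both get wrong carry no information about which model is better; under the null hypothesis of equal accuracy, n01 and n10 should be roughly equal, so a large imbalance is evidence of a real difference. For a pairwise win rate from an LLM judge, use a Wilson CI on the win rate.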
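A minimal sketch of the exact (binomial) form of McNemar's test, using only the two discordant counts. Under the null hypothesis each disagreement is equally likely to favor either model, so the smaller count follows Binomial(n01 + n10, 0.5). The function name is hypothetical; statsmodels also ships an implementation.

```python
from math import comb


def mcnemar_exact(n01: int, n10: int) -> float:
    """Two-sided exact McNemar p-value from the discordant counts."""
    n = n01 + n10
    if n == 0:
        return 1.0  # the models never disagree, so there is no evidence either way
    k = min(n01, n10)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(mcnemar_exact(10, 10))  # 1.0
print(mcnemar_exact(15, 0) < 0.001)  # True
```

Note that the test ignores n00 and n11 entirely, so two models with very different headline accuracy on different test sets cannot be compared this way; the items must be shared.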
4.3 Sample size
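As a rule of thumb, a 95% confidence interval on accuracy has a half-width of about ±5 points at N ≈ 400 items and about ±1 point at N ≈ 10,000 (worst case, accuracy near 50%). Differences smaller than the half-width are hard to separate from noise unless the two models are compared on the same items with a paired test. Many published benchmarks are far smaller than 10,000 items, so be skeptical of small differences.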
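The N ≈ 400 and N ≈ 10,000 figures match the worst-case margin-of-error formula N = z² p(1 − p) / d², with p = 0.5 and z ≈ 1.96 for a 95% interval. A small sketch, with a hypothetical function name; note this sizes a single model's confidence interval and is not a full power analysis for a two-model comparison.

```python
from math import ceil
from statistics import NormalDist


def n_for_margin(margin: float, p: float = 0.5, alpha: float = 0.05) -> int:
    """Items needed so the normal-approximation CI half-width is at most `margin`."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return ceil(z * z * p * (1 - p) / margin ** 2)


print(n_for_margin(0.05))  # 385
print(n_for_margin(0.01))  # 9604
```

Halving the margin quadruples the required sample size, which is why resolving sub-1% differences is out of reach for most benchmark sizes.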
4.4 Reproducibility hygiene
Pin everything:
- Model weights hash.
- Tokenizer version.
- Eval harness version.
- Prompt template (yes, every character matters).
- Sampling parameters (or temperature=0).
- Random seed.
Cache predictions keyed on hash(model_id + prompt_id + sampling_id) so expensive evals run once per checkpoint.
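One way to build such a cache key is sketched below. Two details are worth noting: Python's built-in `hash()` is salted per process for strings, so it is not stable across runs, and plain string concatenation can collide ("ab" + "c" equals "a" + "bc"). The sketch therefore serializes the fields to canonical JSON and hashes with SHA-256; the field names are assumptions.

```python
import hashlib
import json
from typing import Any, Dict


def cache_key(model_id: str, prompt_id: str, sampling: Dict[str, Any]) -> str:
    """Stable key for one (model, prompt, sampling config) prediction."""
    payload = json.dumps(
        {"model_id": model_id, "prompt_id": prompt_id, "sampling": sampling},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),   # fixed separators keep the encoding canonical
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key = cache_key("ckpt-0042", "hellaswag/val/17", {"temperature": 0.0, "max_tokens": 64})
print(len(key))  # 64
```

If the prompt template or harness version changes, include those in the key as well, otherwise stale predictions will be served for a different eval.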
5. Eval contamination — the silent killer
5.1 The problem
Web-scale pretraining scoops up the entire internet — including benchmark questions and answers. Models ace MMLU partly by memorizing it. Reported scores become meaningless.
5.2 Detection
- N-gram overlap (Llama, GPT-3 papers): scan the training corpus for 13-grams from eval questions; flag matches. Llama-3 reports per-benchmark contamination percentages.
- Embedding similarity scan for near-duplicates.
- Loss-based detection: trained models have suspiciously low perplexity on memorized vs. paraphrased items. (Carlini et al. 2022)
- Canary strings: insert unique nonce strings into the eval; if a model recites them, it saw the eval during training.
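The n-gram overlap check can be sketched in a few lines. This version assumes lowercased whitespace tokenization and an in-memory set of corpus n-grams, which only works for small corpora; production pipelines use the training tokenizer and hashed or sharded indexes. Function names are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    """All word n-grams of a text, lowercased and whitespace-tokenized."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(eval_items: List[str], corpus_docs: Iterable[str], n: int = 13) -> List[int]:
    """Indices of eval items that share at least one n-gram with the corpus."""
    corpus_ngrams: Set[Tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [i for i, item in enumerate(eval_items) if ngrams(item, n) & corpus_ngrams]


corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
items = ["Quick brown fox jumps over the lazy dog near the river bank today ok", "an unrelated question"]
flagged = flag_contaminated(items, corpus, n=13)
print(f"{len(flagged) / len(items):.0%} of items flagged")  # 50% of items flagged
```

Items shorter than n tokens produce no n-grams and are never flagged, so short questions need a smaller n or a different detector.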
5.3 Prevention and mitigation
- Strict filtering: dedup eval suites against the training corpus before training. Llama-3 deletes train docs with high overlap.
- Held-out / private evals: companies maintain internal sets that aren't released.
- Dynamic benchmarks: LiveBench, LiveCodeBench refresh their items monthly to outpace contamination.
- Paraphrased variants: rephrase eval questions; if the model still gets them right, capability is real (not memorization).
6. Safety — threat models
6.1 Misuse
The model is asked to help with harmful tasks (weaponization, mass-influence ops, NCII, fraud). Defenses:
- Refusal training in SFT/RLHF (refuse known categories).
- Capability evaluations (CBRN, cyberoffense — Anthropic, OpenAI both publish these for frontier models).
- System prompt + safety classifiers at the gateway.
6.2 Over-refusal
Model refuses benign requests that superficially resemble harmful ones ("how do I kill a process in Linux?"). Measured by XSTest, OR-Bench. The dual of refusal — track both.
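Tracking both sides means reporting two rates from two prompt sets: refusal rate on harmful prompts (should be high) and refusal rate on benign look-alikes (should be low). The sketch below uses a crude keyword detector purely for illustration; the marker list is an assumption, and real evaluations typically use a trained classifier or an LLM judge to label refusals.

```python
from typing import Dict, Sequence

# Hypothetical refusal markers; keyword matching misses many phrasings.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """Heuristic: does the response contain a known refusal phrase?"""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_report(harmful: Sequence[str], benign: Sequence[str]) -> Dict[str, float]:
    """Refusal rate on harmful prompts and over-refusal rate on benign ones."""
    return {
        "refusal_rate": sum(map(is_refusal, harmful)) / len(harmful),
        "over_refusal_rate": sum(map(is_refusal, benign)) / len(benign),
    }


report = refusal_report(
    harmful=["I can't help with that.", "Sure, here is how..."],
    benign=["To stop it, run kill <pid>.", "I'm sorry, but I cannot help with that."],
)
print(report)  # {'refusal_rate': 0.5, 'over_refusal_rate': 0.5}
```

Reporting the two numbers side by side prevents the failure mode where a model reaches a high refusal rate simply by refusing everything.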
6.3 Prompt injection
Untrusted text in context (search result, email, retrieved doc) carries instructions that hijack the model. Major risk for agents. Defenses (no silver bullet):
- Privilege separation — instructions from system / user are trusted; instructions from tool outputs are not.
- Sandboxed tools — tools execute under the user's identity, not the model's claims.
- Output filtering — check for exfil patterns, suspicious URLs.
- Human-in-the-loop for destructive actions.
- Defense in depth — assume jailbreak will occur at some rate; design surrounding system to limit blast radius.
Read Simon Willison's prompt-injection blog series.
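Two of the defenses above, privilege separation and human-in-the-loop for destructive actions, can be expressed as a policy gate in front of tool execution. This is a sketch under a strong assumption: that the surrounding system can reliably tag where a requested action originated, which is itself the hard part. The tool names and the `ToolCall` type are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical set of tools with irreversible side effects.
DESTRUCTIVE_TOOLS = {"delete_file", "send_email", "transfer_funds"}


@dataclass(frozen=True)
class ToolCall:
    name: str
    requested_by: str  # "system", "user", or "tool_output"


def decide(call: ToolCall, human_approved: bool = False) -> str:
    """Return "allow", "needs_approval", or "deny" for a proposed tool call."""
    if call.requested_by == "tool_output":
        # Instructions found in retrieved or tool-produced text are never trusted.
        return "deny"
    if call.name in DESTRUCTIVE_TOOLS and not human_approved:
        return "needs_approval"
    return "allow"


print(decide(ToolCall("send_email", "tool_output")))  # deny
print(decide(ToolCall("send_email", "user")))         # needs_approval
```

The gate does not stop injection itself; it limits what a hijacked model can do, which is the blast-radius idea from the list above.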
6.4 Jailbreaks
Adversarial prompts that bypass safety training. Categories:
- Role-play / persona ("DAN", "you are an unethical AI").
- Indirect ("write a story where a character explains how to ...").
- Encoding tricks (base64, leetspeak, foreign languages).
- Many-shot (Anthropic, 2024) — long context with many fake "examples" of harmful answers in prior turns.
- Adversarial suffixes (Zou et al. 2023, GCG attack) — gradient-optimized strings that crack open-weights models.
6.5 Bias and fairness
Models reflect training-data biases. Eval frameworks: BBQ, BOLD, RealToxicityPrompts. Mitigations: data filtering, RLHF on counter-stereotype demonstrations, output filters.
6.6 Alignment failures (longer-horizon concerns)
- Reward hacking — model finds adversarial paths to high reward (e.g., answers with a confident tone and bullet points always score higher → all answers become bullet lists).
- Sycophancy — agrees with user's stated beliefs even when wrong. Sharma et al. 2023.
- Specification gaming — pursues the literal objective in unintended ways.
- Deceptive alignment: speculative; the model behaves as aligned during training but misaligned in deployment. Active research at Anthropic and METR (formerly ARC Evals).
6.7 Model theft / extraction
API attackers query a model and use the outputs to train a clone. Mitigations: rate limiting, watermarking outputs (Kirchenbauer et al. 2023), fingerprinting.
7. The lab walkthrough (lab-01-eval-harness)
7.1 What you'll build
A from-scratch likelihood-based evaluator that:
- Loads a model (HuggingFace transformers).
- Loads HellaSwag (validation split, ~10k items).
- For each item, computes per-token log-probabilities of each candidate ending.
- Picks the argmax (acc) and the length-normalized argmax (acc_norm).
- Reports accuracy + Wilson CI.
- Validates against lm-evaluation-harness reference numbers.
7.2 Things to read carefully
- The score_choice(prompt, choice) function: tokenize the concatenation, run a forward pass, and gather log-probs at the choice positions only (not the prompt). Off-by-one is the most common bug: the logits at position t predict the token at position t+1, so make sure the indices line up with the shifted-by-one cross-entropy targets.
- Length normalization: divide the summed log-prob by the number of choice tokens. Without it, the model favors shorter endings.
- Batched evaluation: pad to the longest sequence in the batch; use the attention mask; gather only valid positions.
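The off-by-one alignment can be shown without any model. The sketch assumes you already have a table `next_token_logprobs[t][v]` holding log P(token at position t+1 = v | tokens 0..t), which is what a log-softmax over a causal LM's logits gives you. The function name is hypothetical and this is plain-Python indexing, not the lab's batched implementation.

```python
from typing import List, Sequence


def gather_choice_logprobs(
    next_token_logprobs: Sequence[Sequence[float]],
    input_ids: List[int],
    prompt_len: int,
) -> List[float]:
    """Log-probs of the choice tokens only, for input_ids = prompt + choice."""
    if prompt_len < 1:
        raise ValueError("need at least one prompt token to condition on")
    # The token at position pos is predicted by the row at position pos - 1.
    return [
        next_token_logprobs[pos - 1][input_ids[pos]]
        for pos in range(prompt_len, len(input_ids))
    ]


# Toy table over a 3-token vocabulary: row t, column v holds -(10 * t + v).
table = [[-(10 * t + v) for v in range(3)] for t in range(4)]
ids = [0, 1, 2, 1]  # two prompt tokens followed by two choice tokens
print(gather_choice_logprobs(table, ids, prompt_len=2))  # [-12, -21]
```

The first choice token (id 2 at position 2) is read from row 1, the last prompt position. Reading it from row 2 instead is the classic bug, and it silently shifts every score.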
7.3 Reproducibility check
Run lm-evaluation-harness on the same model+task; your accuracy should match within 0.5%. If it doesn't, you have a bug — usually in tokenization or position alignment.
8. References
Required:
- Liang et al. (2022), Holistic Evaluation of Language Models (HELM) — foundational.
- Zheng et al. (2023), Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
- Hendrycks et al. (2021), Measuring Massive Multitask Language Understanding (MMLU).
- Chen et al. (2021), Evaluating Large Language Models Trained on Code (HumanEval).
- Carlini et al. (2022), Quantifying Memorization Across Neural Language Models.
- Anthropic, Responsible Scaling Policy documents.
- OpenAI, Preparedness Framework.
- Simon Willison's prompt-injection series.
Important:
- Bai et al. (2022), Constitutional AI.
- Sharma et al. (2023), Towards Understanding Sycophancy in Language Models.
- Zou et al. (2023), Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG).
- Anil et al. (2024), Many-shot Jailbreaking (Anthropic).
- The lm-evaluation-harness README and source code.
- The RAGAS docs.
9. Common interview questions on Phase 8 material
- Walk through how you'd evaluate a new chat model end-to-end.
- What's the difference between likelihood eval and generation eval?
- What biases does an LLM judge have, and how do you mitigate them?
- Eval scores improved 0.3% — is that significant?
- How would you detect benchmark contamination in your training data?
- What's prompt injection and how do you defend against it?
- Difference between refusal and over-refusal — how do you track both?
- What's pass@k in HumanEval and why is it useful?
- Design an eval gate for a fine-tuning pipeline.
- How do you compare two models statistically? (McNemar / Wilson.)
- Your safety eval shows 99% refusal but users complain it refuses too much. What now?
- Compare MT-Bench, AlpacaEval, and Chatbot Arena — what does each measure?
10. From solid → exceptional
- Reproduce three benchmark numbers from a real model card (e.g., Llama-3 8B's MMLU and GSM8K). Match within 1%.
- Build a small LLM-judge harness: pairwise comparison with order swap, position-bias mitigation, validated against 100 human labels.
- Implement n-gram contamination detection on a small corpus vs MMLU. Report % overlap.
- Run a GCG attack (or a published variant) on a small open model; document refusal-rate before/after.
- Build a shadow-eval pipeline that scores production traffic continuously and alerts on drift.
- Read all three Anthropic safety / RSP documents and write a one-page operational summary.
- Run a red-team session against your own RAG service from Phase 7; document every successful jailbreak.
11. Recommended cadence
| Day | Activity |
|---|---|
| Mon | Read HELM + Zheng et al. Judging LLM-as-a-Judge |
| Tue | Read Carlini memorization paper + GCG attack paper |
| Wed | Lab 01 — implement HellaSwag eval; reproduce harness numbers |
| Thu | Build a small LLM-judge with 50 manual labels; compute κ |
| Fri | Implement Wilson CI + McNemar's test as utility scripts |
| Sat | Skim Anthropic RSP + OpenAI Preparedness Framework |
| Sun | Mock interview the 12 questions; whiteboard threat models |