🛸 Hitchhiker's Guide — Phase 8: Evaluation & Safety

Read this if: You can train and serve LLMs but you can't yet defend a number with statistical rigor, design an LLM-as-judge with calibrated confidence, distinguish capability vs alignment evals, or articulate the major safety threat models.

0. The 30-second mental model

Eval is the scientific method applied to LLMs. There is no objective "good model" — only models that score well on tasks you care about, in distributions you care about, with biases you can tolerate, and at costs you can pay. A serious eval program has:

Capability evals: knowledge (MMLU), reasoning (GSM8K, MATH, BBH), common sense (HellaSwag), coding (HumanEval, MBPP, SWE-Bench), language (WinoGrande, LAMBADA), tool use, long-context.
Alignment / safety evals: refusal of harmful requests, over-refusal of benign ones, jailbreak resistance, bias measurement, sycophancy.
Pairwise / preference evals: head-to-head with LLM-judge or humans.
Real-world evals: shadow-traffic in production; user satisfaction; A/B win rates.
Regression suite: every checkpoint runs the full battery; no promotion without passing.

By the end of Phase 8 you should:

Implement likelihood-based eval correctly (the lab does this on HellaSwag).
Use lm-evaluation-harness as a reference implementation and reproduce its numbers.
Design an LLM-as-judge with bias mitigation and human validation.
Compute confidence intervals, McNemar's test, and sample-size requirements.
Articulate the major contamination risks and detection methods.
Discuss threat models: misuse, prompt injection, model theft, alignment failures.

1. The two flavors of eval

1.1 Likelihood-based (no generation)

Used for multiple-choice tasks. For each candidate completion, compute the model's log-probability and pick the argmax. No sampling is involved, so results are reproducible up to floating-point and batching noise.

For a HellaSwag example with 4 candidate endings:

$$ \hat{y} = \arg\max_{i \in \{A, B, C, D\}} \frac{1}{|y_i|} \sum_t \log P(y_{i,t} \mid x, y_{i,<t}) $$

The 1/|y_i| factor is length normalization (per-token mean log-prob). Without it the raw sum favors shorter answers, because every extra token adds a negative term. lm-evaluation-harness reports both variants, as acc (raw sum) and acc_norm (length-normalized).

This is what Lab 01 implements.
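The difference between the acc and acc_norm variants described above can be shown with a minimal sketch. It assumes the per-token log-probs for each ending have already been computed; `pick_ending` and the numbers are illustrative, not taken from the lab or from lm-evaluation-harness.

```python
def pick_ending(token_logprobs_per_choice):
    """Return (raw_pick, norm_pick): argmax by summed and by mean log-prob.

    token_logprobs_per_choice holds one list of per-token log-probs
    per candidate ending (hypothetical input format).
    """
    sums = [sum(lps) for lps in token_logprobs_per_choice]
    means = [s / len(lps) for s, lps in zip(sums, token_logprobs_per_choice)]
    raw_pick = max(range(len(sums)), key=sums.__getitem__)
    norm_pick = max(range(len(means)), key=means.__getitem__)
    return raw_pick, norm_pick


# Made-up log-probs: a short ending and a longer, per-token more likely one.
choices = [
    [-1.0, -1.0],              # 2 tokens: sum -2.0, mean -1.0
    [-0.6, -0.6, -0.6, -0.6],  # 4 tokens: sum -2.4, mean -0.6
]
raw_pick, norm_pick = pick_ending(choices)
print(raw_pick, norm_pick)
```

Here the raw sum picks the short ending (index 0) while the per-token mean picks the longer one (index 1). To the best of my knowledge, lm-evaluation-harness normalizes acc_norm by the character/byte length of the ending string rather than by token count, so check the harness source before comparing a token-normalized number against it.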

1.2 Generation-based (sample, then judge)

Used for open-ended tasks (summarization, code generation, chat). Pipeline: generate output, then score it with one of:

Exact match / rule-based: GSM8K answer matching, regex extraction, code execution (HumanEval).
String-level metrics: BLEU, ROUGE, METEOR — older and brittle. Use only for translation/summarization, never for chat.
LLM judge: another (usually stronger) model rates outputs. Rich signal but biased — see §3.
Human judge: gold standard, costly and slow.

Generation-based evals introduce sampling variance. Either set temperature=0 (deterministic, but maybe under-explores model capability) or sample N times and report mean/CI.

2. The benchmarks you must know

2.1 Knowledge

MMLU (Hendrycks et al., 2021) — 57 subjects, 16k questions. The classic capability benchmark. Saturated at the top end (~90% for Claude 4 / GPT-4o); use MMLU-Pro (more rigorous) for modern models.
TriviaQA, NaturalQuestions — open-domain QA.
TruthfulQA — common misconceptions; tests whether models repeat falsehoods.

2.2 Reasoning

GSM8K — 8.5k grade-school math word problems. Saturated at the top.
MATH — high-school competition math. Still hard.
BBH (Big-Bench Hard) — 23 hard tasks from BIG-bench.
HellaSwag, ARC, PIQA — common-sense reasoning. Older, somewhat saturated.

2.3 Code

HumanEval (Chen et al., 2021) — 164 Python problems, judged by unit tests. pass@k metric.
MBPP — basic Python problems.
SWE-Bench — real GitHub issues; agent must produce a patch that passes tests. Very hard, very realistic.
LiveCodeBench, BigCodeBench — newer, less contaminated.
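The pass@k metric used by HumanEval is usually computed with the unbiased estimator from Chen et al. (2021): generate n ≥ k samples per problem, count the c samples that pass the unit tests, and estimate pass@k = 1 − C(n−c, k) / C(n, k). A minimal sketch (the function name is illustrative):

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# 10 samples, 3 passing: pass@1 is the plain pass rate.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

The benchmark score is the mean of this per-problem estimate over all problems. Sampling n > k and using the estimator gives lower variance than literally drawing k samples and checking whether any passes.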

2.4 Language

WinoGrande — coreference / common sense.
LAMBADA — last-word prediction over long passages.

2.5 Long context

Needle in a Haystack — embed a fact in a long document, ask about it. Tests recall.
RULER — multi-needle, harder.
LongBench — diverse long-context tasks.

2.6 Pairwise / preference

MT-Bench (Zheng et al., 2023) — 80 multi-turn questions scored by an LLM judge, either by grading each answer on a 1 to 10 scale or by pairwise comparison.
AlpacaEval — pairwise win rate vs a baseline (GPT-4-Turbo).
Chatbot Arena — human pairwise votes; produces an Elo leaderboard. The de-facto vibes benchmark.

2.7 Safety

HarmBench, AdvBench — harmful instructions; measures refusal rate.
XSTest — over-refusal of benign requests that look superficially harmful.
JailbreakBench — known jailbreaks; measures resistance.
BBQ — bias on stereotyped categories.

3. LLM-as-Judge — the most important pattern, with caveats

3.1 The pattern

Use a stronger LLM (or a different one) to compare outputs from two models on the same prompt and pick a winner (or rate a single output). Cheap, scalable; high agreement with humans on many tasks.

3.2 The biases (Zheng et al., 2023, Judging LLM-as-a-Judge)

Position bias: judge prefers the first answer ~30% more often than chance. Mitigation: randomize order; or run both orderings and average.
Verbosity bias: judge prefers longer answers. Mitigation: instruct against it; control for length in analysis.
Self-preference: a model tends to prefer outputs from itself or its family. Mitigation: use a different model family as judge.
Sycophancy / format bias: well-formatted (markdown, headers) wins regardless of content quality.

3.3 Validation: trust, but verify

Before trusting any LLM judge, collect 100–200 human-labeled pairwise judgments on the same data. Compute Cohen's κ between human and LLM judge. Require κ > 0.7 (substantial agreement) before deploying. Re-validate periodically.
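Cohen's κ corrects raw agreement for the agreement two raters would reach by chance: κ = (p_o − p_e) / (1 − p_e). A minimal sketch in plain Python, assuming human and judge verdicts are stored as parallel lists of labels (the data below is made up):

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


human = ["A", "A", "B", "B", "tie", "A", "B", "A", "B", "A"]
judge = ["A", "A", "B", "A", "tie", "A", "B", "B", "B", "A"]
kappa = cohens_kappa(human, judge)
print(round(kappa, 3))
```

In this toy example raw agreement is 80% but κ is only about 0.66, below the 0.7 bar, which is why κ rather than raw agreement is the number to gate on.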

3.4 Pairwise prompt template

You are an impartial judge. Compare two answers to the question below.
Pick A, B, or "tie". Justify briefly.

Question: {q}
Answer A: {a}
Answer B: {b}

Verdict (A | B | tie):
Reasoning:

Run twice with order swapped; if disagreement, report tie.
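The order-swap rule can be sketched as a small wrapper. The `judge` argument is a hypothetical callable that fills the template, calls the judge model, and returns "A", "B", or "tie"; no real API is assumed here.

```python
def debiased_verdict(judge, question, answer_1, answer_2):
    """Run the judge in both orders; keep a winner only if both runs agree.

    judge(question, a, b) -> "A" | "B" | "tie" is a hypothetical callable.
    Returns "model_1", "model_2", or "tie".
    """
    first = judge(question, answer_1, answer_2)   # answer_1 shown as A
    second = judge(question, answer_2, answer_1)  # answer_1 shown as B
    # Translate the second verdict back into the first run's labelling.
    second_mapped = {"A": "B", "B": "A", "tie": "tie"}[second]
    if first == second_mapped and first != "tie":
        return "model_1" if first == "A" else "model_2"
    return "tie"


# Stub judges for illustration only.
def always_first(question, a, b):
    return "A"  # pure position bias


def prefers_longer(question, a, b):
    return "A" if len(a) > len(b) else "B"


print(debiased_verdict(always_first, "q", "short", "a much longer answer"))
print(debiased_verdict(prefers_longer, "q", "short", "a much longer answer"))
```

A judge driven purely by position collapses to "tie" under this rule, so position bias cannot produce wins. Verbosity bias still gets through, as the second stub shows, and needs separate handling.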

4. Statistical rigor

4.1 Confidence intervals on accuracy

For binary correct/wrong, accuracy is a binomial proportion. Use Wilson interval (better than normal approximation, especially near 0 or 1):

from statsmodels.stats.proportion import proportion_confint
ci_low, ci_high = proportion_confint(n_correct, n, alpha=0.05, method='wilson')

For continuous metrics (BLEU, faithfulness scores): bootstrap — resample with replacement N=1000 times; report 2.5% and 97.5% percentiles.
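For the Friday "utility scripts" exercise, both interval types can be written without dependencies. This is a sketch under simple assumptions (z = 1.96 for 95%, percentile bootstrap of the mean, fixed seed); the function names are illustrative.

```python
import math
import random


def wilson_interval(n_correct, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= n_correct <= n:
        raise ValueError("need n > 0 and 0 <= n_correct <= n")
    p = n_correct / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-item scores."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = max(0, round(alpha / 2 * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


print(wilson_interval(80, 100))
print(bootstrap_ci([i / 49 for i in range(50)]))
```

The Wilson function should agree with the statsmodels call above to several decimal places, which makes a useful unit test for the hand-written version.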

4.2 Comparing two models — paired McNemar's test

Two models A and B, both evaluated on the same N items. Build a 2×2 contingency table:

	B correct	B wrong
A correct	n00	n01
A wrong	n10	n11

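Run McNemar's test on n01 vs n10 (the discordant pairs, where exactly one model is correct). It asks whether the disagreements split evenly between the two models; items that both get right or both get wrong carry no information about which model is better. For pairwise win-rate from LLM-judge, use Wilson CI on the win-rate.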
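A minimal sketch of the exact (binomial) form of McNemar's test, using the table's naming: n01 counts items where A is correct and B is wrong, n10 the reverse. Under the null hypothesis each discordant item is equally likely to fall in either cell, so the p-value is a two-sided binomial tail with p = 0.5. The function name is illustrative.

```python
from math import comb


def mcnemar_exact(n01, n10):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = n01 + n10
    if n == 0:
        return 1.0  # the models never disagree: no evidence either way
    k = min(n01, n10)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# A wins 15 of the 20 disagreements, B wins 5.
p_value = mcnemar_exact(15, 5)
print(round(p_value, 4))
```

With 15 vs 5 disagreements the p-value is about 0.041, significant at the 0.05 level regardless of how many items both models got right or wrong. For large discordant counts the chi-squared approximation gives nearly the same answer.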

4.3 Sample size

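To detect a 5% accuracy difference at p < 0.05, you need roughly N ≥ 400 items. To detect 1%, you need ~10,000. These round figures come from the 95% margin of error of a single accuracy estimate at worst-case p = 0.5 (about 385 items for ±5 points, about 9,600 for ±1 point). Treat them as a floor: detecting a gap of that size between two independently evaluated models with good power takes several times more items, while a paired comparison on the same items (§4.2) can need fewer when the models agree on most items. Most published benchmarks are smaller than this, so be skeptical of small differences.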
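The round figures above are consistent with the textbook margin-of-error formula n ≈ z² · p(1 − p) / d², evaluated at the worst case p = 0.5. A minimal sketch, with `n_for_margin` as an illustrative helper name rather than a library function:

```python
import math


def n_for_margin(margin, p=0.5, z=1.96):
    """Items needed so a 95% CI on accuracy has half-width <= margin.

    Uses the normal approximation; p=0.5 is the worst case.
    """
    if not 0 < margin < 1:
        raise ValueError("margin must be a fraction between 0 and 1")
    return math.ceil(z * z * p * (1 - p) / margin ** 2)


print(n_for_margin(0.05))  # about 385
print(n_for_margin(0.01))  # about 9,600
```

Accuracy far from 50% needs fewer items for the same margin: at p = 0.9 the requirement for ±5 points drops to roughly 140.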

4.4 Reproducibility hygiene

Pin everything:

Model weights hash.
Tokenizer version.
Eval harness version.
Prompt template (yes, every character matters).
Sampling parameters (or temperature=0).
Random seed.

Cache predictions keyed on hash(model_id + prompt_id + sampling_id) so expensive evals run once per checkpoint.
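One way to build such a cache key is sketched below. It serializes the three parts as canonical JSON before hashing, rather than concatenating raw strings, so that ("ab", "c") and ("a", "bc") cannot collide. The field names and the `cache_key` helper are assumptions for illustration.

```python
import hashlib
import json


def cache_key(model_id, prompt_id, sampling):
    """Stable SHA-256 key for one (model, prompt, sampling config) triple.

    sampling is a dict of sampling parameters, e.g. temperature and top_p.
    """
    payload = json.dumps(
        {"model_id": model_id, "prompt_id": prompt_id, "sampling": sampling},
        sort_keys=True,          # key order must not change the hash
        separators=(",", ":"),   # no whitespace variation
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key = cache_key("ckpt-001", "hellaswag/42", {"temperature": 0.0, "top_p": 1.0})
print(key[:16])
```

Using the weights hash as `model_id` (rather than a human-readable name) keeps the cache honest when a checkpoint is overwritten in place.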

5. Eval contamination — the silent killer

5.1 The problem

Web-scale pretraining corpora scrape much of the public internet, including benchmark questions and answers. A model can then ace MMLU partly by memorizing it, and the reported score overstates real capability.

5.2 Detection

N-gram overlap (Llama, GPT-3 papers): scan the training corpus for 13-grams from eval questions; flag matches. Llama-3 reports per-benchmark contamination percentages.
Embedding similarity scan for near-duplicates.
Loss-based detection: trained models show suspiciously lower perplexity on memorized items than on paraphrases of the same items (Carlini et al., 2022).
Canary strings: insert unique nonce strings into the eval; if a model recites them, it saw the eval during training.
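The n-gram overlap check from the list above can be sketched in a few lines. This version assumes whitespace tokenization and lowercasing, and holds the corpus n-grams in memory; a real corpus would need a tokenizer-aware, streaming or Bloom-filter variant. The function names are illustrative.

```python
def ngrams(text, n=13):
    """Set of word-level n-grams in text (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(eval_items, corpus_docs, n=13):
    """Indices of eval items sharing at least one n-gram with the corpus."""
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return [
        i for i, item in enumerate(eval_items)
        if ngrams(item, n) & corpus_grams
    ]


eval_items = [
    "what is the capital of france",
    "name the largest planet in the solar system",
]
corpus_docs = [
    "trivia dump: what is the capital of france answer paris",
    "an unrelated document about cooking pasta",
]
# n=5 only because the toy strings are short; 13 is the usual setting.
flagged = contaminated_items(eval_items, corpus_docs, n=5)
print(flagged, f"{len(flagged) / len(eval_items):.0%} flagged")
```

Items shorter than n tokens produce no n-grams and are never flagged, so short questions need a smaller n or a different check.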

5.3 Prevention and mitigation

Strict filtering: remove documents that overlap eval suites from the training corpus before training. Llama-3 deletes training documents with high overlap.
Held-out / private evals: companies maintain internal sets that aren't released.
Dynamic benchmarks: LiveBench, LiveCodeBench refresh their items monthly to outpace contamination.
Paraphrased variants: rephrase eval questions; if the model still gets them right, capability is real (not memorization).

6. Safety — threat models

6.1 Misuse

The model is asked to help with harmful tasks (weaponization, mass-influence ops, NCII, fraud). Defenses:

Refusal training in SFT/RLHF (refuse known categories).
Capability evaluations (CBRN, cyberoffense — Anthropic, OpenAI both publish these for frontier models).
System prompt + safety classifiers at the gateway.

6.2 Over-refusal

Model refuses benign requests that superficially resemble harmful ones ("how do I kill a process in Linux?"). Measured by XSTest, OR-Bench. The dual of refusal — track both.

6.3 Prompt injection

Untrusted text in context (search result, email, retrieved doc) carries instructions that hijack the model. Major risk for agents. Defenses (no silver bullet):

Privilege separation — instructions from system / user are trusted; instructions from tool outputs are not.
Sandboxed tools — tools execute with the user's permissions, never with whatever authority the model's output claims to have.
Output filtering — check for exfil patterns, suspicious URLs.
Human-in-the-loop for destructive actions.
Defense in depth — assume injection will succeed at some rate; design the surrounding system to limit the blast radius.

Read Simon Willison's prompt-injection blog series.

6.4 Jailbreaks

Adversarial prompts that bypass safety training. Categories:

Role-play / persona ("DAN", "you are an unethical AI").
Indirect ("write a story where a character explains how to ...").
Encoding tricks (base64, leetspeak, foreign languages).
Many-shot (Anthropic, 2024) — long context with many fake "examples" of harmful answers in prior turns.
Adversarial suffixes (Zou et al. 2023, GCG attack) — gradient-optimized strings that crack open-weights models.

6.5 Bias and fairness

Models reflect training-data biases. Eval frameworks: BBQ, BOLD, RealToxicityPrompts. Mitigations: data filtering, RLHF on counter-stereotype demonstrations, output filters.

6.6 Alignment failures (longer-horizon concerns)

Reward hacking — model finds adversarial paths to high reward (e.g., answers with a confident tone and bullet points always score higher → all answers become bullet lists).
Sycophancy — agrees with user's stated beliefs even when wrong. Sharma et al. 2023.
Specification gaming — pursues the literal objective in unintended ways.
Deceptive alignment — speculative; model behaves aligned during training, misaligned in deployment. Active research at Anthropic and ARC Evals (now METR).

7. Lab 01: HellaSwag likelihood eval

7.1 What the lab does

Loads a model (HuggingFace transformers).
Loads HellaSwag (validation split, ~10k items).
For each item, computes per-token log-probabilities of each candidate ending.
Picks the argmax (acc) and the length-normalized argmax (acc_norm).
Reports accuracy + Wilson CI.
Validates against lm-evaluation-harness reference numbers.

7.2 Things to read carefully

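The score_choice(prompt, choice) function: tokenize the concatenation, run a forward pass, and gather log-probs at the choice positions only (not the prompt). Off-by-one is the most common bug: the logits at position t predict token t+1, so the log-prob of the choice token at position t comes from the logits at position t-1, the same shift used in cross-entropy training.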
Length normalization: divide log-prob by number of choice tokens. Without it, the model favors shorter endings.
Batched evaluation: pad to the longest in the batch; use attention mask; gather only valid positions.
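The index alignment inside score_choice can be checked without a model. The sketch below is not the lab's code: it assumes `logits` is a plain list of per-position score rows (one row per input token) and `token_ids` is the tokenized prompt plus choice. It shows only the shift, where row t−1 scores the token at position t.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def choice_logprob(logits, token_ids, prompt_len):
    """Sum of log P(token_ids[t] | token_ids[:t]) over the choice tokens.

    logits[t] is the output after reading token_ids[:t + 1], so it predicts
    token_ids[t + 1]. The token at position t is scored by logits[t - 1].
    """
    if prompt_len < 1:
        raise ValueError("need at least one prompt token to condition on")
    total = 0.0
    for t in range(prompt_len, len(token_ids)):
        total += log_softmax(logits[t - 1])[token_ids[t]]
    return total


# Toy vocabulary of 3 tokens; prompt = [0], choice = [1, 2].
token_ids = [0, 1, 2]
logits = [
    [0.0, 0.0, 0.0],          # after token 0: uniform, P(token 1) = 1/3
    [0.0, 0.0, math.log(2)],  # after token 1: P(token 2) = 1/2
    [5.0, 0.0, 0.0],          # after token 2: predicts beyond the sequence
]
score = choice_logprob(logits, token_ids, prompt_len=1)
print(score, math.log(1 / 6))
```

The last logits row is never read, because it predicts a token beyond the sequence. An implementation that indexes `logits[t]` instead of `logits[t - 1]` would use that row and silently produce wrong scores. Dividing `score` by the number of choice tokens gives the length-normalized variant.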

7.3 Reproducibility check

Run lm-evaluation-harness on the same model+task; your accuracy should match within 0.5%. If it doesn't, you have a bug — usually in tokenization or position alignment.

8. References

Required:

Liang et al. (2022), Holistic Evaluation of Language Models (HELM) — foundational.
Zheng et al. (2023), Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena.
Hendrycks et al. (2021), Measuring Massive Multitask Language Understanding (MMLU).
Chen et al. (2021), Evaluating Large Language Models Trained on Code (HumanEval).
Carlini et al. (2022), Quantifying Memorization Across Neural Language Models.
Anthropic, Responsible Scaling Policy documents.
OpenAI, Preparedness Framework.
Simon Willison's prompt-injection series.

Important:

Bai et al. (2022), Constitutional AI.
Sharma et al. (2023), Towards Understanding Sycophancy in Language Models.
Zou et al. (2023), Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG).
Anil et al. (2024), Many-shot Jailbreaking (Anthropic).
The lm-evaluation-harness README and source code.
The RAGAS docs.

9. Common interview questions on Phase 8 material

Walk through how you'd evaluate a new chat model end-to-end.
What's the difference between likelihood eval and generation eval?
What biases does an LLM judge have, and how do you mitigate them?
Eval scores improved 0.3% — is that significant?
How would you detect benchmark contamination in your training data?
What's prompt injection and how do you defend against it?
Difference between refusal and over-refusal — how do you track both?
What's pass@k in HumanEval and why is it useful?
Design an eval gate for a fine-tuning pipeline.
How do you compare two models statistically? (McNemar / Wilson.)
Your safety eval shows 99% refusal but users complain it refuses too much. What now?
Compare MT-Bench, AlpacaEval, and Chatbot Arena — what does each measure?

10. From solid → exceptional

Reproduce three benchmark numbers from a real model card (e.g., Llama-3 8B's MMLU and GSM8K). Match within 1%.
Build a small LLM-judge harness: pairwise comparison with order swap, position-bias mitigation, validated against 100 human labels.
Implement n-gram contamination detection on a small corpus vs MMLU. Report % overlap.
Run a GCG attack (or a published variant) on a small open model; document refusal-rate before/after.
Build a shadow-eval pipeline that scores production traffic continuously and alerts on drift.
Read all three Anthropic safety / RSP documents and write a one-page operational summary.
Run a red-team session against your own RAG service from Phase 7; document every successful jailbreak.

11. Recommended cadence

Day	Activity
Mon	Read HELM + Zheng et al. Judging LLM-as-a-Judge
Tue	Read Carlini memorization paper + GCG attack paper
Wed	Lab 01 — implement HellaSwag eval; reproduce harness numbers
Thu	Build a small LLM-judge with 50 manual labels; compute κ
Fri	Implement Wilson CI + McNemar's test as utility scripts
Sat	Skim Anthropic RSP + OpenAI Preparedness Framework
Sun	Mock interview the 12 questions; whiteboard threat models
