Phase 8 — Evaluation & Safety

Difficulty: ⭐⭐⭐⭐☆ | Estimated Time: 1.5 weeks
Roles supported: Model Evaluation Engineer, Safety Engineer, Research Engineer (eval is a research-engineering specialty).

Why This Phase Exists

Frontier labs invest heavily in evaluation infrastructure, because you cannot ship a model you cannot measure, and you cannot iterate without a regression bar. "Model Evaluation Engineer" is now a dedicated job title at Anthropic, OpenAI, and Cohere.

By the end you will have built a real eval harness, an LLM-as-judge with bias controls, and a red-team report.

Concepts

Benchmarks: MMLU, HellaSwag, ARC, GSM8K, MATH, HumanEval, MBPP, IFEval, MT-Bench, AlpacaEval
Likelihood-based eval (multiple choice via logprobs) vs generation eval
Few-shot prompting & chain-of-thought
Perplexity — and why it's a poor proxy for downstream quality
LLM-as-judge: bias (position, length, self-bias), mitigations (pairwise + swap)
RAGAS: faithfulness, answer relevance, context precision/recall
HELM concepts: scenarios + metrics matrix
Red-teaming: jailbreak taxonomy (DAN, prompt injection, encoding attacks)
Safety classifiers: input/output filters, refusal rates
Eval-in-production: drift detection, A/B testing, shadow deploys
Statistical significance: bootstrap CIs over eval scores
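
The last concept, bootstrap CIs over eval scores, can be sketched in a few lines. This is a minimal percentile-bootstrap sketch, assuming per-example scores are available as a list of floats (for example 1.0 for correct, 0.0 for incorrect); the function name `bootstrap_ci` and its defaults are illustrative, not taken from any particular harness.

```python
import random
from typing import List, Tuple


def bootstrap_ci(scores: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap over examples."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means = []
    for _ in range(n_resamples):
        # Resample the eval set with replacement and recompute the metric.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return sum(scores) / n, means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Hypothetical run: 70 correct out of 100 examples.
    mean, lower, upper = bootstrap_ci([1.0] * 70 + [0.0] * 30)
    print(f"accuracy={mean:.3f}  95% CI=[{lower:.3f}, {upper:.3f}]")
```

With only 100 examples the interval is wide (several points either side of the mean), which is a useful reminder that small score differences between models may not be significant.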

Labs

Lab 01 — Build an Eval Harness (lm-eval-harness Style)

Field	Value
Goal	Implement a working eval harness covering 3 benchmarks; reproduce published numbers within 1 point.
Concepts	Likelihood scoring, prompt formatting, batch eval, result caching.
Steps	1) Implement MMLU (likelihood-based MCQ via per-option logprobs). 2) Implement HellaSwag (same structure). 3) Implement GSM8K (generation + answer extraction with regex). 4) Run on an 8B base model (e.g., Llama-3-8B, so the results can be checked against its published scores). 5) Compare to published HF leaderboard numbers and report bootstrap CIs.
Stack	`transformers`, `datasets`, `vllm` (optional, for speed)
Datasets	`cais/mmlu`, `Rowan/hellaswag`, `openai/gsm8k`
Output	A reproducible CLI: `eval.py --model <hf-id> --tasks mmlu,hellaswag,gsm8k`.
How to Test	Reproduce Llama-3-8B published scores within ±1 point.
Talking Points	Why MMLU uses likelihood scoring (no generation noise). Why GSM8K needs answer extraction. Why subtle prompt-format changes can shift scores by 5+ points.
Resume Bullet	"Built an LLM evaluation harness covering MMLU/HellaSwag/GSM8K (likelihood + generation modes); reproduced published Llama-3-8B benchmark numbers within ±1 point with bootstrap CIs."
Extensions	Contribute a new task to `EleutherAI/lm-evaluation-harness`.
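
The two scoring modes in the steps above can be sketched without loading a model. This is a minimal sketch, assuming the per-token log-probabilities for each answer option have already been obtained from the model; `pick_option` and `extract_gsm8k_answer` are hypothetical helper names. The `#### <number>` marker is the format of GSM8K reference answers; falling back to the last number in the text is a common heuristic for model generations, not the only possible one.

```python
import re
from typing import List, Optional


def pick_option(option_logprobs: List[List[float]],
                length_normalize: bool = False) -> int:
    """Return the index of the option the model finds most likely.

    option_logprobs[i] holds the token log-probs of option i given the prompt.
    """
    scores = []
    for token_lps in option_logprobs:
        total = sum(token_lps)
        # Optional per-token mean, so long options are not penalized for length.
        scores.append(total / max(len(token_lps), 1) if length_normalize else total)
    return max(range(len(scores)), key=scores.__getitem__)


_NUMBER = r"-?\d[\d,]*(?:\.\d+)?"


def extract_gsm8k_answer(text: str) -> Optional[str]:
    """Extract the final numeric answer as a string with commas removed."""
    marker = re.search(r"####\s*(" + _NUMBER + ")", text)
    if marker:
        raw = marker.group(1)
    else:
        # Heuristic fallback: take the last number that appears in the text.
        numbers = re.findall(_NUMBER, text)
        if not numbers:
            return None
        raw = numbers[-1]
    return raw.replace(",", "")


if __name__ == "__main__":
    print(pick_option([[-1.0], [-0.2], [-3.0], [-2.5]]))       # single-token options A-D
    print(extract_gsm8k_answer("She sells 9 eggs.\n#### 18"))  # reference-style answer
    print(extract_gsm8k_answer("So the total is 1,234 dollars."))
```

Whether to length-normalize is one of the prompt-and-scoring choices that moves scores: with multi-token options (as in HellaSwag), summed and normalized log-probs can pick different answers, so record which one you used when comparing to published numbers.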

Lab 02 — LLM-as-Judge with Bias Controls

Field	Value
Goal	Build an MT-Bench-style judge; quantify and mitigate position/length bias.
Concepts	Pairwise comparison, swap-position averaging, length normalization, self-bias.
Steps	1) Pick 30 prompts; generate responses from 3 models. 2) Use a strong judge (GPT-4 / Claude) for pairwise comparison. 3) Compute Elo ratings. 4) Quantify position bias (how often does the first response win?). 5) Mitigate via swap-and-average.
Stack	OpenAI / Anthropic API; or local Llama-3-70B via Together
Datasets	MT-Bench prompts (free)
Output	An Elo leaderboard + a bias-mitigation report.
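How to Test	Report the first-position win rate in the raw judgments, then compare raw and swap-averaged scores and rankings; the delta between them is your position-bias measurement.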
Talking Points	Why LLM judges are biased. When to use them anyway. Length-bias remediation.
Resume Bullet	"Implemented an MT-Bench-style pairwise LLM-as-judge harness with swap-position bias mitigation, producing Elo rankings across 3 candidate models with bootstrap confidence intervals."
Extensions	Add ChatBot-Arena-style crowd-eval simulation; correlate with human ratings.
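
Steps 3 to 5 above can be sketched as follows. This is a minimal sketch under stated assumptions: `judge(prompt, first, second)` is a hypothetical callable that wraps your judge model and returns 1.0 if the first-presented response wins, 0.0 if the second wins, and 0.5 for a tie. The Elo constants (starting rating 1000, K = 16) are illustrative choices.

```python
from typing import Callable, Dict, List, Tuple

Judge = Callable[[str, str, str], float]


def swap_averaged_score(judge: Judge, prompt: str, resp_a: str, resp_b: str) -> float:
    """Score for response A in [0, 1], averaged over both presentation orders."""
    a_first = judge(prompt, resp_a, resp_b)   # A shown first
    b_first = judge(prompt, resp_b, resp_a)   # B shown first
    return 0.5 * (a_first + (1.0 - b_first))


def first_position_win_rate(judge: Judge, items: List[Tuple[str, str, str]]) -> float:
    """Fraction of judgments won by whichever response was shown first.

    Each pair is judged in both orders, so an unbiased judge gives about 0.5.
    """
    if not items:
        raise ValueError("items must be non-empty")
    wins = 0.0
    for prompt, a, b in items:
        wins += judge(prompt, a, b) + judge(prompt, b, a)
    return wins / (2 * len(items))


def update_elo(ratings: Dict[str, float], model_a: str, model_b: str,
               score_a: float, k: float = 16.0) -> None:
    """Apply one Elo update in place; score_a is A's (swap-averaged) result."""
    ra = ratings.get(model_a, 1000.0)
    rb = ratings.get(model_b, 1000.0)
    expected_a = 1.0 / (1.0 + 10 ** ((rb - ra) / 400.0))
    ratings[model_a] = ra + k * (score_a - expected_a)
    ratings[model_b] = rb - k * (score_a - expected_a)


if __name__ == "__main__":
    always_first: Judge = lambda prompt, first, second: 1.0  # maximally position-biased
    items = [("q1", "answer A", "answer B")]
    print(first_position_win_rate(always_first, items))
    print(swap_averaged_score(always_first, "q1", "answer A", "answer B"))
```

A judge that always prefers the first response has a first-position win rate of 1.0, yet its swap-averaged score collapses to a tie (0.5), which is the behavior you want from the mitigation. Elo depends on match order, so shuffling the comparisons and bootstrapping over them gives more stable rankings.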

Lab 03 — RAG Evaluation with RAGAS

Field	Value
Goal	Plug RAGAS into the Phase 7 RAG pipeline; report 4-axis quality metrics.
Concepts	Faithfulness, answer relevance, context precision, context recall.
Steps	1) Build a 50-question eval set for your Phase 7 corpus. 2) Run pipeline → record (query, contexts, answer, ground_truth). 3) Run RAGAS metrics. 4) Tune chunking / retrieval and observe metric movement.
Stack	`ragas`, your Phase 7 RAG service
Output	A 4×N metrics table + an ablation report (chunking size, k, re-ranker on/off).
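How to Test	Sanity-check the metrics against known interventions: faithfulness should generally drop as you raise generation temperature, and context recall should rise (or at least not fall) as you increase k.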
Talking Points	Why faithfulness ≠ answer relevance. Why context precision matters for cost. The eval-set-creation challenge.
Resume Bullet	"Integrated RAGAS faithfulness/relevance/precision/recall metrics into a production RAG pipeline; ran 6 ablations (chunking × top-k × rerank) producing a quantified design-decision table."
Extensions	Add LLM-judge calibration (compare with human ratings on 30 examples).
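
Before wiring in RAGAS, it can help to sanity-check the retrieval side with a label-based proxy. The sketch below is not the RAGAS implementation (RAGAS scores these metrics from text using an LLM); it assumes you have hand-labeled the relevant chunk IDs for each question, and the function names and IDs are hypothetical.

```python
from typing import List, Set


def context_recall_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the labeled relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must be non-empty")
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def context_precision_at_k(retrieved_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the top-k results that are labeled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for cid in top_k if cid in relevant_ids) / len(top_k)


if __name__ == "__main__":
    # Hypothetical ranked retrieval result and gold labels for one question.
    retrieved = ["c3", "c9", "c1", "c7", "c2"]
    relevant = {"c1", "c2"}
    for k in (1, 3, 5):
        recall = context_recall_at_k(retrieved, relevant, k)
        precision = context_precision_at_k(retrieved, relevant, k)
        print(f"k={k}  recall={recall:.2f}  precision={precision:.2f}")
```

Recall can only stay flat or rise as k grows, while precision often falls. That is the cost tradeoff behind context precision: every irrelevant chunk still costs prompt tokens.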

Lab 04 — Red-Teaming & Safety Classifiers

Field	Value
Goal	Run a structured red-team on a deployed model; build an input/output safety filter.
Concepts	Jailbreak taxonomy, prompt injection, attack-success-rate, refusal calibration.
Steps	1) Curate 50 adversarial prompts across 5 categories. 2) Measure attack success rate (ASR) against the base model and against the SFT model. 3) Add an input classifier (Llama-Guard or a custom small classifier). 4) Measure the ASR drop. 5) Measure the over-refusal rate on benign prompts with the filter enabled.
Stack	`meta-llama/Llama-Guard-3-8B`, your fine-tuned model from Phase 6
Datasets	AdvBench, your own
Output	A red-team report (categorized attack examples, ASR before/after filter).
How to Test	ASR meaningfully drops with the safety filter; over-refusal rate stays acceptable.
Talking Points	The over-refusal problem (false positives degrade utility). Why layering input/output filters on top of the model is more robust than relying on training-time refusal alone.
Resume Bullet	"Conducted structured red-team across 5 jailbreak categories (50 prompts); reduced attack-success rate from 64% to 11% by adding a Llama-Guard input classifier with quantified over-refusal tradeoff."
Extensions	Train a custom small safety classifier on collected attack data.
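
The before/after measurement in this lab reduces to two rates. The sketch below assumes three hypothetical callables that you would supply: `respond(prompt)` calls the model under test, `is_harmful(prompt, response)` is your attack-success judgment (manual labels or a classifier), and `flags_unsafe(prompt)` wraps the input classifier (for example a call into Llama-Guard). None of these names come from a real library.

```python
from typing import Callable, List

REFUSAL = "I can't help with that."


def attack_success_rate(prompts: List[str],
                        respond: Callable[[str], str],
                        is_harmful: Callable[[str, str], bool]) -> float:
    """Fraction of adversarial prompts that elicit a harmful response."""
    if not prompts:
        raise ValueError("prompts must be non-empty")
    successes = sum(1 for p in prompts if is_harmful(p, respond(p)))
    return successes / len(prompts)


def with_input_filter(respond: Callable[[str], str],
                      flags_unsafe: Callable[[str], bool]) -> Callable[[str], str]:
    """Wrap a model so flagged prompts get a canned refusal instead of a response."""
    def guarded(prompt: str) -> str:
        return REFUSAL if flags_unsafe(prompt) else respond(prompt)
    return guarded


def over_refusal_rate(benign_prompts: List[str],
                      flags_unsafe: Callable[[str], bool]) -> float:
    """Fraction of benign prompts the filter wrongly blocks (false positives)."""
    if not benign_prompts:
        raise ValueError("benign_prompts must be non-empty")
    return sum(1 for p in benign_prompts if flags_unsafe(p)) / len(benign_prompts)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; replace with real components.
    attacks = ["jailbreak A", "jailbreak B", "encoded attack", "plain request"]
    respond = lambda p: REFUSAL if p == "plain request" else "COMPLIED"
    is_harmful = lambda p, r: r == "COMPLIED"
    flags_unsafe = lambda p: "jailbreak" in p
    print(attack_success_rate(attacks, respond, is_harmful))
    print(attack_success_rate(attacks, with_input_filter(respond, flags_unsafe), is_harmful))
    print(over_refusal_rate(["jailbreak my own phone", "hello", "kill a process"], flags_unsafe))
```

Reporting both numbers side by side matters: a filter that blocks everything drives ASR to zero while making the model useless, so the ASR drop is only meaningful next to the over-refusal rate.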

Deliverables Checklist

Eval harness reproducing leaderboard numbers
LLM-as-judge with bias mitigation
RAGAS evaluation of Phase 7 system
Red-team report + safety filter

Interview Relevance

"How would you set up evals for an LLM project?"
"What are the failure modes of LLM-as-judge?"
"How do you catch regressions in production?"
