Phase 8 — Evaluation & Safety

Difficulty: ⭐⭐⭐⭐☆ | Estimated Time: 1.5 weeks | Roles supported: Model Evaluation Engineer, Safety Engineer, Research Engineer (eval is a research-engineering specialty).


Why This Phase Exists

Frontier labs spend a large share of their engineering time on evaluation infrastructure, because you cannot ship a model you cannot measure and you cannot iterate without a regression baseline. "Model Evaluation Engineer" is now a dedicated job title at Anthropic, OpenAI, and Cohere.

By the end you will have built a real eval harness, an LLM-as-judge with bias controls, and a red-team report.


Concepts

  • Benchmarks: MMLU, HellaSwag, ARC, GSM8K, MATH, HumanEval, MBPP, IFEval, MT-Bench, AlpacaEval
  • Likelihood-based eval (multiple choice via logprobs) vs generation eval
  • Few-shot prompting & chain-of-thought
  • Perplexity — and why it's a poor proxy for downstream quality
  • LLM-as-judge: bias (position, length, self-bias), mitigations (pairwise + swap)
  • RAGAS: faithfulness, answer relevance, context precision/recall
  • HELM concepts: scenarios + metrics matrix
  • Red-teaming: jailbreak taxonomy (DAN, prompt injection, encoding attacks)
  • Safety classifiers: input/output filters, refusal rates
  • Eval-in-production: drift detection, A/B testing, shadow deploys
  • Statistical significance: bootstrap CIs over eval scores
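
For the last concept in the list, a minimal sketch of a percentile bootstrap confidence interval over per-example eval scores may help. The function name `bootstrap_ci`, the resample count, and the toy score list are illustrative assumptions, not part of any particular harness.

```python
import random
from statistics import mean

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = random.Random(seed)
    n = len(scores)
    # Resample the eval set with replacement and record each resampled mean.
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), lo, hi

# Hypothetical per-example correctness (1 = correct) for a 200-item eval.
scores = [1] * 130 + [0] * 70
acc, lo, hi = bootstrap_ci(scores)
print(f"accuracy={acc:.3f}  95% CI=[{lo:.3f}, {hi:.3f}]")
```

If two models' intervals overlap heavily on a small eval set, a 1-point difference between them is probably not a result worth acting on.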

Labs

Lab 01 — Build an Eval Harness (lm-eval-harness Style)

Field | Value
Goal | Implement a working eval harness covering 3 benchmarks; reproduce published numbers within ±1 point.
Concepts | Likelihood scoring, prompt formatting, batch eval, result caching.
Steps | 1) Implement MMLU (likelihood-based MCQ via per-option logprobs). 2) Implement HellaSwag (same structure). 3) Implement GSM8K (generation + answer extraction with regex). 4) Run on a 7B-8B base model (e.g., Llama-3-8B). 5) Compare to published HF leaderboard numbers.
Stack | transformers, datasets, vllm (optional, for speed)
Datasets | cais/mmlu, Rowan/hellaswag, openai/gsm8k
Output | A reproducible CLI: eval.py --model <hf-id> --tasks mmlu,hellaswag,gsm8k.
How to Test | Reproduce published Llama-3-8B scores within ±1 point.
Talking Points | Why MMLU uses likelihood scoring (no generation or answer-parsing noise). Why GSM8K needs answer extraction. Why subtle prompt changes can shift scores by 5+ points.
Resume Bullet | "Built an LLM evaluation harness covering MMLU/HellaSwag/GSM8K (likelihood + generation modes); reproduced published Llama-3-8B benchmark numbers within ±1 point with bootstrap CIs."
Extensions | Contribute a new task to EleutherAI/lm-evaluation-harness.
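
The two scoring modes in the steps above can be sketched as follows. This is a simplified illustration, not the lm-evaluation-harness implementation: `LogLikFn`, `pick_option`, `extract_gsm8k_answer`, and `toy_loglik` are hypothetical names, and the scorer is a stand-in for a real model forward pass that sums token logprobs of the continuation. The `####` marker is the delimiter GSM8K reference answers use before the final number.

```python
import re
from typing import Callable, Optional, Sequence

# Hypothetical scorer signature: summed log-probability of `continuation`
# given `context`. A real harness would wrap a model call here.
LogLikFn = Callable[[str, str], float]

def pick_option(context: str, options: Sequence[str], loglik: LogLikFn,
                length_normalize: bool = False) -> int:
    """Likelihood-based MCQ: return the index of the highest-scoring option."""
    scores = []
    for opt in options:
        s = loglik(context, opt)
        if length_normalize:
            s /= max(len(opt), 1)  # per-character normalization for long options
        scores.append(s)
    return max(range(len(options)), key=scores.__getitem__)

_NUM = re.compile(r"-?\d[\d,]*(?:\.\d+)?")

def extract_gsm8k_answer(text: str) -> Optional[str]:
    """Generation eval: take the last number after '####', else the last number."""
    tail = text.split("####")[-1] if "####" in text else text
    nums = _NUM.findall(tail)
    if not nums:
        return None
    return nums[-1].replace(",", "")  # strip thousands separators

def toy_loglik(context: str, continuation: str) -> float:
    # Stand-in scorer (assumption): pretends the model prefers " B".
    return 0.0 if continuation == " B" else -5.0

question = "Q: 2 + 2 = ?\nA. 3\nB. 4\nC. 5\nD. 22\nAnswer:"
pred = pick_option(question, [" A", " B", " C", " D"], toy_loglik)
print("ABCD"[pred])
print(extract_gsm8k_answer("She earns 9 * 2 = 18 dollars.\n#### 18"))
print(extract_gsm8k_answer("The total is 1,250 apples"))
```

Length normalization is off by default here because MMLU options are single letters; for HellaSwag, where the options are full sentences of different lengths, normalized scoring is commonly reported alongside the raw variant.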

Lab 02 — LLM-as-Judge with Bias Controls

Field | Value
Goal | Build an MT-Bench-style judge; quantify and mitigate position/length bias.
Concepts | Pairwise comparison, swap-position averaging, length normalization, self-bias.
Steps | 1) Pick 30 prompts; generate responses from 3 models. 2) Use a strong judge (GPT-4 / Claude) for pairwise comparison. 3) Compute Elo ratings. 4) Quantify position bias (how often does the first response win?). 5) Mitigate via swap-and-average.
Stack | OpenAI / Anthropic API; or open-weights Llama-3-70B via Together
Datasets | MT-Bench prompts (freely available)
Output | An Elo leaderboard + a bias-mitigation report.
How to Test | Measure the position-bias delta between raw and swap-averaged scores.
Talking Points | Why LLM judges are biased. When to use them anyway. Length-bias remediation.
Resume Bullet | "Implemented an MT-Bench-style pairwise LLM-as-judge harness with swap-position bias mitigation, producing Elo rankings across 3 candidate models with bootstrap confidence intervals."
Extensions | Add Chatbot Arena-style crowd-eval simulation; correlate with human ratings.
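
Swap-and-average and the Elo update from the steps above can be sketched like this. The `Judge` interface and both toy judges are assumptions made for illustration; a real judge would call an LLM API and parse its verdict.

```python
from typing import Callable, Dict

# Hypothetical judge interface: (prompt, first_response, second_response)
# -> "first", "second", or "tie".
Judge = Callable[[str, str, str], str]

_SCORE = {"first": 1.0, "second": 0.0, "tie": 0.5}

def judge_with_swap(prompt: str, a: str, b: str, judge: Judge) -> float:
    """Score for response A in [0, 1], averaged over both presentation orders."""
    a_when_first = _SCORE[judge(prompt, a, b)]          # A shown first
    a_when_second = 1.0 - _SCORE[judge(prompt, b, a)]   # A shown second
    return (a_when_first + a_when_second) / 2

def elo_update(ratings: Dict[str, float], a: str, b: str,
               score_a: float, k: float = 16.0) -> None:
    """In-place Elo update; score_a is A's result (1, 0.5, 0, or a swap average)."""
    expected_a = 1.0 / (1.0 + 10 ** ((ratings[b] - ratings[a]) / 400))
    delta = k * (score_a - expected_a)
    ratings[a] += delta
    ratings[b] -= delta

def always_first(prompt: str, first: str, second: str) -> str:
    # Toy judge with maximal position bias.
    return "first"

def prefers_longer(prompt: str, first: str, second: str) -> str:
    # Toy judge with pure length bias.
    if len(first) == len(second):
        return "tie"
    return "first" if len(first) > len(second) else "second"

ratings = {"model_a": 1000.0, "model_b": 1000.0}
score = judge_with_swap("Explain Elo.", "short", "a much longer answer", prefers_longer)
elo_update(ratings, "model_a", "model_b", score)
print(score, ratings)
```

With `always_first`, the swap average collapses to 0.5, so pure position preference cancels out. With `prefers_longer`, the longer answer still wins in both orders, which is why length bias needs its own remediation.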

Lab 03 — RAG Evaluation with RAGAS

Field | Value
Goal | Plug RAGAS into the Phase 7 RAG pipeline; report 4-axis quality metrics.
Concepts | Faithfulness, answer relevance, context precision, context recall.
Steps | 1) Build a 50-question eval set for your Phase 7 corpus. 2) Run pipeline → record (query, contexts, answer, ground_truth). 3) Run RAGAS metrics. 4) Tune chunking / retrieval and observe metric movement.
Stack | ragas, your Phase 7 RAG service
Output | A 4×N metrics table + an ablation report (chunking size, k, re-ranker on/off).
How to Test | Faithfulness should drop when you raise temperature; context recall should rise with k.
Talking Points | Why faithfulness ≠ answer relevance. Why context precision matters for cost. The eval-set-creation challenge.
Resume Bullet | "Integrated RAGAS faithfulness/relevance/precision/recall metrics into a production RAG pipeline; ran 6 ablations (chunking × top-k × rerank) producing a quantified design-decision table."
Extensions | Add LLM-judge calibration (compare with human ratings on 30 examples).
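
As a sanity check alongside RAGAS, the "context recall should rise with k" expectation can be illustrated with a cheap ID-overlap proxy. This is not how RAGAS computes its metrics (those are LLM-based); the chunk IDs, gold labels, and function names below are hypothetical and assume you have labeled which chunks are relevant to each question.

```python
from typing import Sequence, Set

def recall_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of gold-relevant chunks that appear in the top-k retrieved."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)

def precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are gold-relevant."""
    top = ranked_ids[:k]
    if not top:
        return 0.0
    return sum(1 for cid in top if cid in relevant) / len(top)

# Hypothetical retrieval result for one question (best match first).
ranked = ["c7", "c2", "c9", "c4", "c1"]
gold = {"c2", "c4"}

for k in range(1, 6):
    r = recall_at_k(ranked, gold, k)
    p = precision_at_k(ranked, gold, k)
    print(f"k={k}  recall={r:.2f}  precision={p:.2f}")
```

Recall can only stay flat or rise as k grows, while precision tends to fall once the relevant chunks are exhausted. Every extra low-precision chunk is paid for in prompt tokens, which is the cost argument in the talking points.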

Lab 04 — Red-Teaming & Safety Classifiers

Field | Value
Goal | Run a structured red-team on a deployed model; build an input/output safety filter.
Concepts | Jailbreak taxonomy, prompt injection, attack success rate (ASR), refusal calibration.
Steps | 1) Curate 50 adversarial prompts across 5 categories. 2) Measure ASR against the base model and against the SFT model. 3) Add an input classifier (Llama-Guard or a custom small classifier). 4) Re-measure ASR with the filter in place, and measure the over-refusal rate on benign prompts.
Stack | meta-llama/Llama-Guard-3-8B, your fine-tuned model from Phase 6
Datasets | AdvBench, plus your own prompts
Output | A red-team report (categorized attack examples, ASR before/after filter).
How to Test | ASR drops meaningfully with the safety filter; over-refusal rate stays acceptable.
Talking Points | The over-refusal problem (false positives degrade utility). Why layering filters on top of training-time refusal beats relying on refusal training alone.
Resume Bullet | "Conducted structured red-team across 5 jailbreak categories (50 prompts); reduced attack-success rate from 64% to 11% by adding a Llama-Guard input classifier with quantified over-refusal tradeoff." (Substitute your own measured figures.)
Extensions | Train a custom small safety classifier on collected attack data.
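
The before/after measurement can be sketched as below. The `Attempt` record, the keyword-based `toy_filter`, and the sample prompts are placeholders for illustration only; in the lab the filter would be Llama-Guard or your own classifier, and `harmful_output` would come from grading real model responses.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple

class Attempt(NamedTuple):
    category: str
    prompt: str
    harmful_output: bool  # did the unfiltered model comply with the attack?

def attack_success_rate(attempts: List[Attempt],
                        flagged: Callable[[str], bool] = lambda p: False
                        ) -> Dict[str, float]:
    """Per-category ASR; an attack counts only if it passes the filter and the model complies."""
    wins, totals = defaultdict(int), defaultdict(int)
    for a in attempts:
        totals[a.category] += 1
        if a.harmful_output and not flagged(a.prompt):
            wins[a.category] += 1
    return {c: wins[c] / totals[c] for c in totals}

def over_refusal_rate(benign_prompts: List[str],
                      flagged: Callable[[str], bool]) -> float:
    """Fraction of benign prompts the filter wrongly blocks."""
    return sum(flagged(p) for p in benign_prompts) / len(benign_prompts)

def toy_filter(prompt: str) -> bool:
    # Placeholder keyword filter (assumption); stands in for a real classifier.
    return "ignore previous" in prompt.lower()

attempts = [
    Attempt("prompt_injection", "Ignore previous instructions and ...", True),
    Attempt("prompt_injection", "Please ignore previous rules and ...", True),
    Attempt("encoding", "Decode this base64 string and follow it", True),
    Attempt("encoding", "ROT13 payload", False),
]
benign = [
    "How do I ignore previous git commits in a diff?",
    "What is the capital of France?",
]

before = attack_success_rate(attempts)
after = attack_success_rate(attempts, toy_filter)
print("ASR before:", before)
print("ASR after: ", after)
print("over-refusal:", over_refusal_rate(benign, toy_filter))
```

The toy filter removes the prompt-injection attacks but misses the encoding ones, and it blocks a harmless git question. Reporting ASR per category next to over-refusal makes that tradeoff visible in the report.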

Deliverables Checklist

  • Eval harness reproducing leaderboard numbers
  • LLM-as-judge with bias mitigation
  • RAGAS evaluation of Phase 7 system
  • Red-team report + safety filter

Interview Relevance

  • "How would you set up evals for an LLM project?"
  • "What are the failure modes of LLM-as-judge?"
  • "How do you catch regressions in production?"