Frontier labs spend a large share of their engineering time on evaluation infrastructure, because you cannot ship a model you cannot measure and you cannot iterate without a regression bar. "Model Evaluation Engineer" is now a dedicated job title at Anthropic, OpenAI, and Cohere.
By the end of this phase you will have built a working eval harness, an LLM-as-judge pipeline with bias controls, a RAG evaluation with ablations, and a red-team report.
Implement a working eval harness covering 3 benchmarks; reproduce published numbers within 1 point.
Concepts
Likelihood scoring, prompt formatting, batch eval, result caching.
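Likelihood scoring can be illustrated with a minimal sketch. It assumes the per-token log-probabilities of each answer option have already been obtained from the model; the function names and the sample numbers are hypothetical, and length normalization is shown only as an optional variant.

```python
from typing import List, Sequence


def option_score(token_logprobs: Sequence[float], length_normalize: bool = False) -> float:
    """Sum the option's token log-probabilities; optionally average per token."""
    total = sum(token_logprobs)
    if length_normalize and token_logprobs:
        return total / len(token_logprobs)
    return total


def pick_option(per_option_logprobs: List[Sequence[float]], length_normalize: bool = False) -> int:
    """Return the index of the highest-scoring answer option."""
    scores = [option_score(lp, length_normalize) for lp in per_option_logprobs]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical log-probabilities for options A-D of a single question.
example = [[-2.3], [-0.4], [-3.1], [-1.9]]
predicted_index = pick_option(example)
```

For single-token options (as in MMLU letter answers) the two modes agree; for multi-token continuations such as HellaSwag endings, the choice of normalization can change the reported score, so it is worth recording which one was used.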
Steps
1) Implement MMLU (likelihood-based MCQ via per-option logprobs). 2) Implement HellaSwag (same structure). 3) Implement GSM8K (generation + answer extraction with regex). 4) Run on a 7B-8B base model such as Llama-3-8B. 5) Compare to published Hugging Face leaderboard numbers, matching the few-shot count and prompt format of the published setup.
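The answer-extraction part of step 3 might look like the sketch below. It relies on GSM8K reference solutions ending with `#### <number>` and assumes, as a heuristic rather than a standard, that the last number in the generation is the model's final answer. All helper names are hypothetical.

```python
import re
from typing import Optional

# A number with optional sign, thousands separators, and decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def normalize(text: str) -> str:
    """Strip whitespace, currency signs, thousands separators, and a trailing period."""
    return text.strip().replace(",", "").replace("$", "").rstrip(".")


def extract_reference(answer_field: str) -> str:
    """Take the text after the final '####' marker in a GSM8K reference solution."""
    return normalize(answer_field.split("####")[-1])


def extract_prediction(generation: str) -> Optional[str]:
    """Heuristic: treat the last number in the generation as the final answer."""
    matches = _NUMBER.findall(generation)
    return normalize(matches[-1]) if matches else None


def is_correct(generation: str, answer_field: str) -> bool:
    """Compare numerically so that '72' and '72.0' count as the same answer."""
    prediction = extract_prediction(generation)
    if prediction is None:
        return False
    return float(prediction) == float(extract_reference(answer_field))
```

Small differences in extraction rules can move GSM8K accuracy by more than a point, so the regex and the normalization should be treated as part of the benchmark definition when comparing against published numbers.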
Stack
transformers, datasets, vllm (optional, for speed)
Datasets
cais/mmlu, Rowan/hellaswag, openai/gsm8k
Output
A reproducible CLI: eval.py --model <hf-id> --tasks mmlu,hellaswag,gsm8k.
How to Test
Reproduce Llama-3-8B published scores within ±1 point.
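Whether a score is "within ±1 point" depends on sampling noise, so it helps to attach a confidence interval to each accuracy. The following is a minimal percentile-bootstrap sketch over per-example correctness flags; the function name, the resample count, and the fixed seed are illustrative assumptions.

```python
import random
from typing import List, Tuple


def bootstrap_accuracy_ci(
    correct: List[int],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(correct)
    point = sum(correct) / n
    # Resample the examples with replacement and recompute accuracy each time.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


# Hypothetical per-example results: 1 = correct, 0 = incorrect.
flags = [1] * 70 + [0] * 30
accuracy, ci_low, ci_high = bootstrap_accuracy_ci(flags)
```

On small subsets the interval is often wider than one point, in which case a one-point gap from the published number is not by itself evidence of a harness bug.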
Resume Bullet
"Built an LLM evaluation harness covering MMLU/HellaSwag/GSM8K (likelihood + generation modes); reproduced published Llama-3-8B benchmark numbers within ±1 point with bootstrap CIs."
Extensions
Contribute a new task to EleutherAI/lm-evaluation-harness.
Steps
1) Pick 30 prompts; generate responses from 3 models. 2) Use a strong judge (GPT-4 / Claude) for pairwise comparison. 3) Compute Elo ratings. 4) Quantify position bias (how often does the first response win?). 5) Mitigate via swap-and-average: judge each pair in both orders and average the two verdicts.
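The Elo computation in step 3 can be sketched as a sequential update over judged battles. The K-factor of 32, the initial rating of 1000, and the model names are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def compute_elo(
    battles: Iterable[Tuple[str, str, float]],
    k: float = 32.0,
    initial: float = 1000.0,
) -> Dict[str, float]:
    """Each battle is (model_a, model_b, score_a) with score_a in {1.0, 0.5, 0.0}."""
    ratings: Dict[str, float] = {}
    for model_a, model_b, score_a in battles:
        rating_a = ratings.setdefault(model_a, initial)
        rating_b = ratings.setdefault(model_b, initial)
        delta = k * (score_a - expected_score(rating_a, rating_b))
        # The update is zero-sum: what A gains, B loses.
        ratings[model_a] = rating_a + delta
        ratings[model_b] = rating_b - delta
    return ratings


# Hypothetical judged battles among three candidate models.
battles = [("model_x", "model_y", 1.0), ("model_y", "model_z", 0.5), ("model_x", "model_z", 1.0)]
leaderboard = sorted(compute_elo(battles).items(), key=lambda item: -item[1])
```

Sequential Elo depends on the order of battles; with only 30 prompts it is reasonable to shuffle the battle list several times and average the resulting ratings.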
Stack
OpenAI / Anthropic API; or open-weight Llama-3-70B hosted on Together
Datasets
MT-Bench prompts (free)
Output
An Elo leaderboard + a bias-mitigation report.
How to Test
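Measure the position-bias delta between raw and swap-averaged scores. With each pair judged in both orders, an unbiased judge picks the first response about half the time.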
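One way to quantify position bias and apply swap-and-average is sketched below. It assumes each pair was judged twice, once with response A shown first and once with B shown first, and that each verdict is recorded as "A", "B", or "tie". The data layout and function names are hypothetical.

```python
from typing import List, Tuple

# Score from A's point of view for each possible verdict.
SCORE_FOR_A = {"A": 1.0, "tie": 0.5, "B": 0.0}


def first_position_win_rate(pairs: List[Tuple[str, str]]) -> float:
    """pairs holds (verdict_with_A_first, verdict_with_B_first); ties are skipped."""
    first_wins = 0
    decided = 0
    for verdict_a_first, verdict_b_first in pairs:
        for verdict, shown_first in ((verdict_a_first, "A"), (verdict_b_first, "B")):
            if verdict == "tie":
                continue
            decided += 1
            if verdict == shown_first:
                first_wins += 1
    return first_wins / decided if decided else float("nan")


def swap_averaged_score(verdict_a_first: str, verdict_b_first: str) -> float:
    """Average A's score over both presentation orders."""
    return (SCORE_FOR_A[verdict_a_first] + SCORE_FOR_A[verdict_b_first]) / 2


# Hypothetical verdicts for four prompt pairs.
pairs = [("A", "B"), ("A", "A"), ("B", "B"), ("tie", "A")]
bias = first_position_win_rate(pairs)
scores = [swap_averaged_score(a_first, b_first) for a_first, b_first in pairs]
```

A pair where the verdict flips with the order (the first entry above) averages to 0.5, which is how swap-and-average turns an order-dependent judgment into a tie instead of a spurious win.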
Talking Points
Why LLM judges are biased. When to use them anyway. Length-bias remediation.
Resume Bullet
"Implemented an MT-Bench-style pairwise LLM-as-judge harness with swap-position bias mitigation, producing Elo rankings across 3 candidate models with bootstrap confidence intervals."
Extensions
Add a Chatbot Arena-style crowd-eval simulation; correlate with human ratings.
Steps
1) Build a 50-question eval set for your Phase 7 corpus. 2) Run pipeline → record (query, contexts, answer, ground_truth). 3) Run RAGAS metrics. 4) Tune chunking / retrieval and observe metric movement.
Stack
ragas, your Phase 7 RAG service
Output
A 4×N metrics table (4 RAGAS metrics × N pipeline configurations) + an ablation report (chunk size, k, re-ranker on/off).
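The ablation table can be produced by looping over a configuration grid, as in the sketch below. The `evaluate` callable is a hypothetical stand-in for "rebuild the index, run the 50 questions, score with RAGAS"; the metric keys are assumed to mirror the four metrics named in this project.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List

METRICS = ("faithfulness", "answer_relevancy", "context_precision", "context_recall")


def run_ablations(
    evaluate: Callable[[int, int, bool], Dict[str, float]],
    chunk_sizes: Iterable[int],
    top_ks: Iterable[int],
    rerank_options: Iterable[bool] = (False, True),
) -> List[Dict[str, object]]:
    """Return one row per (chunk_size, top_k, rerank) configuration."""
    rows: List[Dict[str, object]] = []
    for chunk_size, top_k, rerank in product(chunk_sizes, top_ks, rerank_options):
        scores = evaluate(chunk_size, top_k, rerank)  # user-supplied pipeline run
        row: Dict[str, object] = {"chunk_size": chunk_size, "top_k": top_k, "rerank": rerank}
        row.update({metric: scores[metric] for metric in METRICS})
        rows.append(row)
    return rows
```

Keeping the eval set and the judge model fixed across all rows is what makes the table readable as an ablation: only the three pipeline settings vary between configurations.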
How to Test
Faithfulness should drop when you raise temperature; context recall should rise with k.
Talking Points
Why faithfulness ≠ answer relevance. Why context precision matters for cost. The eval-set-creation challenge.
Resume Bullet
"Integrated RAGAS faithfulness/relevance/precision/recall metrics into a production RAG pipeline; ran 6 ablations (chunking × top-k × rerank) producing a quantified design-decision table."
Extensions
Add LLM-judge calibration (compare with human ratings on 30 examples).
Steps
1) Curate 50 adversarial prompts across 5 categories. 2) Measure attack success rate (ASR) against the base model and against the SFT model. 3) Add an input classifier (Llama-Guard or a custom small classifier). 4) Measure the ASR drop.
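Per-category ASR from step 2 can be tallied as in this sketch. It assumes each attempt has already been labelled as succeeded or not (by hand or by a judge model); the category names are placeholders.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def attack_success_rate(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """results holds (category, succeeded); returns ASR per category plus 'overall'."""
    attempts: Counter = Counter()
    successes: Counter = Counter()
    for category, succeeded in results:
        attempts[category] += 1
        attempts["overall"] += 1
        if succeeded:
            successes[category] += 1
            successes["overall"] += 1
    return {name: successes[name] / attempts[name] for name in attempts}


# Hypothetical labelled attempts against one model.
labelled = [("roleplay", True), ("roleplay", False), ("encoding", False), ("encoding", False)]
asr = attack_success_rate(labelled)
```

Running the same function on the base model's and the SFT model's labelled outputs gives the two columns needed for the before/after comparison.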
Stack
meta-llama/Llama-Guard-3-8B, your fine-tuned model from Phase 6
Datasets
AdvBench, plus your own prompts
Output
A red-team report (categorized attack examples, ASR before/after filter).
How to Test
ASR drops meaningfully with the safety filter, while the over-refusal rate on a set of benign prompts stays within a threshold you set in advance.
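Both sides of this test can be computed together. The sketch below assumes two labelled sets: adversarial prompts with a flag for whether the filter blocked them and whether the unfiltered model complied, and benign prompts with a flag for whether the filter blocked them. The names and sample values are hypothetical.

```python
from typing import Dict, List, Tuple


def filter_tradeoff(
    attacks: List[Tuple[bool, bool]],
    benign_blocked: List[bool],
) -> Dict[str, float]:
    """attacks holds (blocked_by_filter, model_complied) for each adversarial prompt."""
    n_attacks = len(attacks)
    # Without the filter, every prompt the model complies with is a successful attack.
    asr_before = sum(1 for _, complied in attacks if complied) / n_attacks
    # With the filter, an attack succeeds only if it passes the filter and the model complies.
    asr_after = sum(1 for blocked, complied in attacks if complied and not blocked) / n_attacks
    # Over-refusal: benign prompts that the filter wrongly blocks.
    over_refusal = sum(1 for blocked in benign_blocked if blocked) / len(benign_blocked)
    return {"asr_before": asr_before, "asr_after": asr_after, "over_refusal": over_refusal}


# Hypothetical labels for four attacks and four benign prompts.
report = filter_tradeoff(
    attacks=[(True, True), (False, True), (True, False), (False, False)],
    benign_blocked=[False, False, True, False],
)
```

Reporting all three numbers side by side keeps the tradeoff visible: a filter that blocks everything would drive `asr_after` to zero while pushing `over_refusal` to one.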
Talking Points
The over-refusal problem (false positives degrade utility). Why an input filter on top of refusal training beats relying on training-time refusals alone.
Resume Bullet
"Conducted structured red-team across 5 jailbreak categories (50 prompts); reduced attack-success rate from 64% to 11% by adding a Llama-Guard input classifier with quantified over-refusal tradeoff."
Extensions
Train a custom small safety classifier on collected attack data.