Lab 01 — Eval Harness for MCQ Tasks (Solution Walkthrough)

Phase: 8 — Evaluation & Safety | Difficulty: ⭐⭐⭐☆☆ | Time: 2–4 hours

Concept primer: ../HITCHHIKERS-GUIDE.md §Evaluation, §Likelihood scoring.

Run

pip install -r requirements.txt
python solution.py --model gpt2 --task hellaswag --limit 200

0. The mission

Implement a likelihood-based MCQ evaluator from scratch and validate that your numbers match lm-evaluation-harness (the de facto standard behind many public model leaderboards).

The point: when you read "GPT-X scored 87.3 on MMLU", you should know exactly how that number was produced — because there are a dozen ways to score MCQ tasks and they don't agree. The most cited setup is continuation log-likelihood: score log P(choice | context) for each option and pick the highest.

You will reproduce GPT-2's HellaSwag score (≈ 0.29 accuracy, near random for a 4-way task) and feel why bigger models matter.

1. The likelihood score — the canonical formulation

For a question with context $c$ and candidate continuations $\{a_1, \ldots, a_K\}$:

$$ \hat{a} = \arg\max_k \sum_{t=1}^{|a_k|} \log P_\theta(a_k^{(t)} \mid c, a_k^{(<t)}) $$

That is: concatenate (context, choice), run the model, sum the log-probs of the choice tokens only (not the context tokens), pick the highest-scoring choice.

Why sum, not mean?

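Summing penalizes longer choices, because every extra token adds another negative log-prob. Taking the mean (length normalization) removes most of that penalty. The unnormalized HellaSwag metric (acc) uses the sum: the endings are roughly equal in length, and the summed log-likelihood is the quantity the model defines directly. MMLU needs neither, because the choices are single letters: it compares only the next-token log-probs of " A", " B", " C", " D".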

Length normalization variants

None (sum): HellaSwag, ARC. Default.
Per-token (mean): some StoryCloze setups.
Per-byte: Pile-style perplexity comparison across tokenizers.
Single-token MCQ: MMLU — score only the " X" letter token. Much faster but tokenizer-dependent (BPE quirks around leading spaces matter).

This lab implements sum (the HellaSwag setup) and single-token MCQ (the MMLU setup) so you've seen both.
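The variants above differ only in what the summed log-prob is divided by. A minimal sketch, assuming you already have per-token log-probs for each choice; the function names and the toy numbers are made up for illustration and are not part of solution.py:

```python
from typing import List, Sequence, Tuple


def score(token_logprobs: Sequence[float], choice_text: str, mode: str = "sum") -> float:
    """Combine per-token log-probs of one choice under a normalization mode."""
    total = sum(token_logprobs)
    if mode == "sum":    # no normalization
        return total
    if mode == "mean":   # per-token normalization
        return total / len(token_logprobs)
    if mode == "byte":   # per-byte normalization, comparable across tokenizers
        return total / len(choice_text.encode("utf-8"))
    raise ValueError(f"unknown mode: {mode}")


def pick(choices: List[Tuple[str, Sequence[float]]], mode: str) -> int:
    """Return the index of the highest-scoring choice."""
    scores = [score(lps, text, mode) for text, lps in choices]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy log-probs (invented): a short ending with mediocre tokens versus a
# long ending whose tokens are individually more likely.
choices = [
    ("is holding a rake.", [-2.0, -2.0, -2.0]),                    # sum -6.0, mean -2.0
    ("is using a paint roller to paint the roof.", [-1.0] * 8),    # sum -8.0, mean -1.0
]

print(pick(choices, "sum"))   # 0: the short ending wins on raw sum
print(pick(choices, "mean"))  # 1: the long ending wins once length is divided out
```

The same two endings flip rank depending on the mode, which is why a reported score is only meaningful together with its normalization.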

2. Loading the dataset

from datasets import load_dataset
ds = load_dataset("hellaswag", split="validation").select(range(args.limit))

A HellaSwag example:

{
    "ctx": "A man is sitting on a roof. He",
    "endings": [
        "is using wrap to wrap a pair of skis.",
        "is ripping level tiles from the roof.",
        "is holding a rake.",
        "is using a paint roller to paint the roof.",  # correct
    ],
    "label": "3",
}

Note: label is a string in HF's HellaSwag, not an int. Cast with int(ex["label"]).

3. Computing per-choice log-likelihood

import torch
import torch.nn.functional as F

@torch.no_grad()
def score_choice(model, tokenizer, context: str, choice: str) -> float:
    """Return the summed log-likelihood of `choice` given `context`."""
    ctx_ids = tokenizer.encode(context, add_special_tokens=False)
    full_ids = tokenizer.encode(context + " " + choice, add_special_tokens=False)
    # Assumes ctx_ids is a prefix of full_ids, i.e. the tokenizer does not
    # merge tokens across the context/choice boundary.
    n_ctx = len(ctx_ids)

    input_ids = torch.tensor([full_ids], device=model.device)
    logits = model(input_ids).logits[0]                  # (T, V)

    # logits at position t predict token at t+1, so we shift
    log_probs = F.log_softmax(logits[:-1], dim=-1)       # (T-1, V)
    targets = input_ids[0, 1:]                           # (T-1,)
    token_lls = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Keep only the log-likelihoods of the choice tokens:
    # token_lls[i] scores input_ids[i+1], so the first choice token is at n_ctx-1.
    return token_lls[n_ctx-1:].sum().item()

The two subtleties that trip up everyone:

3.1 The off-by-one shift

A decoder LM at position $t$ predicts the token at position $t+1$. So logits[t] is the distribution over token[t+1]. To get log-prob of target[i], you look at logits[i-1]. We implement this by log_softmax(logits[:-1]) and targets = input_ids[0, 1:] — standard idiom.

3.2 The `n_ctx-1` slice

The context's $n_\text{ctx}$ tokens occupy positions 0..n_ctx-1 in input_ids. The choice tokens occupy n_ctx..T-1. After the shift, token_lls[i] is the log-prob of input_ids[i+1]. So choice tokens' log-probs are at token_lls[n_ctx-1 : T-1] — i.e., starting at index n_ctx - 1.

Getting this off by one adds or drops one token's log-prob in every score and silently changes accuracy by 1–3%. To verify the shift, sum over the entire sequence (token_lls.sum(), with no slice): the result should equal -loss × (T-1), where loss is the mean cross-entropy returned by model(input_ids, labels=input_ids) and T-1 is the number of predicted tokens. Always sanity-check this first.
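The shift and the slice can be checked without a model. The sketch below is a toy under stated assumptions: a hand-written table of next-token probabilities stands in for the logits, and the helper names are invented for illustration. It reproduces both index conventions and the sum-versus-mean-loss sanity check.

```python
import math


def token_log_likelihoods(input_ids, log_probs):
    """log_probs[t][v] is log P(token at t+1 == v | tokens 0..t).

    Returns T-1 values; entry i is the log-prob of input_ids[i + 1].
    """
    rows = log_probs[:-1]    # the last row predicts a token we never see
    targets = input_ids[1:]  # nothing predicts the first token
    return [row[tok] for row, tok in zip(rows, targets)]


def choice_score(input_ids, log_probs, n_ctx):
    """Sum the log-probs of the tokens after the first n_ctx context tokens."""
    return sum(token_log_likelihoods(input_ids, log_probs)[n_ctx - 1:])


# Toy "model output": one probability row per input position, vocabulary of 3.
probs = [
    [0.5, 0.25, 0.25],
    [0.5, 0.25, 0.25],
    [0.5, 0.125, 0.375],
    [0.5, 0.25, 0.25],
    [1 / 3, 1 / 3, 1 / 3],   # never used: it predicts the token after the end
]
log_probs = [[math.log(p) for p in row] for row in probs]
input_ids = [2, 0, 1, 1, 0]  # 3 context tokens followed by 2 choice tokens
n_ctx = 3

token_lls = token_log_likelihoods(input_ids, log_probs)
score = choice_score(input_ids, log_probs, n_ctx)  # log(0.125) + log(0.5)
wrong = sum(token_lls[n_ctx:])                     # off by one: drops the first choice token

# Sanity check: the unsliced sum equals -(mean NLL) * (T - 1).
mean_nll = -sum(token_lls) / (len(input_ids) - 1)
assert math.isclose(sum(token_lls), -mean_nll * (len(input_ids) - 1))
```

Here the off-by-one version silently scores one token instead of two, which is exactly the kind of error that never raises an exception.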

4. Per-example evaluation

def evaluate_hellaswag(model, tokenizer, ds):
    correct = 0
    for ex in tqdm(ds):
        scores = [score_choice(model, tokenizer, ex["ctx"], ending) for ending in ex["endings"]]
        pred = int(np.argmax(scores))
        if pred == int(ex["label"]):
            correct += 1
    return correct / len(ds)

For K choices and N examples, you do N × K forward passes. HellaSwag has K=4 → 4× the cost of a single pass. For 200 examples on GPT-2 small, ~30 seconds on a 4090.

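Optimization: batch all K choices for one example into one forward pass with padding. For large multi-token evaluations (for example the full HellaSwag validation set, ~10k examples × 4 endings), this cuts the number of forward passes by K; the realized speedup is at most about 4× and depends on padding waste and GPU utilization.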
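The bookkeeping for that batching is mostly index arithmetic. A minimal sketch, assuming right-padding and token IDs that are already computed; pad_choices is a hypothetical helper, not part of solution.py. GPT-2 ships without a pad token, so reusing its end-of-text ID as pad_id is a common workaround and an assumption here.

```python
from typing import List, Tuple


def pad_choices(
    ctx_ids: List[int], choice_ids_list: List[List[int]], pad_id: int
) -> Tuple[List[List[int]], List[List[int]], List[Tuple[int, int]]]:
    """Right-pad (context + choice) rows to a common length.

    Returns input_ids, attention_mask, and for each row the half-open span
    of its choice tokens in shifted token_lls coordinates.
    """
    rows = [ctx_ids + choice for choice in choice_ids_list]
    max_len = max(len(row) for row in rows)
    input_ids = [row + [pad_id] * (max_len - len(row)) for row in rows]
    attention_mask = [[1] * len(row) + [0] * (max_len - len(row)) for row in rows]
    # token_lls[i] scores input_ids[i + 1], so choice tokens start at
    # n_ctx - 1 and stop before len(row) - 1; padding lies outside the span.
    spans = [(len(ctx_ids) - 1, len(row) - 1) for row in rows]
    return input_ids, attention_mask, spans


input_ids, attention_mask, spans = pad_choices(
    ctx_ids=[5, 6, 7], choice_ids_list=[[1], [2, 3, 4]], pad_id=0
)
print(input_ids)  # [[5, 6, 7, 1, 0, 0], [5, 6, 7, 2, 3, 4]]
print(spans)      # [(2, 3), (2, 5)]
```

After the batched forward pass, summing token_lls[k][start:end] for each row gives the same per-choice score as the unbatched loop, provided the padded positions are excluded as the spans do here.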

5. Single-token MCQ (the MMLU setup)

def evaluate_mmlu(model, tokenizer, ds):
    letters = ["A", "B", "C", "D"]
    # Token IDs for " A", " B", " C", " D" (leading space included); computed once
    choice_ids = [tokenizer.encode(" " + L, add_special_tokens=False)[0]
                  for L in letters]
    correct = 0
    for ex in tqdm(ds):
        prompt = format_mmlu_prompt(ex)              # ends with "Answer:"
        ids = tokenizer.encode(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(ids).logits[0, -1]         # last position only
        # argmax over raw logits equals argmax over log P(" A"), ..., log P(" D")
        scores = logits[choice_ids]
        pred = scores.argmax().item()                 # index 0-3
        # Some MMLU copies store the gold answer as an int index, others as a letter
        gold = ex["answer"]
        gold = letters.index(gold) if isinstance(gold, str) else int(gold)
        if pred == gold:
            correct += 1
    return correct / len(ds)

Key points:

Only the last logit matters — we compare the model's distribution over the next token.
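Leading space matters. tokenizer.encode(" A") and tokenizer.encode("A") produce different IDs in BPE tokenizers. The prompt ends with "Answer:" and no trailing space, so the token the model would emit next is " A" with its leading space; score that ID.
5-shot MMLU is standard: prepend 5 example Q&A pairs from the dev split before the test question. This usually raises scores for models strong enough to use the examples, in large part because the shots demonstrate the answer format.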
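One detail behind the first key point: picking the largest raw logit among the four letters gives the same prediction as picking the largest log-prob, because log-softmax subtracts the same constant from every logit. A small sketch with made-up numbers; the six-entry vocabulary and the choice IDs are hypothetical stand-ins for the real token IDs of " A" to " D":

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


last_logits = [1.2, -0.3, 3.1, 0.4, 2.2, -1.0]  # toy logits at the last position
choice_ids = [1, 2, 4, 5]                       # hypothetical IDs of " A", " B", " C", " D"

raw_scores = [last_logits[i] for i in choice_ids]
all_log_probs = log_softmax(last_logits)
log_prob_scores = [all_log_probs[i] for i in choice_ids]

pred_raw = argmax(raw_scores)
pred_lp = argmax(log_prob_scores)
print("ABCD"[pred_raw], "ABCD"[pred_lp])  # B B
```

The log-probs only become necessary when you want calibrated confidences or need to compare scores across different prompts.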

6. The MMLU prompt template

def format_mmlu_prompt(ex):
    return (
        f"The following is a multiple choice question.\n\n"
        f"Question: {ex['question']}\n"
        f"A) {ex['choices'][0]}\n"
        f"B) {ex['choices'][1]}\n"
        f"C) {ex['choices'][2]}\n"
        f"D) {ex['choices'][3]}\n"
        f"Answer:"
    )

Different harnesses use different templates (some use "The answer is", some omit the labels). Reported scores depend on the template. This is why you can't directly compare numbers from different papers without checking the eval setup.
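To make the template dependence concrete, here is one way a few-shot variant of the same template could be assembled. It is a sketch, not the harness's exact format: the helper names are invented, it assumes each example stores its answer as an int index 0-3, and it reuses the header line from the 0-shot template above.

```python
LETTERS = "ABCD"


def format_question(ex, with_answer=False):
    """Render one question; answered shots end with 'Answer: X'."""
    lines = [f"Question: {ex['question']}"]
    lines += [f"{LETTERS[i]}) {choice}" for i, choice in enumerate(ex["choices"])]
    # Assumption: ex["answer"] is an int index into ex["choices"].
    lines.append("Answer:" + (f" {LETTERS[ex['answer']]}" if with_answer else ""))
    return "\n".join(lines)


def format_few_shot_prompt(dev_examples, test_ex, k=5):
    """Prepend up to k answered dev examples before the unanswered test question."""
    header = "The following is a multiple choice question.\n\n"
    shots = [format_question(ex, with_answer=True) for ex in dev_examples[:k]]
    return header + "\n\n".join(shots + [format_question(test_ex)])


dev = [
    {"question": "1 + 1 = ?", "choices": ["1", "2", "3", "4"], "answer": 1},
    {"question": "2 + 2 = ?", "choices": ["4", "5", "6", "7"], "answer": 0},
]
test_ex = {"question": "3 + 3 = ?", "choices": ["5", "6", "7", "8"], "answer": 1}

prompt = format_few_shot_prompt(dev, test_ex, k=2)
print(prompt)
```

Note that each shot ends with a space plus the letter, which matches the " A"-style token the scorer looks up for the final, unanswered question.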

7. Expected output

[hellaswag] gpt2 (124M)   acc=0.292   n=200
[hellaswag] gpt2-medium   acc=0.339   n=200
[hellaswag] gpt2-large    acc=0.366   n=200

Sanity calibration (approximate published values; check whether a source reports acc or acc_norm, and how many shots it used, before comparing):

Model	HellaSwag (norm acc)	MMLU 5-shot
Random	0.25	0.25
GPT-2 124M	0.29–0.31	~0.26 (basically random)
GPT-2 large	0.36	~0.27
Llama-2-7B	0.78	0.46
Llama-3-8B	0.82	0.66
GPT-4	0.95	0.86

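If your number on the full validation set is more than ~2 percentage points off the published value, you have a bug. With --limit 200 the sampling noise alone is roughly ±3 points, so treat small runs as a smoke test. Most common bugs: off-by-one in the slice, wrong token ID for the leading-space letter, missing newlines in the template.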
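How much disagreement counts as a bug depends on sample size. A quick estimate, treating accuracy as a binomial proportion; the 0.29 figure is the GPT-2 value quoted above, and 10,042 is the size of the HellaSwag validation split:

```python
import math


def accuracy_stderr(acc: float, n: int) -> float:
    """Standard error of an accuracy estimated from n independent examples."""
    return math.sqrt(acc * (1.0 - acc) / n)


se_200 = accuracy_stderr(0.29, 200)      # about 0.032, i.e. ~3.2 points
se_full = accuracy_stderr(0.29, 10_042)  # about 0.0045, i.e. ~0.5 points

print(f"n=200:   +/- {se_200:.3f}")
print(f"n=10042: +/- {se_full:.4f}")
```

So a 2-point gap on 200 examples is within one standard error, while the same gap on the full split is several standard errors and worth debugging.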

8. Why this matters for safety / alignment work

MCQ evals like MMLU are proxies for capability. For safety, you also need:

Refusal evals — does the model refuse harmful requests? (Built similarly: score the model's response, classify with a separate judge.)
Jailbreak robustness — does the model refuse even with adversarial prompts?
Truthfulness — TruthfulQA (multiple-choice, set up like MMLU but specifically targets common misconceptions).
Bias — BBQ, CrowS-Pairs.
LLM-as-judge evals (MT-Bench, AlpacaEval) for free-form responses — use a strong model to score.

This lab's mechanics (likelihood scoring + tokenizer care + template control) are the foundation for every one of those.

9. Common pitfalls

Off-by-one in the slice — silent 1–3% accuracy drift.
Forgetting the leading space in single-token MCQ — you score the wrong token IDs entirely.
Not normalizing by length when choices vary wildly in length — longer choices look worse purely from cumulative log-prob.
Using logits[:, -1] to score a multi-token choice instead of slicing per-position — that row predicts the token after the sequence ends, so none of the choice tokens get scored.
Tokenizer mismatch — using GPT-2 tokenizer to encode for a Llama model. Always AutoTokenizer.from_pretrained(model_id).
Not setting model.eval() — dropout activates, scores become non-deterministic.
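A related tokenizer check is cheap to automate: the slice in score_choice is only valid if the context's token IDs are a prefix of the IDs for context plus choice. The sketch below uses a toy greedy tokenizer (invented here) so it runs without downloads; with a real tokenizer you would pass something like lambda s: tokenizer.encode(s, add_special_tokens=False) as the encode argument.

```python
from typing import Callable, List

# Toy vocabulary, longest entries first so greedy matching prefers merges.
VOCAB = ["ab", "a", "b", "c", " "]


def toy_encode(text: str) -> List[int]:
    """Greedy longest-match tokenizer over VOCAB (illustration only)."""
    ids, i = [], 0
    while i < len(text):
        for tok_id, tok in enumerate(VOCAB):
            if text.startswith(tok, i):
                ids.append(tok_id)
                i += len(tok)
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids


def context_is_prefix(
    encode: Callable[[str], List[int]], context: str, choice: str, sep: str = " "
) -> bool:
    """True if encoding the context alone yields a prefix of the full encoding."""
    ctx_ids = encode(context)
    full_ids = encode(context + sep + choice)
    return full_ids[: len(ctx_ids)] == ctx_ids


print(context_is_prefix(toy_encode, "a", "b", sep=" "))  # True: the boundary is preserved
print(context_is_prefix(toy_encode, "a", "b", sep=""))   # False: "a" + "b" merges into "ab"
```

When the check fails on real data, len(ctx_ids) no longer marks where the choice begins, and the slice silently scores the wrong tokens.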

10. Stretch exercises

Add length-normalized scoring as a flag; compare HellaSwag accuracy with/without. The leaderboard reports acc_norm (length-normalized), which is typically higher than acc: roughly 2 points for GPT-2 small, and often more for larger models.
Implement few-shot MMLU: prepend 5 dev examples; compare to 0-shot.
Cross-validate against lm-eval-harness: install it, run the same model+task, confirm your numbers match within 0.5%.
Add GSM8K: free-form generation + answer extraction (regex ####\s*(-?\d+)). Different evaluation paradigm — generative not likelihood.
Implement a refusal eval: a small set of harmful prompts; score whether the model output starts with refusal phrases ("I can't", "I won't", "As an AI"). Compare a base model to its instruction-tuned version.
Profile inference cost: how many GPU-hours to evaluate Llama-3-8B on full MMLU (14k questions)? Compare batched vs unbatched.
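For the GSM8K exercise, the answer-extraction step might look like the sketch below. It starts from the regex given above and adds one assumption of mine: thousands separators such as "1,234" are accepted and stripped. The function name is hypothetical.

```python
import re
from typing import Optional

# Same "#### <number>" marker as the exercise, extended to allow commas.
ANSWER_RE = re.compile(r"####\s*(-?\d[\d,]*)")


def extract_answer(text: str) -> Optional[int]:
    """Return the integer after the '####' marker, or None if it is absent."""
    match = ANSWER_RE.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


print(extract_answer("She sells 9 * 2 = 18 eggs.\n#### 18"))  # 18
print(extract_answer("Total cost is high.\n#### 1,234"))       # 1234
print(extract_answer("I think the answer is 18."))             # None
```

A generation with no marker returns None and should be counted as wrong rather than skipped, otherwise accuracy is inflated by the examples the model failed to format.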

11. What this lab proves about you

You understand exactly what's behind the numbers in every model paper. You can build a custom eval for a new task in an hour. You can debug a 1% accuracy discrepancy by tracing through tokenization → slicing → scoring. This is the bar for Phase-8 — and the entry point to alignment & evaluation engineering roles at Anthropic, OpenAI, DeepMind.
