05 — Research-Engineering Questions

Asked in pretraining / research-engineer interviews (Anthropic, OpenAI, DeepMind, Meta, xAI). Less coding, more "how would you debug / decide / measure".

A. Numerical Stability & Debugging

Q. Loss is NaN at step 500 of a previously-stable run. Walk me through diagnosis.

Snapshot the bad step's data + the prior 5 checkpoints.
Re-run from two checkpoints before the failure (N-2) with deterministic mode + grad anomaly detection. Reproduce.
Find the first NaN: is it in activations (forward) or gradients (backward)?
Forward NaN → check for inf in logits (saturated softmax?), look at LayerNorm with zero variance, look at attention scores with an all -inf row (mask bug).
Backward NaN → grad clipping not aggressive enough; AdamW eps too small; FP16/FP8 overflow or underflow (check loss scaling).
Often: a single bad batch (very long sequence + repeated chars). Add data filtering or grad norm spike detector → skip + log + alert.
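The "grad norm spike detector → skip + log + alert" idea can be sketched as below. This is a minimal illustration, not a production recipe: the class name, the window size, and the 5x-median threshold are all assumptions chosen for the example.

```python
from collections import deque
from statistics import median
import math


class GradNormSpikeDetector:
    """Flags a step whose grad norm is non-finite or far above the recent median."""

    def __init__(self, window=100, factor=5.0, min_history=10):
        self.history = deque(maxlen=window)  # recent healthy grad norms
        self.factor = factor                 # hypothetical spike threshold (x median)
        self.min_history = min_history       # do not judge spikes before this many steps

    def should_skip(self, grad_norm):
        if not math.isfinite(grad_norm):
            return True  # NaN/inf: always skip
        if len(self.history) >= self.min_history:
            if grad_norm > self.factor * median(self.history):
                return True  # spike: skip and keep it out of the history
        self.history.append(grad_norm)
        return False


detector = GradNormSpikeDetector()
# Synthetic grad norms: 20 normal steps, one spike, one NaN, one normal step.
norms = [1.0 + 0.01 * i for i in range(20)] + [50.0, float("nan"), 1.2]
skipped = [i for i, g in enumerate(norms) if detector.should_skip(g)]
print(skipped)  # steps to skip, log, and alert on
```

In a real loop the skipped step would also dump the offending batch so it can be inspected offline.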

Q. Loss looks fine but eval is regressing. What's happening?

Possibilities:

Train/eval distribution mismatch
Memorization of the training set (overfitting) → check train loss vs eval loss curves
Eval contamination (train data leaked into eval)
Tokenizer mismatch between train and eval prompts
Wrong eval prompt template (chat models very sensitive)
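For the contamination possibility in the list above, a common first check is n-gram overlap between train and eval text. The sketch below assumes whitespace tokenization and an in-memory train set, both simplifications; 13 is a commonly used window size, and the demo uses n=4 only to keep the strings short.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_contaminated(train_docs, eval_docs, n=13):
    """Return one bool per eval doc: True if it shares any n-gram with train."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc.split(), n)
    return [bool(ngrams(doc.split(), n) & train_grams) for doc in eval_docs]


train = ["the quick brown fox jumps over the lazy dog"]
evals = [
    "a quick brown fox jumps high",             # shares a 4-gram with train
    "completely unrelated eval question here",  # no overlap
]
flagged = flag_contaminated(train, evals, n=4)
print(flagged)
```

At real scale the train-side set would be replaced by a Bloom filter or hashed index; the logic stays the same.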

Q. How do you know if your model is undertrained?

Loss still has slope at end of run → token budget too small
Eval scores still climbing → continue
Compare to Chinchilla scaling law: compute-optimal tokens ≈ 20× params for dense models; go well beyond that when the model size is fixed (e.g., by inference cost) and you are willing to over-train.

B. Scaling Laws

Q. State the Chinchilla finding.

For a fixed compute budget C ≈ 6 N D (N = params, D = tokens), loss is minimized when N and D are scaled up in roughly equal proportion (each ∝ C^0.5), which works out to D ≈ 20×N tokens. Earlier laws (Kaplan) overweighted N → trained 175B models on too few tokens.
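The arithmetic is worth being able to do on a whiteboard. A small sketch, assuming the C ≈ 6 N D approximation and the 20 tokens/param ratio from the answer above (the function name is made up for illustration):

```python
import math


def compute_optimal(flops, tokens_per_param=20.0):
    """Solve C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


# Budget chosen so the answer lands on round numbers: 6 * 70e9 * 1.4e12 FLOPs.
n, d = compute_optimal(5.88e23)
print(f"N = {n:.3g} params, D = {d:.3g} tokens")
```

For this budget the solution is about 70B params and 1.4T tokens, which matches the Chinchilla model's own configuration.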

Q. How do you predict the loss of an N-param model from smaller runs?

Run 5-10 small models at varied (N, D), fit a power law L(N, D) = L0 + A/N^α + B/D^β. Extrapolate. Validate the extrapolation by training one slightly-larger model and checking it falls on the curve. This is how you decide whether the next compute order of magnitude is worth spending.
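A minimal sketch of the fit-then-extrapolate loop, simplified to one variable, L(N) = L0 + A/N^α (i.e., D held large or fixed). The data is synthetic, generated from made-up constants, and the grid-scan-plus-log-regression fit is just one simple way to do it with no dependencies; real fits usually use a nonlinear least-squares solver over both N and D.

```python
import math


def fit_power_law(ns, losses, grid=2000):
    """Fit L(N) = L0 + A / N**alpha by scanning L0 and regressing in log space."""
    xs = [math.log(n) for n in ns]
    mx = sum(xs) / len(xs)
    sxx = sum((x - mx) ** 2 for x in xs)
    l_min = min(losses)
    best = None
    for i in range(grid):
        l0 = l_min * i / grid  # candidate irreducible loss in [0, min loss)
        ys = [math.log(l - l0) for l in losses]
        my = sum(ys) / len(ys)
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        sse = sum((intercept + slope * x - y) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, l0, math.exp(intercept), -slope)
    _, l0, a, alpha = best
    return l0, a, alpha


# Synthetic "small runs" from hypothetical constants (not measured values).
TRUE_L0, TRUE_A, TRUE_ALPHA = 1.7, 400.0, 0.34
ns = [1e7, 3e7, 1e8, 3e8, 1e9, 3e9]
losses = [TRUE_L0 + TRUE_A / n ** TRUE_ALPHA for n in ns]

l0, a, alpha = fit_power_law(ns, losses)
predicted = l0 + a / 1e10 ** alpha            # extrapolate to a 10B model
actual = TRUE_L0 + TRUE_A / 1e10 ** TRUE_ALPHA
print(f"L0={l0:.3f} alpha={alpha:.3f} predicted={predicted:.3f} actual={actual:.3f}")
```

With noisy real runs the interesting number is the gap between predicted and actual on the held-out larger model; that gap is what tells you whether to trust the next extrapolation.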

Q. What scales sublinearly with model size and what scales super-linearly?

Sub: bytes per param (quantization helps), inference latency per token (batch absorbs fixed cost), data preparation cost.
Super: KV-cache memory per request × concurrency, eval cost (more capabilities to test), engineering complexity (parallelism interactions).
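The KV-cache item is easy to quantify. A back-of-envelope sketch, assuming standard attention with a dense cache; the config below is a hypothetical 7B-class model, not any specific release:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV-cache size: 2x (keys and values) per layer, per KV head, per token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical config: 32 layers, 32 KV heads, head_dim 128, fp16 cache.
per_request = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1)
fleet = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=64)
print(per_request / 2**30, "GiB per request")
print(fleet / 2**30, "GiB at 64 concurrent requests")
```

Grouped-query attention shrinks this by reducing n_kv_heads, which is why it shows up in most serving-oriented architectures.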

C. Optimization & Architecture Decisions

Q. Why AdamW and not vanilla Adam?

Adam with L2 regularization adds wd · θ to the gradient, so the penalty gets rescaled by the adaptive 1/√v term: parameters with a large gradient history are decayed less, which is no longer true weight decay. AdamW decouples the decay from the gradient (θ ← θ - η · wd · θ, applied separately from the Adam step). Empirically: better generalization, especially at large scale. It's the default; using Adam in 2024 is a smell.
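A scalar, single-step sketch makes the difference concrete. It assumes step 1 with bias correction, where the Adam update reduces to lr · g / (|g| + eps); the function names and hyperparameters are illustrative only.

```python
import math


def adam_l2_step(theta, grad, lr=0.1, wd=0.1, eps=1e-8):
    """First Adam step with L2 folded into the gradient."""
    g = grad + wd * theta          # penalty enters the gradient
    m_hat, v_hat = g, g * g        # bias-corrected moments at step 1
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


def adamw_step(theta, grad, lr=0.1, wd=0.1, eps=1e-8):
    """First AdamW step: decay applied directly to the weight."""
    m_hat, v_hat = grad, grad * grad
    theta = theta - lr * wd * theta  # decoupled decay, not rescaled by v_hat
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)


for grad in (0.01, 10.0):
    print(grad, adam_l2_step(1.0, grad), adamw_step(1.0, grad))
```

With Adam + L2 the weight lands at about 0.9 either way: the L2 term is normalized away along with the gradient. With AdamW the decay contributes lr · wd · θ = 0.01 regardless of gradient scale, giving about 0.89.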

Q. Why is LR warmup necessary?

At init, the loss surface near a random point has high curvature; large LR steps overshoot and destabilize. Linear warmup over the first 0.5-1% of steps lets the model find a smoother region first. Schedule: linear_warmup → cosine_decay is the workhorse.
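The linear_warmup → cosine_decay schedule as a plain function. A sketch under assumptions: the 1% warmup matches the range above, while the 10% floor (min_lr_ratio) and the peak LR are hypothetical choices for the example.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr, warmup_frac=0.01, min_lr_ratio=0.1):
    """Linear warmup to peak_lr, then cosine decay to min_lr_ratio * peak_lr."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0 over the decay
    return peak_lr * (min_lr_ratio + (1.0 - min_lr_ratio) * cosine)


for s in (0, 99, 100, 5050, 10000):
    print(s, warmup_cosine_lr(s, total_steps=10000, peak_lr=3e-4))
```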

Q. What's WSD and why is it interesting?

Warmup-Stable-Decay: warmup → flat LR for most of training → fast decay (linear, cosine, or similar) over the last 10-20%. Lets you take any checkpoint from the stable phase and run just the short decay phase on it, getting near-optimal final loss without committing to a token budget upfront. Good for "I might want to train longer later."
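A sketch of the WSD shape. Assumptions: the decay here is linear for simplicity (swap in cosine or another shape as preferred), and the warmup length, 10% decay fraction, and peak LR are illustrative values.

```python
def wsd_lr(step, total_steps, peak_lr, warmup_steps=100, decay_frac=0.1, min_lr_ratio=0.0):
    """Warmup -> flat peak LR -> linear decay over the last decay_frac of steps."""
    decay_steps = max(1, int(total_steps * decay_frac))
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr  # stable phase: any checkpoint here can start a decay
    progress = min(1.0, (step - decay_start) / decay_steps)
    return peak_lr * (1.0 - (1.0 - min_lr_ratio) * progress)


for s in (0, 5000, 8999, 9500, 10000):
    print(s, wsd_lr(s, total_steps=10000, peak_lr=3e-4))
```

Extending training later means resuming from a stable-phase checkpoint with a larger total_steps; the decay is only ever paid once, at the end.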

Q. Why use RMSNorm over LayerNorm?

RMSNorm drops the mean subtraction (it only divides by the RMS) and has no bias term. Roughly 10-20% faster for the normalization itself, with no measurable quality loss in practice. Most recent open LLMs (Llama, Qwen, Mistral) use it.
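Side by side on a plain list, to show exactly what is dropped. A sketch only: the learnable gain (and LayerNorm's bias) is omitted, i.e., assumed to be 1 and 0.

```python
import math


def layer_norm(x, eps=1e-6):
    """Subtract the mean, then divide by the standard deviation."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def rms_norm(x, eps=1e-6):
    """Divide by the root-mean-square; no mean subtraction."""
    ms = sum(v * v for v in x) / len(x)
    return [v / math.sqrt(ms + eps) for v in x]


x = [1.0, 2.0, 3.0, 4.0]
print(layer_norm(x))  # zero-mean output
print(rms_norm(x))    # keeps the sign and offset of the input
```

On zero-mean inputs the two coincide, which is one intuition for why dropping the mean costs little in practice.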

Q. SwiGLU vs ReLU vs GELU.

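SwiGLU (Llama, Qwen, Mistral): W3 (W1 x ⊙ silu(W2 x)), a gated linear unit with Swish/SiLU. Three matrices instead of two, so ~50% more FFN params at the same hidden width; in practice the hidden width is cut to about 2/3 (8d/3 instead of 4d) so params and FLOPs stay matched, and quality is still better. GELU was the GPT-2/3 default; ReLU is for older models / very small budgets.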
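The "~50% more FFN params" figure, and the usual fix of shrinking the hidden width to roughly 8d/3, can be checked with a few lines. A sketch that ignores biases; d_model = 4096 is an arbitrary example width.

```python
def ffn_params(d_model, d_hidden, gated):
    """Weight count of an FFN block: 2 matrices (up, down) or 3 when gated."""
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_hidden


d = 4096
dense = ffn_params(d, 4 * d, gated=False)            # classic 4d FFN
gated_same_width = ffn_params(d, 4 * d, gated=True)  # SwiGLU at the same width
gated_matched = ffn_params(d, (8 * d) // 3, gated=True)  # width cut to ~8d/3

print(gated_same_width / dense)  # 1.5
print(gated_matched / dense)     # close to 1.0
```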

D. Data

Q. How would you decide on the optimal mix of (web, code, books, math) in pretraining?

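DSIR / DoReMi: DSIR importance-resamples raw data toward a target distribution using cheap features (hashed n-grams); DoReMi trains a small proxy model with group DRO to upweight domains with the largest excess loss vs a reference model, then reuses those weights for the large run.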
Ablation: small-model sweep over weights at fixed compute; pick mix maximizing target eval.
Refresh frequently — optimal mix shifts as model size changes (small models prefer easier data; large models extract more from harder).
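Whatever method picks the weights, it helps to translate a candidate mix into tokens and epochs per domain before launching the sweep, since a heavy weight on a small domain means many repeats. A sketch with entirely hypothetical weights and corpus sizes:

```python
def plan_mix(weights, available_tokens, budget_tokens):
    """Turn mixture weights into per-domain token counts and epochs over each corpus."""
    total = sum(weights.values())
    plan = {}
    for domain, w in weights.items():
        target = budget_tokens * w / total
        plan[domain] = {
            "tokens": target,
            "epochs": target / available_tokens[domain],  # > 1 means upsampling
        }
    return plan


# Hypothetical numbers, for illustration only.
weights = {"web": 0.6, "code": 0.2, "books": 0.1, "math": 0.1}
available = {"web": 5e12, "code": 5e11, "books": 1e11, "math": 5e10}
plan = plan_mix(weights, available, budget_tokens=1e12)
for domain, row in plan.items():
    print(f"{domain:6s} tokens={row['tokens']:.2e} epochs={row['epochs']:.2f}")
```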

Q. Cleaning vs scale: when should you stop adding more data?

When the marginal utility of an additional billion tokens is less than the engineering cost to clean them. Rule of thumb: once the marginal web tokens score below ~80% of English-Wikipedia-like quality, mixing them in hurts. Better to upsample the high-quality slice.

E. Soft / Judgment Questions

Q. Your evals show your new model is +2% on benchmarks but qualitatively users say it feels worse. What do you do?

Trust the qualitative signal — benchmarks lag user perception.
Look for over-optimization on RL signal (sycophancy, verbosity, refusing borderline requests).
Run pairwise human eval (or trusted LLM-judge) on real user queries, not benchmark ones.
Specifically check: response length distribution, refusal rate, hedging language frequency.
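The three checks in the last item can be computed with a few lines over sampled responses. This is a crude sketch: the marker lists are hypothetical stand-ins (a real check would use a classifier or a much larger lexicon), and length is measured in whitespace-separated words rather than tokens.

```python
from statistics import mean, median

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")   # hypothetical phrase list
HEDGE_WORDS = {"might", "may", "possibly", "perhaps"}     # hypothetical word list


def response_stats(responses):
    """Length distribution, refusal rate, and hedging frequency for a response set."""
    words = [[w.strip(".,!?;:").lower() for w in r.split()] for r in responses]
    lengths = [len(ws) for ws in words]
    refusals = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    hedges = sum(sum(w in HEDGE_WORDS for w in ws) for ws in words)
    return {
        "mean_len": mean(lengths),
        "median_len": median(lengths),
        "refusal_rate": refusals / len(responses),
        "hedges_per_100_words": 100 * hedges / max(1, sum(lengths)),
    }


old = ["Use a dict for O(1) lookups.", "Sort the list, then binary search."]
new = [
    "I cannot help with that.",
    "It might possibly work, but perhaps you may want to consider several other options first.",
]
old_stats, new_stats = response_stats(old), response_stats(new)
print(old_stats)
print(new_stats)
```

Run it on the same prompt set for the old and new model; a shift in any of the three is a lead worth chasing before trusting the +2%.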

Q. You have 1 month and 10 GPUs to improve a chat model. What do you do?

Highest expected value, in order:

Better SFT data (collect 10k high-quality demos > more compute on bad data).
DPO on preference pairs (cheap; big quality win).
Specific eval-driven fixes (find biggest regression, target SFT it).
Distill outputs of a stronger teacher model into your model.

You probably do NOT spend the GPUs on a bigger model or longer pretraining: bad ROI vs data quality.
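For the DPO item in the list above, the per-pair loss is short enough to write from memory: -log σ(β · [(log π(chosen) - log π_ref(chosen)) - (log π(rejected) - log π_ref(rejected))]). A scalar sketch, assuming the inputs are summed sequence log-probs already computed elsewhere; β = 0.1 is a typical but arbitrary choice.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-pair DPO loss from sequence log-probs under the policy and the reference."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), written in a numerically stable form
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)   # policy == reference
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)   # policy prefers chosen more than ref does
print(at_init, improved)
```

At initialization (policy equal to reference) the loss is log 2; it falls as the policy widens the chosen-vs-rejected gap relative to the reference.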

Q. How do you know an architecture change is "worth it"?

Run at 3+ scales; check if the gain is consistent (or if it shrinks/grows with scale).
Account for FLOP cost: compare at matched FLOPs (and params); a change that just adds params is not a fair test.
Check eval generalization, not just loss.
The change must be reproducible by a teammate from your config, not just folklore.