Posts that inspired this: here, here, here, here

Background:

Consider this scenario:

You just trained a brand-new state-of-the-art model. How do you get any sense of what the model can actually do? (One salient example is in AI Control, where you may need to confidently rule out scheming.)

A pretty common way people try to answer “how smart is my model?” is to run it on MMLU.

So you loop through MMLU, prompt the model to answer each multiple-choice question, and take the percentage of questions it answered correctly.
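Concretely, the naive eval loop is just something like the sketch below (Python; `ask_model` and the question format are hypothetical placeholders for however you actually query your model and load MMLU, not any particular library's API):

```python
# Minimal sketch of the standard MMLU-style eval loop.
# `ask_model` is a hypothetical stand-in for however you query your model;
# each question is assumed to be a dict with 'question', 'choices', 'answer' (index 0-3).

def score_mmlu(ask_model, questions):
    letters = "ABCD"
    correct = 0
    for q in questions:
        prompt = (
            q["question"] + "\n"
            + "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(q["choices"]))
            + "\nAnswer:"
        )
        reply = ask_model(prompt).strip()              # e.g. "B" or "B. Paris"
        if reply[:1].upper() == letters[q["answer"]]:  # grade on the first letter
            correct += 1
    return correct / len(questions)
```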

It’s tempting to interpret this percentage as “what the model knows,” but this probably isn’t the case, because LLMs don’t answer questions the way people do. For example, the model might be:

  • simulating a student answering the question
  • simulating a teacher answering the question
  • simulating the answer key
  • simulating a helpful AI assistant answering the question

  • (but all of these categories are probably much weirder in practice; it’s far from guaranteed to be simulating categories that make this much sense to us)

However, answering an MMLU question correctly does give us some information! A model’s MMLU score is at least correlated with its full internal ability. The MMLU score trend from GPT-2 to GPT-3 to GPT-4 is very obvious, so if you’re handed GPT-n, you’ll probably make better predictions about it after running it on MMLU than before.

But the extent to which the model’s MMLU score is correlated with its full internal ability (to rephrase: what the model tends to simulate when prompted with a standard multiple-choice question, and how correlated that simulacrum’s response is with the model’s full internalized knowledge) probably depends on a ton of unknown things: how big the model is, the training setup, how it’s been fine-tuned, etc.

Like, maybe when GPT-2 sees a typo in the problem statement it’ll simulate the responses of a crappy internet forum, while GPT-4 picked up a better notion of “answering correctly” and adopted that tendency during RLHF (and GPT-2 didn’t have enough of an internalized concept of this in the first place for its tendencies to be redirected that way). This is wild, messy speculation, but it points to a larger set of properties where we have no idea how they influence what the model is actually doing when it sees an MMLU prompt.

Project Idea(s):

Idea 1: Maximum performance vs regular eval performance

  1. Aggregate a ton of benchmark scores for a ton of different models. Use base models, RLHF’d models, fine-tuned models, etc.
  2. Get the highest lower bound we can on the model’s score on a benchmark. I.e., doing anything you can to get it to zero-shot the answer as accurately as possible is in bounds, even looking at the model internals.
    Max{ score with chain of thought, adding praise to the prompt, etc. etc. etc. }
  3. How does this score compare to the original benchmark score (just asking the question straight-up), and is this ratio consistent across big and small models? Is there any trend?
  4. Repeat this across different types of evaluation tasks and see if there’s also a trend across task types (see the sketch after this list). If we can extract any clean trends between “just asking the model straight-up” and “the best possible elicited response,” maybe we can get a better understanding of how much these simple evaluations tell us about what the model can actually do.
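A rough sketch of the comparison in steps 2–3 (Python; `run_benchmark` and the strategy names are hypothetical placeholders for whatever elicitation tricks end up inside the max):

```python
# Sketch of Idea 1: compare the plain zero-shot score against the best score
# over every elicitation strategy we can think of.
# `run_benchmark(model, strategy)` is a hypothetical helper that scores one model
# on the benchmark under one prompting/elicitation strategy.

def elicitation_gap(model, run_benchmark, strategies):
    baseline = run_benchmark(model, strategy="plain")   # just ask straight-up
    best = max(run_benchmark(model, strategy=s) for s in strategies)
    return {
        "baseline": baseline,
        "best_elicited": best,
        # the quantity to track across model sizes / training setups
        "ratio": best / baseline if baseline > 0 else float("nan"),
    }

# e.g. strategies = ["chain_of_thought", "praise", "few_shot", ...]
# then plot the ratio against model scale and see whether there's a trend.
```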

Idea 2: Variance of performance

  1. Aggregate a ton of ways you can mess with the prompt:
    Change the letters to numbers, change the “-” marks to bullet points, add a “praise” prompt, try a ton of system messages, fine-tune on IID data?, and probably a lot of weirder changes.
  2. Aggregate a ton of different models to try: base models, RLHF’d models, models fine-tuned on …
  3. Run all of these models on MMLU with all of the different variations
    1. While you’re at it, see whether the most common answer across all prompt variations is more often correct. Maybe all of the biases cancel out?
  4. Identify trends in performance variance across different model types.
    Are RLHF’d models way less sensitive? Are larger models more or less sensitive? (A sketch of how to measure this follows this list.)
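A sketch of steps 3–4 (Python; `ask_model` returns the model’s graded answer, and `perturbations` is a list of functions that each rewrite a prompt in one of the ways listed above; both are hypothetical placeholders):

```python
# Sketch of Idea 2: run every prompt perturbation, look at the spread of scores,
# and check whether a majority vote over perturbations does better.

from collections import Counter
from statistics import mean, pstdev

def perturbation_stats(ask_model, questions, perturbations):
    # answers[i][j] = model's answer to question i under perturbation j
    answers = [[ask_model(p(q["prompt"])) for p in perturbations] for q in questions]

    # accuracy under each individual prompt variant
    per_variant = [
        sum(answers[i][j] == q["answer"] for i, q in enumerate(questions)) / len(questions)
        for j in range(len(perturbations))
    ]
    # accuracy of the most common answer across variants ("do the biases cancel out?")
    majority = sum(
        Counter(row).most_common(1)[0][0] == q["answer"]
        for row, q in zip(answers, questions)
    ) / len(questions)

    return {
        "mean_score": mean(per_variant),
        "score_std": pstdev(per_variant),   # sensitivity to prompt changes
        "majority_vote_score": majority,
    }
```

Comparing `score_std` across base vs. RLHF’d models (and across scales) is the trend question in step 4.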

Both of these have issues, I think, but it seems like there’s something here. Finding a better way to measure this will be a big part of the project.

Worries:

  • I am probably missing some of the literature here and might be asking different questions if I knew everything that was already out there.
  • Understanding regular model performance on multiple-choice tests is helpful for getting better metrics out of multiple-choice questions, but it doesn’t really address other types of evaluation, like open-ended task completion or fill-in-the-blank tests.
