[New Task] Add AlpacaEval LC
Great library, a light library for all the main evals was really needed!💯
I just came across this line: is there any interest in adding length-controlled AlpacaEval to lighteval? If so, I'm happy to help, e.g., if you want a minimal script that doesn't depend on alpaca_eval.
Let me know
Hi! This would be amazing, thanks for the suggestion!
Ideally we would add it as a community task for now, and once we have non-regression tests on the results, we'll move it to the extended tasks.
However, since it uses an LLM as a judge, we would first want to move the LLM-as-a-judge code that @NathanHB developed for MTBench to the metrics, and allow selecting between several judges (we will want this to be homogeneous for easier debugging).
If you are interested in this, you can start there; otherwise you can wait for us to add it, it should be integrated soon.
I saw the PR, it looks great and homogeneity definitely makes sense.
Adding AlpacaEval might require a few changes for homogenization though.
The pipeline for AlpacaEval at a high level is:
1. For each instruction, decode the model outputs and add the reference outputs.
2. Randomize the order of the model and the reference. One becomes `M` and the other `m`, but the mapping is random. This is important given that LLM judges typically prefer the last output.
3. OpenAI's GPT-4 Preview judges its preference by asking for a single token (`M` or `m`) with logprobs. Outputting only a single token decreases the eval time and the cost, and simplifies logprob decoding. Using logprobs improves statistical efficiency and alleviates decoding issues.
4. Extract the raw preference by taking the logprob of the evaluated model (say `M`) normalized by the probabilities of `M` and `m` (see the sketch after this list).
5. Control the length bias of the preference by fitting a simple GLM on all the preferences from that model. This takes seconds even on a single CPU.
6. Average all the length-controlled preferences over the AlpacaEval set to get the final LC win rate.
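To make steps 3 and 4 concrete, here is a minimal sketch (not the actual alpaca_eval code; the helper name and the logprob dict format are assumptions) of turning the judge's single-token logprobs into a raw preference:

```python
import math


def raw_preference_from_logprobs(token_logprobs: dict[str, float], model_token: str = "M") -> float:
    """Hypothetical helper: turn the judge's single-token logprobs into a raw preference.

    `token_logprobs` maps the candidate tokens ("M" and "m") to their logprobs,
    as returned by the judge with logprobs enabled. The preference for the
    evaluated model is the probability of its token normalized over both tokens.
    """
    p_model = math.exp(token_logprobs.get(model_token, float("-inf")))
    p_other = math.exp(token_logprobs.get("m" if model_token == "M" else "M", float("-inf")))
    return p_model / (p_model + p_other)


# Example: the judge put logprob -0.1 on "M" and -2.4 on "m", and the evaluated
# model was randomly mapped to "M" for this instruction.
print(raw_preference_from_logprobs({"M": -0.1, "m": -2.4}, model_token="M"))  # ~0.91
```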
I only had a quick skim through the MTBench PR, but my understanding is that steps 2, 3 and 4 would all require slight changes to JudgeOpenAI. Step 5 would require a more significant change as it requires processing all the preferences together. I'm not sure where you'd want that step.
I'm curious to hear your thoughts!
I won't have the time to do such homogenization, and in any case I guess you'd prefer choosing the right abstraction yourselves! But I'm happy to help if there's interest in supporting AlpacaEval, e.g., by writing a minimal implementation.
Thanks for detailing these steps! We'll edit the LLM as a judge metric on our side, and come back to you once we're good, we'd love to support AlpacaEval with your help :)
Hi! Thanks for your interest in lighteval!
It seems like integrating alpaca_eval would require a custom function in the `JudgeOpenAI` class, as it is not as simple as calling the judge and extracting an answer.
I opened a PR to move the extended code to the metrics.
> I only had a quick skim through the MTBench PR, but my understanding is that steps 2, 3 and 4 would all require slight changes to JudgeOpenAI. Step 5 would require a more significant change as it requires processing all the preferences together. I'm not sure where you'd want that step.
Step 5 should be easy to add. We have a system that allows plugging in functions acting on the whole corpus instead of on individual samples.
For example,
```python
mt_bench_metric = SampleLevelMetricGrouping(
    metric=["single_turn", "multi_turn"],
    higher_is_better=True,
    category=MetricCategory.GENERATIVE_MULTI_TURN,
    use_case=MetricUseCase.SUMMARIZATION,
    sample_level_fn=LlmAsJudge(
        judge_model_name="gpt-3.5-turbo",
        template_path="src/lighteval/tasks/extended/mt_bench/judge_prompts.jsonl",
    ).compute_multi_turn,
    corpus_level_fn={
        "single_turn": np.mean,
        "multi_turn": np.mean,
    },
)
```
Here, each sample is evaluated by the judge, and the whole corpus is aggregated using the mean of all samples. We could replace `np.mean` by a function doing step 5. :)
That would make a metric for Alpaca look like:
```python
alpaca_metric = SampleLevelMetric(
    metric="lc_alpaca",
    higher_is_better=True,
    category=MetricCategory.GENERATIVE,
    use_case=MetricUseCase.SUMMARIZATION,
    sample_level_fn=LlmAsJudge(
        judge_model_name="gpt-4",
        template_path="path/to/alpaca_judge_template.jsonl",
    ).compute_alpaca,
    corpus_level_fn=length_controlled_mean,
)
```
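For what it's worth, here is a rough sketch of what `length_controlled_mean` could look like, assuming the corpus-level function receives the per-sample raw preferences together with the model and baseline output lengths (the exact signature lighteval would pass is an open question, and this simplifies the actual alpaca_eval GLM, which fits continuous preferences and also uses instruction-level features):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression


def length_controlled_mean(preferences, model_lengths, baseline_lengths):
    """Hypothetical corpus-level function: a simplified take on the
    length-controlled win rate, not the official alpaca_eval GLM.

    Fits preference ~ sigmoid(theta + phi * length_diff) and reports the win
    rate with the length term zeroed out, i.e. sigmoid(theta).
    """
    prefs = np.asarray(preferences, dtype=float)
    # Standardized length difference between model and baseline outputs.
    length_diff = np.asarray(model_lengths, dtype=float) - np.asarray(baseline_lengths, dtype=float)
    length_diff = (length_diff - length_diff.mean()) / (length_diff.std() + 1e-8)

    # Binarize the soft preferences for the logistic fit (a simplification of
    # the real GLM, which regresses on the continuous preferences directly).
    y = (prefs > 0.5).astype(int)
    if y.min() == y.max():
        # Degenerate case (all wins or all losses): fall back to the plain mean.
        return float(prefs.mean())

    glm = LogisticRegression().fit(length_diff.reshape(-1, 1), y)
    theta = glm.intercept_[0]
    return float(1.0 / (1.0 + np.exp(-theta)))
```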
Great to know that there's a place for a corpus-level function; I can write a minimal `length_controlled_mean` when the time comes. Let me know if you have questions for the rest!
Hi @YannDubs ! We have now extended and merged the model-as-a-judge metrics, do you think they would work for you in their current state?
Hey @clefourrier!
So the current `JudgeOpenAI` still seems pretty specialized to MT-Bench. It makes a few assumptions that will not hold for AlpacaEval, and more generally for other LLM-as-a-judge benchmarks. For example:
- the regular expression for `__process_judge_response`
- working only with the text of the output
- `"single-math-v1-multi-turn"` for the reference prompt
1 and 2 are what we did in AlpacaEval, but we switched to logprobs of tokens as it's cheaper and gives better statistical efficiency.
Do you want different classes (say an MTBenchJudge class and an AlpacaEvalJudge class) or different parameters in the main Judge class? I can implement something minimal next weekend. But it will probably be easier if you end up writing the final abstraction that you would like to keep!
Tagging @NathanHB since he worked on it the most, but imo it would be great to have the option to pass different parameters in the main Judge class, and we'll load it with different metric definitions like the above example for mt_bench_metric vs alpaca_metric.
Hi @YannDubs! Having multiple parameters passed to the judge would be our preferred way, for example using a parameter to switch between using logprobs and text.
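For illustration, such a parameterized judge might be instantiated like this (a sketch only; `use_logprobs` and `answer_tokens` are hypothetical parameter names, not the current `LlmAsJudge`/`JudgeOpenAI` API):

```python
# Hypothetical sketch: `use_logprobs` and `answer_tokens` are illustrative
# parameter names, not the actual LlmAsJudge / JudgeOpenAI signature.
alpaca_judge = LlmAsJudge(
    judge_model_name="gpt-4",
    template_path="path/to/alpaca_judge_template.jsonl",
    use_logprobs=True,          # score from single-token logprobs instead of parsing text
    answer_tokens=("M", "m"),   # tokens whose logprobs encode the preference
)
```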
As for point 3, it is simply a matter of changing the llm_judge prompt.
Don't hesitate to tell us if you have more questions!
Can we directly evaluate AlpacaEval now?