insight-bench
Llama3-as-a-judge issues
I am having quite a bit of trouble with the Llama-3-as-a-judge pipeline. Here are two issues I've encountered:
- If Llama does not provide a valid score and the index here goes out of bounds, the current script gets stuck in an infinite loop because of the `while True` / try-except logic. Did you encounter this? I am currently just setting the score to 0 for these cases.
- More importantly, many of my LLaMA scores appear to be incorrect. The model seems especially prone to copying the example score of "7" shown in this line, which results in a large number of falsely high evaluation scores. Could you share a few sample input-score pairs from your baseline runs so I can better debug this? For instance, here is one evaluation of mine that seems very wrong:
```json
{
  "pred_insight": "The \"Dell Latitude 7490\" stands out as the only configuration item with variability in declined amounts, exhibiting a standard deviation of 2,404.49, confirming its unique pattern among the analyzed data.",
  "gt_insight": "No Correlation Between the Number of Expense Reports Submitted and Rejection Rates",
  "score": 0.7857845326994692
},
```
I believe the only modification I made to the pipeline is adding a "chat_template" field to the tokenizer_config.json of Meta-Llama-3-70B. Without that field, vLLM raises:
[serving_chat.py:251] ValueError: As of transformers v4.44, default chat template is no longer allowed, so you must provide a chat template if the tokenizer does not define one.
I simply copied the chat_template from https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct.
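Concretely, the change is just one top-level field in tokenizer_config.json; the Jinja template string itself is long, so it's elided below and should be pasted verbatim from the Llama-3.3-70B-Instruct repo:

```json
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "chat_template": "<paste the full Jinja chat template from meta-llama/Llama-3.3-70B-Instruct here>"
}
```

(All other existing fields in the file stay as-is.)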
@george1459
- re the infinite loop problem, theoretically, yes, it can happen, but only if the LLM fails to generate a valid score every single time, which I'd be very surprised to see happen (even with temp=0).
- how many datasets have you run it on? this is a limitation of the LLM not being smart enough, which we'll always have with the weaker open models. we also saw some false positives, but the average score across multiple datasets was between 0.5 and 0.6. one could argue that LLaMA tries to play it safe by not generating the extreme values (which is true), but what matters more is the macro-trend it produces. open models have come a long way since we ran our experiments in mid-2024, and i'd switch to a better model like qwen-3 or gpt-oss-120b for more reliable LLM scores.
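fwiw, the "set the score to 0" workaround above can be made systematic by bounding the retries instead of looping forever. A minimal sketch (hypothetical helper names, not the repo's actual parsing code):

```python
import re

def extract_score(text):
    """Pull the first number in [0, 10] out of the judge's reply.
    Returns None when nothing valid parses, instead of raising
    an out-of-bounds error. (Hypothetical helper.)"""
    match = re.search(r"\d+(?:\.\d+)?", text)
    if match is None:
        return None
    score = float(match.group())
    return score if 0 <= score <= 10 else None

def judge_with_retries(generate, prompt, max_retries=3, fallback=0.0):
    """Call the judge at most max_retries times rather than `while True`,
    falling back to a default score if every attempt fails to parse."""
    for _ in range(max_retries):
        score = extract_score(generate(prompt))
        if score is not None:
            return score
    return fallback

# usage with a stub judge that never emits a valid score
assert judge_with_retries(lambda p: "no score here", "prompt") == 0.0
```

this way a stubborn judge costs at most `max_retries` calls per example, and the fallback score is explicit instead of an accidental infinite loop.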