Qwen2-VL-7B MCQ accuracy on perception cannot be reproduced.
I used Qwen2-VL-7B-Instruct to evaluate the perception MCQs. The model outputs "going ahead" for most questions, which makes the accuracy ~50%, while the paper reports 59%. Did you modify the system prompt or user prompts when evaluating?
Hi @xuan-li, Thanks for your feedback. May I know what script you are using to run the inference? Are you using the same script we provide here?
One possible reason: we only input a single-view image when a specific camera is mentioned in the question (e.g., <CAM_FRONT>), which is the case for all the perception MCQs.
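Roughly, the selection logic works like the sketch below (simplified; `select_views` and `image_paths` are illustrative names, not the exact code in our script):

```python
CAMERAS = [
    "CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
    "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT",
]

def select_views(question: str, image_paths: dict) -> list:
    """Feed only the camera view tagged in the question (e.g. <CAM_FRONT>);
    fall back to all six views if no camera tag is present."""
    mentioned = [cam for cam in CAMERAS if f"<{cam}>" in question]
    if mentioned:
        return [image_paths[cam] for cam in mentioned]
    return list(image_paths.values())
```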
Yes, I used that script. I also tried the naive HF way to set up the model and got the same result.
I did use the single image as the input. Using all six views drops the accuracy to 0.4.
I see. I will double-check on my end. Thanks for reporting.
I found the issue. If I use your provided parser, I can reproduce the reported result, but I think there is a bug in the parser.
If the generated text is "b. going ahead.\n\nexplanation: the object <c1,cam_back,0.5583,0.5519> is a vehicle that is moving forward on the road, as indicated by its position relative to the lane markings and the direction it is facing.", the parser returns 'A', matched from the "a" in "is a vehicle" rather than from the leading "b.".
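I don't know the exact implementation, but a substring-based parser would fail exactly this way. Here is a minimal sketch that reproduces the symptom (`parse_choice_buggy` is my own illustrative name):

```python
def parse_choice_buggy(text: str) -> str:
    """Looks for the option letter anywhere in the text, checking 'a'
    before 'b'. The indefinite article in "is a vehicle" then matches
    " a " before the real answer "b." is ever considered."""
    text = text.lower()
    for letter in "abcd":
        if f" {letter} " in text or text.startswith(f"{letter}."):
            return letter.upper()
    return ""

answer = ("b. going ahead.\n\nexplanation: the object "
          "<c1,cam_back,0.5583,0.5519> is a vehicle that is moving forward")
print(parse_choice_buggy(answer))  # prints 'A', not 'B'
```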
That's a good finding. So it seems Qwen2-VL also almost always answers "Going ahead" no matter what the image is. I will fix the parse function to make it more robust, along the lines sketched below.
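For reference, the fix will likely only accept an option letter at the start of the response, before any explanation (a sketch, not the final code):

```python
import re

def parse_choice(text: str) -> str:
    """Match an option letter only at the start of the response
    (e.g. 'b. going ahead' -> 'B'), so a stray 'a' inside the
    explanation can no longer be mistaken for the answer."""
    m = re.match(r"\s*\(?([a-d])[.):]", text, re.IGNORECASE)
    return m.group(1).upper() if m else ""

# 'b. going ahead.\n\nexplanation: ... is a vehicle ...' -> 'B'
```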