About the baseline
Nice work. I would like to ask a question about LURE. LURE needs to mask the object during inference and then correct it. However, POPE and MME are discriminative tasks, answered with YES/NO. How do you evaluate LURE on these two datasets?
Thanks for your interest! In our experiments, we have observed that the responses of the four LVLMs to POPE questions follow the format "Yes/No, there is/isn't {object} ...". This format allows LURE to mask the object. For instance, the responses of mPLUG-Owl to some POPE questions are listed below:
The responses of LLaVA-1.5 to some POPE questions are listed below:
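To make the masking step concrete, here is a minimal sketch of how an object mentioned in a response of this format could be replaced with a placeholder token before correction. This is only an illustration under the format assumption above, not the LURE implementation; the function name, mask token, and candidate-object list are hypothetical.

```python
import re

# Illustrative placeholder token; the actual token used by LURE may differ.
MASK_TOKEN = "[MASK]"

def mask_pope_response(response: str, candidate_objects: list[str]) -> str:
    """Replace any candidate object mentioned in the response with MASK_TOKEN,
    keeping the leading "Yes/No, there is/isn't ..." structure intact."""
    masked = response
    for obj in candidate_objects:
        # \b keeps short names like "cat" from matching inside "category"
        masked = re.sub(rf"\b{re.escape(obj)}\b", MASK_TOKEN, masked,
                        flags=re.IGNORECASE)
    return masked

# Example with the response format described above
resp = "Yes, there is a snowboard in the image."
print(mask_pope_response(resp, ["snowboard"]))
# -> "Yes, there is a [MASK] in the image."
```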
However, POPE accuracy is computed from yes/no answers. So how do you judge whether the response corrected by LURE is right or wrong?