Add validation set to EvalAI
Would it be possible to add MMMU validation to EvalAI?
It'd be great to be able to compare the numbers calculated on the validation set with the ones produced by EvalAI.
Thank you! That is a good suggestion. We will consider it! Will update here later!
Thanks! The issue is that we see a consistent gap between validation and test set results, even though models were not optimized on the validation set. Multiple teams have resorted to reporting validation rather than test results in their papers; I'm guessing this is because they don't trust the test results (which they can't reproduce or validate themselves). It would be good to triage and rectify that, at least by making the validation results reproducible against the EvalAI measurement.
MMMU is a great benchmark that measures overall LLM/VLM performance. But these test/validation discrepancies (along with the common misunderstanding that only the visual part matters) cast it in a bad light.
I'd also suggest considering releasing the test set, perhaps under a separate NC license with token/password protection to avoid accidental contamination. The benefits of the test set actually being used, plus the potential to clean up and resolve this test/validation gap, could outweigh the benefits of keeping the test set in a more controlled environment.
Thank you for your feedback. The discrepancy between the validation and test sets arises from the slight differences in their distributions. In the validation set, each subject has an equal number of samples, whereas in the test set, the number of samples per subject varies.
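To illustrate the point about distributions, here is a minimal sketch (the subject names, accuracies, and sample counts are hypothetical, not the actual MMMU splits): if overall accuracy is a micro-average weighted by each subject's sample count, the same per-subject accuracies can produce different overall scores on an equal-per-subject split versus a split where counts vary.

```python
def overall_accuracy(per_subject_acc, samples_per_subject):
    """Micro-average: weight each subject's accuracy by its sample count."""
    total = sum(samples_per_subject.values())
    return sum(per_subject_acc[s] * n for s, n in samples_per_subject.items()) / total

# Hypothetical per-subject accuracies for some model (not real MMMU results).
acc = {"Art": 0.70, "Physics": 0.40, "Medicine": 0.55}

# Validation-like split: equal samples per subject.
val_counts = {"Art": 30, "Physics": 30, "Medicine": 30}
# Test-like split: sample counts vary per subject.
test_counts = {"Art": 120, "Physics": 400, "Medicine": 250}

print(f"validation-style overall: {overall_accuracy(acc, val_counts):.3f}")   # 0.550
print(f"test-style overall:       {overall_accuracy(acc, test_counts):.3f}")  # ~0.495 (Physics dominates)
```

With identical per-subject performance, the overall number shifts purely because the test-style split weights subjects differently, which is one way a validation/test gap can appear without any overfitting.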
We are also considering releasing a portion of the test set while retaining a small part to prevent contamination or overfitting. We appreciate your valuable comments and encourage you to stay tuned for further updates!