Question about the evaluation
Hi, thank you for your great work!
I tried testing the model you released, but the evaluation results differ from those obtained with the LLaVA repo.
I evaluated on the POPE benchmark.
| Model | LLaVA repo | SeVa repo |
|---|---|---|
| LLaVA-1.5 | 85.91 | 86.288 |
| SeVa-7b-diffu500 | 85.10 | 86.719 |
BTW, I noticed that the temperature used in SeVa inference is 1.0. But when I evaluate SeVa with temperature=1.0 in the LLaVA repo, I still get 85.10.
Do you have any comments on this? Thank you very much!
Hi xiaodong,
Thanks for your comments.
I suspect this comes from the versions of LLaVA and transformers.
notes:
- In our SeVa codebase, we use transformers 4.31.0.
- When evaluating with the LLaVA codebase (e.g., on MMVet), we use LLaVA 1.1.3 with transformers 4.31.0.
Hope this helps.
Best,
Thanks a lot. I will try it later!
Hi! Ke,
I have tried LLaVA 1.1.3 with transformers 4.31.0. I cannot reproduce the POPE results in the LLaVA evaluation pipeline, following their guidance (https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md#pope). However, if I use your repo (via https://github.com/Kevinz-code/SeVa/blob/main/run/eval_pope_diffu500.sh), I do get the 86.719% POPE result.
Another strange thing: I found that different transformers versions affect the Science-QA performance. With the latest transformers==4.37.2, the baseline LLaVA-v1.5 gets 69.46%; with transformers==4.31.0, it gets 67.97%. Hard to say why...
I would also like to know: in your experiments, do the evaluations on benchmarks other than POPE follow the official guidance (https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md#pope)?
Hi xiaodong,
It is true that the transformers version (4.31 vs 4.37) causes the evaluation mismatch, which we also observed during our experiments.
Our evaluation pipeline follows the guidance below.
- POPE: we follow the code of https://github.com/opendatalab/HA-DPO/ with transformers 4.31.0, since that repo provides fast POPE inference.
- All other benchmarks (e.g., MMBench, SQA, TextVQA, MMVet): we follow the official LLaVA evaluation guidance with transformers 4.31.0.
So I think the POPE performance mismatch might derive from environment differences between the two repos (HA-DPO vs. LLaVA).
Still, you could run both evaluations with the same transformers version for a fairer comparison.
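For context, POPE scoring in these pipelines ultimately boils down to keyword-matching free-form model answers against binary yes/no labels, so small environment differences can shift which answers match. The sketch below is my own minimal illustration of that style of scoring, not the exact code of either repo:

```python
# Minimal sketch of POPE-style scoring: accuracy over binary yes/no answers.
# The keyword rule below is an illustrative assumption, not the exact
# matching logic of the LLaVA or HA-DPO evaluation scripts.

def pope_accuracy(predictions, labels):
    """Score free-form answers against yes/no ground-truth labels.

    An answer counts as "yes" if it contains the word "yes"
    (case-insensitive); otherwise it counts as "no".
    """
    correct = 0
    for pred, label in zip(predictions, labels):
        pred_yes = "yes" in pred.lower()
        label_yes = label.lower() == "yes"
        if pred_yes == label_yes:
            correct += 1
    return correct / len(labels)

preds = ["Yes, there is a dog.", "No.", "Yes.", "No, I don't see one."]
golds = ["yes", "no", "no", "no"]
print(pope_accuracy(preds, golds))  # 3 of 4 match -> 0.75
```

Because the metric is sensitive to the exact generated text, any change in decoding behavior between library versions can plausibly move the final number by a fraction of a percent.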
Best,
Hi, I tried evaluating the baseline LLaVA-v1.5 under both transformers==4.37.2 and transformers==4.31.0, but I got similar results, around 69%. Do you mean changing the transformers version directly with `pip install`, or something else?
Thank you very much!
@ppalantir Hi, thanks for your message.
The transformers version difference is tied to the LLaVA version.
In our current reproduction, LLaVA is pinned to 1.1.3 (https://github.com/haotian-liu/LLaVA/tree/v1.1.3?tab=readme-ov-file) and transformers to 4.31.
If you want to use transformers 4.37, you should update LLaVA to 1.2.x (e.g., 1.2.2), which automatically upgrades transformers to 4.37.
In our experiments, the SQA performance is affected by the transformers version, while results on other benchmarks such as TextVQA stay similar.
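For reproduction, the two setups discussed in this thread could be pinned roughly as follows. This is a sketch under the assumption that installing from the repo's git tags works in your environment; the v1.2.2 tag name in particular is my guess from "1.2.x", so check the repo's releases:

```shell
# Setup A (used in the SeVa experiments): LLaVA v1.1.3 + transformers 4.31.0
pip install "git+https://github.com/haotian-liu/LLaVA.git@v1.1.3"
pip install "transformers==4.31.0"

# Setup B: LLaVA 1.2.x, whose dependencies pull in transformers 4.37
# (tag name below is an assumption; verify against the repo's releases)
pip install "git+https://github.com/haotian-liu/LLaVA.git@v1.2.2"

# Verify which transformers version is active before evaluating
python -c "import transformers; print(transformers.__version__)"
```

Running the version check before each evaluation makes it easy to attribute score differences to the library version rather than to the model.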
Best,