Question about the evaluation
Hi, thank you for your great work!
I tried testing the model you released, but the evaluation results differ from those obtained with the LLaVA repo.
I evaluated on the POPE benchmark.
| Model | LLaVA repo | SeVa repo |
|---|---|---|
| LLaVA-1.5 | 85.91 | 86.288 |
| SeVa-7b-diffu500 | 85.10 | 86.719 |
BTW, I noticed that the temperature used in SeVa inference is 1.0. But when I evaluate SeVa with temperature=1.0 in the LLaVA repo, I still get 85.10.
Do you have any comments on this? Thank you very much!
Hi xiaodong,
Thanks for your comments.
I suspect this comes from the versions of LLaVA and transformers.
notes:
- In our SeVa codebase, we use transformers 4.31.0.
- When evaluating with the LLaVA codebase (e.g., on MMVet), we use LLaVA 1.1.3 with transformers 4.31.0.
Hope this helps.
Best,
Thanks a lot. I will try it later!
Hi! Ke,
I have tried LLaVA 1.1.3 with transformers 4.31.0. I cannot reproduce the POPE results in the LLaVA evaluation pipeline, following their guidance (https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md#pope). However, if I use your repo (via https://github.com/Kevinz-code/SeVa/blob/main/run/eval_pope_diffu500.sh), I do get the 86.719% POPE result.
Another strange thing: I found that different transformers versions affect the Science-QA performance. With the latest transformers==4.37.2, the baseline LLaVA-v1.5 gets 69.46%; with transformers==4.31.0, it gets 67.97%. Hard to say why...
I would also like to know: in your experiments, do the evaluations on benchmarks other than POPE follow the official guidance (https://github.com/haotian-liu/LLaVA/blob/main/docs/Evaluation.md#pope)?
Hi xiaodong,
It is true that the transformers version (4.31 vs 4.37) causes the evaluation mismatch, which we also observed during our experiments.
Our evaluation pipeline follows the guidance below.
- POPE: we follow the code of https://github.com/opendatalab/HA-DPO/ with transformers 4.31.0, since that repo provides fast POPE inference.
- All other benchmarks (e.g., MMBench, SQA, TextVQA, MMVet): we follow the official LLaVA evaluation guidance with transformers 4.31.0.
So I think the POPE performance mismatch might derive from environment differences between the two repos (HA-DPO vs. LLaVA).
Still, you could run both evaluations with the same transformers version for a fairer comparison.
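For context, POPE scoring in these pipelines ultimately boils down to keyword-matching free-form model answers against binary yes/no labels, so small environment differences can shift which answers match. The sketch below is my own minimal illustration of that style of scoring, not the exact code of either repo:

```python
# Minimal sketch of POPE-style scoring: accuracy over binary yes/no answers.
# The keyword rule below is an illustrative assumption, not the exact
# matching logic of the LLaVA or HA-DPO evaluation scripts.

def pope_accuracy(predictions, labels):
    """Score free-form answers against yes/no ground-truth labels.

    An answer counts as "yes" if it contains the word "yes"
    (case-insensitive); otherwise it counts as "no".
    """
    correct = 0
    for pred, label in zip(predictions, labels):
        pred_yes = "yes" in pred.lower()
        label_yes = label.lower() == "yes"
        if pred_yes == label_yes:
            correct += 1
    return correct / len(labels)

preds = ["Yes, there is a dog.", "No.", "Yes.", "No, I don't see one."]
golds = ["yes", "no", "no", "no"]
print(pope_accuracy(preds, golds))  # 3 of 4 match -> 0.75
```

Because the metric is sensitive to the exact generated text, any change in decoding behavior between library versions can plausibly move the final number by a fraction of a percent.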
Best,
Hi, I tried evaluating the baseline LLaVA-v1.5 under both transformers==4.37.2 and transformers==4.31.0, but I got similar results, around 69%. Do you mean changing the transformers version directly with `pip install`, or something else?
Thank you very much!
@ppalantir Hi, thanks for your message.
The transformers version difference is tied to the LLaVA version.
In our current reproduction, LLaVA is pinned to 1.1.3 (https://github.com/haotian-liu/LLaVA/tree/v1.1.3?tab=readme-ov-file) and transformers to 4.31.
If you want to use transformers 4.37, you should update LLaVA to 1.2.x (e.g., 1.2.2), which automatically upgrades transformers to 4.37.
In our experiments, the SQA performance is affected by the transformers version, while results on other benchmarks such as TextVQA stay similar.
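For reproduction, the two setups discussed in this thread could be pinned roughly as follows. This is a sketch under the assumption that installing from the repo's git tags works in your environment; the v1.2.2 tag name in particular is my guess from "1.2.x", so check the repo's releases:

```shell
# Setup A (used in the SeVa experiments): LLaVA v1.1.3 + transformers 4.31.0
pip install "git+https://github.com/haotian-liu/LLaVA.git@v1.1.3"
pip install "transformers==4.31.0"

# Setup B: LLaVA 1.2.x, whose dependencies pull in transformers 4.37
# (tag name below is an assumption; verify against the repo's releases)
pip install "git+https://github.com/haotian-liu/LLaVA.git@v1.2.2"

# Verify which transformers version is active before evaluating
python -c "import transformers; print(transformers.__version__)"
```

Running the version check before each evaluation makes it easy to attribute score differences to the library version rather than to the model.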
Best,