
TextVQA results on LLaVA 1.5 and 1.6

kydxh opened this issue 10 months ago · 0 comments

  1. I'm confused as to why, for LLaVA 1.5, the model used is "liuhaotian/llava-v1.5-7b", but for LLaVA 1.6 it's "llava-hf/llava-v1.6-vicuna-7b-hf" instead of "liuhaotian/llava-v1.6-vicuna-7b"?
  2. I ran the code on "llava_v1.5_7b" but got a result of only 21.9, which is much lower than the official LLaVA results. Why might this happen?
  3. For "llava_next_vicuna_7b", when I use the original configuration ("llava-hf/llava-v1.6-vicuna-7b-hf"), the accuracy is about 63.9. However, when I change the model to "liuhaotian/llava-v1.6-vicuna-7b" ("llava_next_vicuna_7b": partial(LLaVA, model_path="liuhaotian/llava-v1.6-vicuna-7b"),), the accuracy suddenly drops to 25.47.
  4. I noticed that for LLaVA, the authors use OCR tokens in the inference process. But it seems that in VLMEvalKit, the OCR tokens are not used?
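For context on question 3, here is a minimal sketch of the config change I made. The `LLaVA` class below is a stand-in stub, not the real VLMEvalKit wrapper; the dict name `supported_VLM` follows VLMEvalKit's convention, but the second key is made up for illustration. My assumption is that the "-hf" checkpoints are converted for the Hugging Face Transformers integration, while the "liuhaotian/..." originals expect the upstream llava codebase's loading logic, so swapping the path under one wrapper class may not be valid:

```python
from functools import partial

# Stand-in stub for the VLMEvalKit LLaVA wrapper (illustration only).
class LLaVA:
    def __init__(self, model_path):
        self.model_path = model_path

# Config-style mapping, as in VLMEvalKit: model name -> constructor with
# the checkpoint path pre-bound via functools.partial.
supported_VLM = {
    # Original configuration (HF-converted checkpoint):
    "llava_next_vicuna_7b": partial(
        LLaVA, model_path="llava-hf/llava-v1.6-vicuna-7b-hf"
    ),
    # My modified entry (hypothetical key), pointing at the original
    # liuhaotian checkpoint instead:
    "llava_next_vicuna_7b_orig": partial(
        LLaVA, model_path="liuhaotian/llava-v1.6-vicuna-7b"
    ),
}

# Instantiating from the config resolves the bound model_path.
model = supported_VLM["llava_next_vicuna_7b"]()
print(model.model_path)  # llava-hf/llava-v1.6-vicuna-7b-hf
```

If the two checkpoint formats really do require different loading code, that would explain why only changing `model_path` tanks the accuracy, but I'd appreciate confirmation.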

I'm really confused about these questions. Looking forward to your reply.

kydxh commented Mar 16 '25