
Support LLaVA-UHD

choyakawa opened this issue 1 year ago · 1 comment

https://github.com/thunlp/LLaVA-UHD

This method appears to be on par with or better than LLaVA-NeXT (1.6), and, unlike LLaVA-NeXT at the time, the authors have open-sourced the training code for reproduction.
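For anyone scoping what support would entail: per the linked paper, LLaVA-UHD's core idea is slicing a native-resolution image into a variable grid of ViT-sized tiles chosen to match the image's shape, then compressing each slice's visual tokens. Below is a minimal C++ sketch of such a grid-selection heuristic, assuming a 336×336 ViT input; `choose_slice_grid` and its scoring function are illustrative stand-ins, not llama.cpp API and not the authors' actual code.

```cpp
// Hypothetical sketch of LLaVA-UHD-style adaptive image slicing: pick a
// grid (cols x rows) whose slice count is close to the number of
// ViT-sized tiles the image "contains" and whose aspect ratio best
// matches the input image. The scoring heuristic is illustrative only.
#include <cmath>
#include <cstdio>
#include <utility>

std::pair<int, int> choose_slice_grid(int img_w, int img_h, int patch = 336) {
    // Ideal slice count: how many patch x patch tiles fit the image area.
    const double ideal     = std::ceil((double) img_w * img_h / (patch * patch));
    const double img_ratio = (double) img_w / img_h;

    std::pair<int, int> best = {1, 1};
    double best_score = 1e30;
    // Search small grids; penalize deviation from both the ideal slice
    // count and the image's native aspect ratio.
    for (int cols = 1; cols <= 6; ++cols) {
        for (int rows = 1; rows <= 6; ++rows) {
            const double count_err = std::fabs(cols * rows - ideal);
            const double ratio_err = std::fabs(std::log(((double) cols / rows) / img_ratio));
            const double score     = count_err + ratio_err;
            if (score < best_score) {
                best_score = score;
                best = {cols, rows};
            }
        }
    }
    return best;
}

int main() {
    auto [cols, rows] = choose_slice_grid(672, 1008);
    std::printf("slice grid: %d x %d\n", cols, rows); // prints "2 x 3" here
    return 0;
}
```

Scoring the aspect-ratio mismatch on a log scale makes a 2:1 and a 1:2 mismatch cost the same, which keeps slices close to the shape the ViT was trained on.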

LLM analysis from Gemini 1.5 Pro:

| Benchmark | LLaVA-UHD-13B | LLaVA-NeXT-7B | LLaVA-NeXT-13B | LLaVA-NeXT-34B | LLaVA-1.5-13B |
|---|---|---|---|---|---|
| VQAv2 | 81.7 | 81.8 (Vicuna) / 82.2 (Mistral) | 82.8 | 83.7 | 80 |
| GQA | 65.2 | 64.2 (Vicuna) / 64.8 (Mistral) | 65.4 | 67.1 | 63.3 |
| TextVQA | 67.7 | 64.9 (Vicuna) / 65.7 (Mistral) | 67.1 | 69.5 | 61.3 |
| ScienceQA | 72 | 70.1 (Vicuna) / 72.8 (Mistral) | 73.6 | 81.8 | 71.6 |
| VizWiz | 56.1 | 57.6 (Vicuna) / 60.0 (Mistral) | 60.5 | 63.8 | 53.6 |
| MMMU (val) | 36.4 | 35.8 (Vicuna) / 35.3 (Mistral) | 36.2 | 51.1 | 36.4 |
| MMMU (test) | 33.6 | - | - | 44.7 | 33.6 |
| MME | 1535 | 1519 (Vicuna) / 1498 (Mistral) | 1575 | 1631 | 1531 |
| POPE | 89.1 | 86.5 (Vicuna) / 86.7 (Mistral) | 86.2 | 87.7 | 85.9 |

Observations:

  • LLaVA-UHD matches or outperforms LLaVA-1.5 on every metric in the table.
  • LLaVA-NeXT series shows comparable performance to LLaVA-UHD on most tasks, with slight variations depending on the specific model (Vicuna or Mistral).
  • LLaVA-NeXT-34B stands out with significantly higher performance on ScienceQA and MMMU.

Originally posted by @choyakawa in https://github.com/thunlp/LLaVA-UHD/issues/1#issuecomment-2005996645

choyakawa · Mar 19 '24 07:03

Moreover, the model can be trained efficiently in an academic setting: within 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5).

choyakawa · Mar 19 '24 07:03

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] · May 03 '24 01:05