llama.cpp
Support LLaVA-UHD
https://github.com/thunlp/LLaVA-UHD
This method appears on par with or better than LLaVA-NeXT (LLaVA 1.6); notably, the authors have open-sourced the training code for reproduction.
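For context, the core idea in LLaVA-UHD is modularized visual encoding: the native-resolution image is divided into a variable-sized grid of slices before CLIP encoding. Below is a rough, untested Python sketch of how such a grid could be chosen; the 336x336 slice size (matching CLIP-ViT-L/14-336) and the pure aspect-ratio scoring are my assumptions here, not necessarily the paper's exact criterion.

```python
import math

# Rough sketch of LLaVA-UHD's variable-sized slicing idea. Assumptions:
# 336x336 slices (to match CLIP-ViT-L/14-336) and a grid scored purely by
# aspect-ratio fit -- the paper's exact selection criterion may differ.
SLICE = 336

def choose_grid(width: int, height: int, max_slices: int = 9) -> tuple[int, int]:
    """Pick a (cols, rows) grid whose slices stay as close to square as possible."""
    # Ideal slice count: enough 336x336 tiles to cover the native resolution.
    ideal = max(1, math.ceil(width * height / (SLICE * SLICE)))
    best, best_score = (1, 1), float("inf")
    # Consider slice counts near the ideal, clamped to [1, max_slices],
    # and every factorization (cols x rows) of each candidate count.
    for n in sorted({min(max_slices, max(1, ideal + d)) for d in (-1, 0, 1)}):
        for cols in range(1, n + 1):
            if n % cols:
                continue
            rows = n // cols
            # Per-slice aspect ratio; |log| is 0 when the slice is square.
            score = abs(math.log((width / cols) / (height / rows)))
            if score < best_score:
                best, best_score = (cols, rows), score
    return best

print(choose_grid(1024, 672))  # wide image -> (3, 2): 3 columns x 2 rows
```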
LLM analysis from Gemini 1.5 Pro:
| Benchmark | LLaVA-UHD-13B | LLaVA-NeXT-7B | LLaVA-NeXT-13B | LLaVA-NeXT-34B | LLaVA-1.5-13B |
|---|---|---|---|---|---|
| VQAv2 | 81.7 | 81.8 (Vicuna) / 82.2 (Mistral) | 82.8 | 83.7 | 80 |
| GQA | 65.2 | 64.2 (Vicuna) / 64.8 (Mistral) | 65.4 | 67.1 | 63.3 |
| TextVQA | 67.7 | 64.9 (Vicuna) / 65.7 (Mistral) | 67.1 | 69.5 | 61.3 |
| ScienceQA | 72 | 70.1 (Vicuna) / 72.8 (Mistral) | 73.6 | 81.8 | 71.6 |
| VizWiz | 56.1 | 57.6 (Vicuna) / 60.0 (Mistral) | 60.5 | 63.8 | 53.6 |
| MMU (val) | 36.4 | 35.8 (Vicuna) / 35.3 (Mistral) | 36.2 | 51.1 | 36.4 |
| MMU (test) | 33.6 | - | - | 44.7 | 33.6 |
| MME | 1535 | 1519 (Vicuna) / 1498 (Mistral) | 1575 | 1631 | 1531 |
| POPE | 89.1 | 86.5 (Vicuna) / 86.7 (Mistral) | 86.2 | 87.7 | 85.9 |

Observations:
- LLaVA-UHD generally performs better than LLaVA 1.5 across all metrics.
- The LLaVA-NeXT series shows comparable performance to LLaVA-UHD on most tasks, with slight variations depending on the base LLM (Vicuna or Mistral).
- LLaVA-NeXT-34B stands out with significantly higher performance on ScienceQA and MMU tasks.
Originally posted by @choyakawa in https://github.com/thunlp/LLaVA-UHD/issues/1#issuecomment-2005996645
Moreover, the model can be trained efficiently in academic settings: about 23 hours on 8 A100 GPUs (vs. 26 hours for LLaVA-1.5).
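If support lands, usage would presumably mirror the existing LLaVA path in llama.cpp, i.e. `llava-cli` with a separate multimodal projector file; the GGUF filenames below are hypothetical placeholders, not released artifacts:

```sh
# Hypothetical invocation, assuming LLaVA-UHD would reuse the existing
# llava-cli code path; the .gguf filenames are placeholders.
./llava-cli -m llava-uhd-13b.Q4_K_M.gguf \
    --mmproj llava-uhd-mmproj-f16.gguf \
    --image input.jpg \
    -p "Describe this image in detail."
```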