
[Benchmark]: Support CharXiv Benchmark (#921)

Open · MaoSong2022 opened this issue 9 months ago · 2 comments

CharXiv is a visual question answering (VQA) benchmark specialized for the chart domain. CharXiv consists of two splits (val, test) and two modes, descriptive and reasoning. This PR completes the val split, resulting in two testable benchmarks (a loading sketch follows the list):

  • CharXiv_reasoning_val, consisting of 1000 images, each corresponding to 1 question
  • CharXiv_descriptive_val, consisting of 1000 images (shared with the reasoning mode), each corresponding to 4 questions. Modification: the original benchmark groups answers to simplify the grading process; in this PR, each line is processed individually
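For reference, here is a minimal sketch of loading the two new datasets. It assumes that build_dataset (VLMEvalKit's dataset factory) accepts the benchmark name and that the resulting dataset supports len(); the line counts follow from the description above. Treat it as an illustration, not part of the PR:

```python
from vlmeval.dataset import build_dataset

# Names come from this PR; expected sizes follow the description above:
# 1000 reasoning lines (1 question per image), 4000 descriptive lines (4 per image).
for name in ("CharXiv_reasoning_val", "CharXiv_descriptive_val"):
    dataset = build_dataset(name)  # fetches/loads the benchmark TSV
    print(name, len(dataset))
```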

Evaluation results for Qwen-VL-Plus (with qwen-plus as the judge model):

Reasoning Validation Results:

| Text-in-Chart | Number-in-Chart | Number-in-General | Text-in-General | Overall |
| --- | --- | --- | --- | --- |
| 0.214 | 0.095 | 0.066 | 0.273 | 0.158 |

Descriptive Validation Results:

| Information Extraction | Enumeration | Pattern Recognition | Counting | Compositionality | Overall |
| --- | --- | --- | --- | --- | --- |
| 0.715 | 0.537 | 0.585 | 0.690 | 0.058 | 0.606 |

References: project page

MaoSong2022 · Apr 28 '25

TSV file not ready

kennymckormick · Apr 29 '25

fixed

MaoSong2022 · Apr 30 '25

Fixed the judge model argument error:

```python
judge_model = judge_kwargs.get("model", "gpt-4o-mini")
```

Now the default judge model will be gpt-4o-mini.

MaoSong2022 · May 07 '25

@MaoSong2022

In your latest commit there is still a problem: since you did not set the model argument in judge_kwargs directly, the build_judge function still throws an error.

I have helped fix the problem (you can check the changes in my latest commit).
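For illustration, a minimal sketch of the issue and the fix; the import path and the exact call site are assumptions based on how build_judge is typically used in this repo:

```python
from vlmeval.dataset.utils import build_judge  # import path may vary by version

judge_kwargs = {}

# Problem: .get() returns a default but does not store it, so judge_kwargs
# still has no "model" key when it is expanded into build_judge below.
judge_model = judge_kwargs.get("model", "gpt-4o-mini")
assert "model" not in judge_kwargs

# Fix: write the default back into judge_kwargs before building the judge.
judge_kwargs.setdefault("model", "gpt-4o-mini")
judge = build_judge(**judge_kwargs)
```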

kennymckormick · May 08 '25

The evaluation results of GPT-4.1: upper table (descriptive), lower table (reasoning):

[image: GPT-4.1 evaluation results]

kennymckormick · May 08 '25