[Benchmark]: Support CharXiv Benchmark (#921)
The CharXiv benchmark is a visual question answering (VQA) benchmark focused on the chart domain. CharXiv consists of two splits (val, test) and two modes (descriptive and reasoning). This PR completes the val split, resulting in two testable benchmarks:
- CharXiv_reasoning_val: 1000 images, each paired with 1 question.
- CharXiv_descriptive_val: 1000 images (shared with the reasoning mode), each paired with 4 questions. Modification: the original benchmark groups the answers to simplify the grading process; in this PR, each line is processed individually (see the sketch below).
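For illustration, a minimal sketch of how grouped descriptive annotations could be expanded into one record per question; the field names (figure_id, questions, answers) are illustrative and not necessarily the actual CharXiv schema:

```python
# Hedged sketch: expand annotations that group 4 descriptive questions per image
# into one record per question, so each line can be graded individually.
# Field names below are illustrative, not the exact CharXiv/TSV columns.
def ungroup_descriptive(samples):
    records = []
    for sample in samples:
        for idx, (q, a) in enumerate(zip(sample["questions"], sample["answers"])):
            records.append({
                "index": f"{sample['figure_id']}_{idx}",  # unique per-question index
                "image": sample["image"],                  # chart image shared across questions
                "question": q,
                "answer": a,
            })
    return records
```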
Evaluated results for Qwen-VL-Plus (with qwen-plus as the judge model):

Reasoning Validation Results:
| Text-in-Chart | Number-in-Chart | Number-in-General | Text-in-General | Overall |
|---|---|---|---|---|
| 0.214 | 0.095 | 0.066 | 0.273 | 0.158 |
Descriptive Validation Results:
| Information Extraction | Enumeration | Pattern Recognition | Counting | Compositionality | Overall |
|---|---|---|---|---|---|
| 0.715 | 0.537 | 0.585 | 0.690 | 0.058 | 0.606 |
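For reference, a minimal sketch of how per-category and overall scores of this kind can be aggregated from per-question judge verdicts; the field names (category, score) are illustrative, not necessarily the columns used in this PR:

```python
# Hedged sketch: aggregate binary judge verdicts (1 = correct, 0 = wrong)
# into per-category accuracy plus an overall accuracy over all questions.
from collections import defaultdict

def aggregate(records):
    by_cat = defaultdict(list)
    for r in records:
        by_cat[r["category"]].append(r["score"])
    result = {cat: sum(scores) / len(scores) for cat, scores in by_cat.items()}
    all_scores = [s for scores in by_cat.values() for s in scores]
    result["Overall"] = sum(all_scores) / len(all_scores)
    return result
```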
References: project page
TSV file not ready
fixed
Fix the judge model argument error:
```python
judge_model = judge_kwargs.get("model", "gpt-4o-mini")
```
Now the default judge model will be gpt-4o-mini.
@MaoSong2022
In your latest commit there is still a problem: since you did not specify the model argument in judge_kwargs directly, the build_judge function still throws an error.
I have helped fix the problem (you can check the changes in my latest commit).
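For clarity, a minimal sketch of what such a fix can look like, assuming the common VLMEvalKit pattern of forwarding judge_kwargs to build_judge; the evaluate signature and import path are assumptions, not the exact code in this commit:

```python
# Hedged sketch: set the default judge model inside judge_kwargs itself, so that
# build_judge receives the "model" argument it expects.
from vlmeval.dataset.utils import build_judge  # import path assumed

def evaluate(eval_file, **judge_kwargs):
    # Put the default directly into judge_kwargs instead of a local variable,
    # so build_judge(**judge_kwargs) does not fail on a missing model.
    judge_kwargs.setdefault("model", "gpt-4o-mini")
    judge = build_judge(**judge_kwargs)
    ...
```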
The evaluation results of GPT-4.1: descriptive (upper) and reasoning (lower).