[Benchmark]: Support CharXiv Benchmark (#921)
The CharXiv benchmark is a visual question answering (VQA) benchmark focused on the chart domain. CharXiv consists of two splits (val, test) and two modes (descriptive and reasoning). This PR completes the val split, resulting in two testable benchmarks:
- CharXiv_reasoning_val: 1000 images, each paired with 1 question.
- CharXiv_descriptive_val: 1000 images (shared with the reasoning mode), each paired with 4 questions. Modification: the original benchmark groups the answers to simplify the grading process; in this PR, each line is processed individually (see the sketch below).
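For illustration, a minimal sketch of how grouped descriptive annotations could be expanded into one record per question; the field names (figure_id, questions, answers) are illustrative and not necessarily the actual CharXiv schema:

```python
# Hedged sketch: expand annotations that group 4 descriptive questions per image
# into one record per question, so each line can be graded individually.
# Field names below are illustrative, not the exact CharXiv/TSV columns.
def ungroup_descriptive(samples):
    records = []
    for sample in samples:
        for idx, (q, a) in enumerate(zip(sample["questions"], sample["answers"])):
            records.append({
                "index": f"{sample['figure_id']}_{idx}",  # unique per-question index
                "image": sample["image"],                  # chart image shared across questions
                "question": q,
                "answer": a,
            })
    return records
```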
Evaluated results for Qwen-VL-Plus (with qwen-plus as the judge model):

Reasoning Validation Results:
| Text-in-Chart | Number-in-Chart | Number-in-General | Text-in-General | Overall |
|---|---|---|---|---|
| 0.214 | 0.095 | 0.066 | 0.273 | 0.158 |
Descriptive Validation Results:
| Information Extraction | Enumeration | Pattern Recognition | Counting | Compositionality | Overall |
|---|---|---|---|---|---|
| 0.715 | 0.537 | 0.585 | 0.690 | 0.058 | 0.606 |
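For reference, a minimal sketch of how per-category and overall scores of this kind can be aggregated from per-question judge verdicts; the field names (category, score) are illustrative, not necessarily the columns used in this PR:

```python
# Hedged sketch: aggregate binary judge verdicts (1 = correct, 0 = wrong)
# into per-category accuracy plus an overall accuracy over all questions.
from collections import defaultdict

def aggregate(records):
    by_cat = defaultdict(list)
    for r in records:
        by_cat[r["category"]].append(r["score"])
    result = {cat: sum(scores) / len(scores) for cat, scores in by_cat.items()}
    all_scores = [s for scores in by_cat.values() for s in scores]
    result["Overall"] = sum(all_scores) / len(all_scores)
    return result
```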
References: project page
TSV file not ready
fixed
Fix the judge model argument error:
```python
judge_model = judge_kwargs.get("model", "gpt-4o-mini")
```
Now the default judge model will be gpt-4o-mini.
@MaoSong2022
In your latest commit there is still a problem: since you did not specify the model argument in judge_kwargs directly, the build_judge function still throws an error.
I have helped fix the problem (you can check the changes in my latest commit).
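For clarity, a minimal sketch of what such a fix can look like, assuming the common VLMEvalKit pattern of forwarding judge_kwargs to build_judge; the evaluate signature and import path are assumptions, not the exact code in this commit:

```python
# Hedged sketch: set the default judge model inside judge_kwargs itself, so that
# build_judge receives the "model" argument it expects.
from vlmeval.dataset.utils import build_judge  # import path assumed

def evaluate(eval_file, **judge_kwargs):
    # Put the default directly into judge_kwargs instead of a local variable,
    # so build_judge(**judge_kwargs) does not fail on a missing model.
    judge_kwargs.setdefault("model", "gpt-4o-mini")
    judge = build_judge(**judge_kwargs)
    ...
```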
The evaluation results of GPT-4.1: descriptive (upper) and reasoning (lower).