[Bug] Takes too much time on MATH-500 dataset evaluation
Prerequisite
- [x] I have searched Issues and Discussions but cannot get the expected help.
- [x] The bug has not been fixed in the latest version.
Type
I'm evaluating with the officially supported tasks/models/datasets.
Environment
Inference itself runs correctly in this environment.
Reproduces the problem - code/configuration sample
python run.py --datasets math_500_gen --hf-type base --hf-path /home/maoshizhuo/2025/deepseek-Qwen-1.5B --debug --max-out-len 32768
02/25 23:53:14 - OpenCompass - INFO - Loading math_500_gen: /home/maoshizhuo/2025/opencompass/opencompass/configs/./datasets/math/math_500_gen.py
02/25 23:53:14 - OpenCompass - INFO - Loading example: /home/maoshizhuo/2025/opencompass/opencompass/configs/./summarizers/example.py
02/25 23:53:14 - OpenCompass - INFO - Current exp folder: outputs/default/20250225_235314
02/25 23:53:14 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
02/25 23:53:14 - OpenCompass - INFO - Partitioned into 1 tasks.
02/25 23:53:16 - OpenCompass - WARNING - Only use 1 GPUs for total 4 available GPUs in debug mode.
02/25 23:53:16 - OpenCompass - INFO - Task [deepseek-Qwen-1.5B_hf/math-500]
02/25 23:53:33 - OpenCompass - INFO - Try to load the data from /home/maoshizhuo/.cache/opencompass/./data/math/
02/25 23:53:33 - OpenCompass - INFO - Start inferencing [deepseek-Qwen-1.5B_hf/math-500]
11%|███████████████ | 7/63 [13:49:33<118:18:44, 7605.80s/it]
Reproduces the problem - command or script
python run.py --datasets math_500_gen --hf-type base --hf-path /home/maoshizhuo/2025/deepseek-Qwen-1.5B --debug --max-out-len 32768
Reproduces the problem - error message
The evaluation takes far too long: the progress bar estimates about 131 hours to finish.
Other information
Is there any way to speed up inference? I noticed that vLLM can accelerate inference, but since it integrates quantization techniques, the resulting accuracy is not exact. I would like to get accurate results while also speeding things up. My experimental environment has 4 V100-32G GPUs. Thank you!
If your model is a chat model, try `--hf-type chat`; this will use the model's chat_template. Separately, since OC uses HF under the hood to generate, try calling the original HF generate on one example to see whether it also takes that long.
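To run that sanity check, a minimal timing harness is enough; the `hf_generate` wrapper shown in the comment is a hypothetical sketch of standard transformers usage (adapt the tokenizer/model wiring to your setup), while the harness itself is plain stdlib:

```python
import time

def time_generation(generate_fn, prompt, n_runs=1):
    """Time a generation callable on a single prompt.

    generate_fn: any function that takes a prompt string and returns text,
    e.g. a thin wrapper around transformers' model.generate() such as:

        # def hf_generate(prompt):
        #     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        #     out = model.generate(**inputs, max_new_tokens=32768)
        #     return tokenizer.decode(out[0], skip_special_tokens=True)

    Returns the last output and the best-of-n_runs wall-clock time in seconds.
    """
    timings = []
    output = None
    for _ in range(n_runs):
        start = time.perf_counter()
        output = generate_fn(prompt)
        timings.append(time.perf_counter() - start)
    return output, min(timings)

# Stand-in generator just to show the call shape; swap in hf_generate above.
out, secs = time_generation(lambda p: p.upper(), "What is 2+2?")
```

If one example already takes ~2 hours here, the bottleneck is raw HF generation rather than OpenCompass; if not, the problem is in the evaluation setup.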
Please check the predictions to see if there is a repeating pattern in the responses, and reduce `--max-out-len`. You can also remove `--debug` and use four workers.
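A quick way to screen prediction files for that kind of degenerate repetition is to test whether the tail of a response is one short chunk repeated many times, which is the usual signature of a model stuck in a generation loop. This is a stdlib sketch; the chunk-size and repeat thresholds are arbitrary choices, not OpenCompass internals:

```python
def has_repeat_pattern(text, min_chunk=8, max_chunk=64, min_repeats=10):
    """Return True if the tail of `text` consists of a short chunk
    (min_chunk..max_chunk chars) repeated at least min_repeats times."""
    for size in range(min_chunk, max_chunk + 1):
        chunk = text[-size:]
        tail = text[-size * min_repeats:]
        # Only a match if the tail is long enough and exactly periodic.
        if len(tail) == size * min_repeats and tail == chunk * min_repeats:
            return True
    return False
```

Running this over each entry in the `predictions/` JSON output will quickly show whether the 32768-token budget is being burned on loops rather than genuine reasoning.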
The best option is to use vllm or lmdeploy, because math tasks require the model to generate a long reasoning process.
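For reference, here is a sketch of an OpenCompass model config using the vLLM backend with tensor parallelism across the 4 V100s. The import path and field names follow the OpenCompass `VLLM` wrapper as I understand it, so treat them as assumptions and verify against your installed version; note that vLLM does not quantize weights unless you explicitly enable quantization, so accuracy should match the HF backend up to numerical noise:

```python
# Hypothetical OpenCompass config fragment (verify field names for your version).
from opencompass.models import VLLM

models = [
    dict(
        type=VLLM,
        abbr='deepseek-qwen-1.5b-vllm',
        path='/home/maoshizhuo/2025/deepseek-Qwen-1.5B',
        # V100 does not support bfloat16, so force float16.
        model_kwargs=dict(tensor_parallel_size=4, dtype='float16'),
        max_out_len=32768,
        max_seq_len=32768,
        batch_size=32,
        generation_kwargs=dict(temperature=0.0),
        run_cfg=dict(num_gpus=4),
    )
]
```

Some recent OpenCompass versions also expose an `--accelerator vllm` flag on run.py that converts an HF model config to the vLLM backend automatically; check `python run.py --help` to see whether your version supports it.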
Could you please help me solve this issue? https://github.com/open-compass/opencompass/issues/1929