
[Bug] The IFEval evaluation on 541 cases returns an unexpectedly low score.

Open lebronjamesking opened this issue 10 months ago • 0 comments

Prerequisite

Type

I'm evaluating with the officially supported tasks/models/datasets.

Environment

```python
from opencompass.models import VLLMwithChatTemplate

models = [
    dict(
        type=VLLMwithChatTemplate,
        abbr='qwa-32b',
        path='path to QwQ-32B',
        model_kwargs=dict(tensor_parallel_size=4),
        max_out_len=32768,
        max_seq_len=131072,
        batch_size=4,
        generation_kwargs=dict(temperature=0),
        run_cfg=dict(num_gpus=4),
    )
]
```

Reproduces the problem - code/configuration sample

```shell
export CUDA_VISIBLE_DEVICES=0,1,2,3

export CUDA_VISIBLE_DEVICES=1,2,3,6
export VLLM_WORKER_MULTIPROC_METHOD=spawn
nohup python3 run.py --models vllm_qwq_32b_preview --datasets IFEval_gen_3321a3 --debug >lmd2.log 2>&1 &
```

Reproduces the problem - command or script

```python
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import ZeroRetriever
from opencompass.openicl.icl_inferencer import GenInferencer
from opencompass.datasets import IFEvalDataset, IFEvaluator

ifeval_reader_cfg = dict(
    input_columns=['prompt'],
    output_column='reference')

ifeval_infer_cfg = dict(
    prompt_template=dict(
        type=PromptTemplate,
        template=dict(round=[
            dict(role='HUMAN', prompt='{prompt}'),
        ])),
    retriever=dict(type=ZeroRetriever),
    inferencer=dict(type=GenInferencer, max_out_len=32768))

ifeval_eval_cfg = dict(
    evaluator=dict(type=IFEvaluator),
    pred_role='BOT',
)

ifeval_datasets = [
    dict(
        abbr='IFEval',
        type=IFEvalDataset,
        path='data/ifeval/ifeval.jsonl',
        reader_cfg=ifeval_reader_cfg,
        infer_cfg=ifeval_infer_cfg,
        eval_cfg=ifeval_eval_cfg)
]
```
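For context on the four numbers IFEval reports: prompt-level accuracy counts a prompt as correct only if *every* instruction attached to it passes, while instruction-level accuracy counts each instruction independently. A minimal sketch of that aggregation (an illustration, not OpenCompass's actual `IFEvaluator` code):

```python
# Sketch of IFEval-style aggregation, assuming each prompt carries a
# list of per-instruction pass/fail booleans. Not the real evaluator.

def ifeval_metrics(results):
    """results: list of lists of bools, one inner list per prompt."""
    # Prompt-level: a prompt scores only if all its instructions pass.
    prompt_level = sum(all(r) for r in results) / len(results)
    # Instruction-level: every instruction counts on its own.
    insts = [b for r in results for b in r]
    inst_level = sum(insts) / len(insts)
    return prompt_level * 100, inst_level * 100

# Example: 2 prompts; the second fails one of its two instructions.
p, i = ifeval_metrics([[True], [True, False]])
# p is 50.0 (1 of 2 prompts fully correct), i is ~66.7 (2 of 3 instructions)
```

This is why prompt-level scores are always at or below instruction-level scores, as in the numbers reported below.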

Reproduces the problem - error message

raw format:

```
Model: qwa-32b
IFEval: {'Prompt-level-strict-accuracy': 34.011090573012936,
         'Inst-level-strict-accuracy': 48.80095923261391,
         'Prompt-level-loose-accuracy': 37.52310536044362,
         'Inst-level-loose-accuracy': 52.15827338129496}
```
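The gap between strict and loose accuracy above comes from the loose checker retrying each instruction on lightly transformed copies of the response (per the original IFEval design: removing markdown asterisks, dropping a leading or trailing line). A hedged sketch of those variants; the function name and exact transformation set here are illustrative:

```python
# Sketch of the "loose" response variants IFEval-style checking retries
# before marking an instruction as failed. Illustrative, not the real code.

def loose_variants(response: str):
    lines = response.split('\n')
    return [
        response,                       # original
        response.replace('*', ''),      # strip markdown emphasis
        '\n'.join(lines[1:]).strip(),   # drop a leading preamble line
        '\n'.join(lines[:-1]).strip(),  # drop a trailing line
    ]

vs = loose_variants('Sure, here it is:\n**THE ANSWER**')
# vs[2] drops the "Sure, here it is:" preamble line
```

A large strict/loose gap usually means the model is wrapping otherwise-correct answers in extra prose or formatting.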

Other information

Do you know where the problem is? The official QwQ-32B score on the IFEval dataset is around 83+. BTW, I'm using the IFEval dataset from https://huggingface.co/datasets/google/IFEval
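One thing worth checking: QwQ-32B is a reasoning model that typically emits a long chain of thought before its final answer, and IFEval's checkers run on the raw output, so formatting instructions (word limits, "wrap your answer in quotes", etc.) can fail on the reasoning text even when the final answer is correct. A hypothetical post-processing sketch, assuming the reasoning is wrapped in `<think>...</think>` tags (verify against your actual prediction files; the pattern and function name are assumptions):

```python
import re

# Hypothetical cleanup: drop a <think>...</think> reasoning block from a
# prediction before handing it to the IFEval checkers. Adjust the pattern
# to whatever your raw predictions actually contain.

def strip_reasoning(prediction: str) -> str:
    # Non-greedy match across newlines; leaves text unchanged if no block.
    cleaned = re.sub(r'<think>.*?</think>', '', prediction, flags=re.DOTALL)
    return cleaned.strip()

out = strip_reasoning('<think>Let me count the words...</think>\n"Final quoted answer."')
# out keeps only the final answer: '"Final quoted answer."'
```

If re-scoring the cleaned predictions recovers a score near the reported 83+, the issue is the evaluation harness scoring the chain of thought rather than the model itself.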

lebronjamesking · Mar 29 '25