[Bug] Evaluating the wikitext ppl dataset fails because there are no references to compute results against
Prerequisites
Problem type
I am evaluating with an officially supported task / model / dataset.
Environment
Clone the opencompass repository, cd into it, and install the required packages to set up the environment:
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -r requirements.txt
Reproduces the problem - code/configuration sample
I wrote eval_qwen2_7b.py and put it under the configs directory to evaluate Qwen-7B's ppl on the wikitext dataset:
from mmengine.config import read_base
from opencompass.models import HuggingFaceBaseModel

with read_base():
    from opencompass.configs.datasets.wikitext.wikitext_103_raw_ppl import wikitext_103_raw_datasets

datasets = wikitext_103_raw_datasets

models = [
    dict(
        type=HuggingFaceBaseModel,
        abbr='qwen-7b-hf',
        path='Qwen/Qwen-7B',
        max_out_len=1024,
        batch_size=32,
        run_cfg=dict(num_gpus=2),
    )
]
Reproduces the problem - command or script
Then I ran the evaluation as described in the opencompass README:
python -u run.py configs/eval_qwen2_7b.py -w outputs/qwen2_7b --debug
Reproduces the problem - error message
The log output is as follows:
python run.py configs/eval_qwen2_7b_wikitext.py -w outputs/qwen2_7b --debug
10/14 15:48:59 - OpenCompass - INFO - Current exp folder: outputs/qwen2_7b/20241014_154859
10/14 15:48:59 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
10/14 15:48:59 - OpenCompass - INFO - Partitioned into 1 tasks.
10/15 01:25:25 - OpenCompass - INFO - Partitioned into 2 tasks.
Traceback (most recent call last):
  File "/Users/lann/opencompass/eval/run.py", line 4, in <module>
    main()
  File "/Users/lann/opencompass/eval/cli/main.py", line 351, in main
    runner(tasks)
  File "/Users/lann/opencompass/opencompass/runners/base.py", line 38, in __call__
    status = self.launch(tasks)
             ^^^^^^^^^^^^^^^^^^
  File "/Users/lann/opencompass/opencompass/runners/local.py", line 131, in launch
    task.run()
  File "/Users/lann/opencompass/opencompass/tasks/openicl_eval.py", line 114, in run
    self._score()
  File "/Users/lann/opencompass/opencompass/tasks/openicl_eval.py", line 250, in _score
    result = icl_evaluator.score(**preds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lann/opencompass/opencompass/openicl/icl_evaluator/icl_hf_evaluator.py", line 70, in score
    if len(predictions) != len(references):
       ^^^^^^^^^^^^^^^
TypeError: object of type 'NoneType' has no len()
I inspected outputs/qwen2_7b/predictions/wikitext-103-raw-validation.json and found that it contains only the model's predictions, with no gold field. As a result, when the evaluator reaches if len(predictions) != len(references):, references is None, so calling len() on it raises the TypeError above and the evaluation aborts.
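As a temporary workaround, wikitext perplexity can be computed directly with transformers, bypassing the OpenCompass evaluator entirely. Below is a minimal sliding-window sketch in the style of the Hugging Face perplexity guide; the max_length and stride values are illustrative assumptions, not OpenCompass defaults.

# Standalone perplexity sketch (sliding window, per the HF perplexity guide).
# max_length/stride are illustrative choices, not OpenCompass settings.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'Qwen/Qwen-7B'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.float16, device_map='auto'
).eval()

# Concatenate the validation split into one long string and tokenize once.
text = '\n\n'.join(load_dataset('wikitext', 'wikitext-103-raw-v1', split='validation')['text'])
encodings = tokenizer(text, return_tensors='pt')

max_length, stride = 1024, 512
seq_len = encodings.input_ids.size(1)
nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # number of tokens scored in this window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask context tokens out of the loss
    with torch.no_grad():
        nlls.append(model(input_ids, labels=target_ids).loss)
    prev_end = end
    if end == seq_len:
        break

print('ppl:', torch.exp(torch.stack(nlls).mean()).item())

Averaging the per-window losses as above only approximates a token-weighted mean when the final window is shorter, but it matches the commonly cited recipe.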
Other information
No response
Hi, have you managed to solve this problem?
No, I didn't solve it; I later switched to lm-evaluation-harness for evaluation.
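For reference, a minimal lm-evaluation-harness run for wikitext perplexity looks roughly like this via its Python API (a sketch assuming lm-eval v0.4+; the model path and batch size here are just examples):

import lm_eval

# Sketch: evaluate wikitext perplexity with lm-eval's simple_evaluate.
results = lm_eval.simple_evaluate(
    model='hf',
    model_args='pretrained=Qwen/Qwen-7B,trust_remote_code=True',
    tasks=['wikitext'],
    batch_size=1,
)
print(results['results']['wikitext'])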
Hi, I've also seen people say that lm-evaluation-harness seems to give inaccurate wikitext results relative to the baseline. Have you solved that?
Also, when I use it I hit the problem below, even though my model is only 0.5B and the batch size is 1. Have you run into this? Thanks.
/miniconda3/lib/python3.12/site-packages/torch/cuda/memory.py", line 738, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
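This error usually means another process is already holding the GPU, or CUDA_VISIBLE_DEVICES points at a device the process cannot use. A minimal diagnostic sketch (plain PyTorch, nothing lm-eval-specific) to run before launching:

import os
import torch

# Show what this process can actually see; 'busy or unavailable' often means
# another process owns the GPU or the visible-device mask is wrong.
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('cuda available:', torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # bytes: (free, total)
    print(f'device {i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB')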