[Bug] Evaluating the wikitext ppl dataset fails because there are no references to compute results against
Prerequisites
Problem type
I am evaluating with an officially supported task / model / dataset.
Environment
Clone the opencompass repository, cd into it, and install the required packages to set up the environment:
git clone https://github.com/open-compass/opencompass.git
cd opencompass
pip install -r requirements.txt
Reproduces the problem - code/configuration sample
I wrote eval_qwen2_7b.py and put it under the configs directory to evaluate Qwen-7B's ppl on the wikitext dataset:
from mmengine.config import read_base
from opencompass.models import HuggingFaceBaseModel

with read_base():
    from opencompass.configs.datasets.wikitext.wikitext_103_raw_ppl import wikitext_103_raw_datasets

datasets = wikitext_103_raw_datasets

models = [
    dict(
        type=HuggingFaceBaseModel,
        abbr='qwen-7b-hf',
        path='Qwen/Qwen-7B',
        max_out_len=1024,
        batch_size=32,
        run_cfg=dict(num_gpus=2),
    )
]
Reproduces the problem - command or script
Then I ran the evaluation as described in the opencompass README:
python -u run.py configs/eval_qwen2_7b.py -w outputs/qwen2_7b --debug
Reproduces the problem - error message
The log output is as follows:
python run.py configs/eval_qwen2_7b_wikitext.py -w outputs/qwen2_7b --debug
10/14 15:48:59 - OpenCompass - INFO - Current exp folder: outputs/qwen2_7b/20241014_154859
10/14 15:48:59 - OpenCompass - WARNING - SlurmRunner is not used, so the partition argument is ignored.
10/14 15:48:59 - OpenCompass - INFO - Partitioned into 1 tasks.
10/15 01:25:25 - OpenCompass - INFO - Partitioned into 2 tasks.
Traceback (most recent call last):
  File "/Users/lann/opencompass/eval/run.py", line 4, in <module>
    main()
  File "/Users/lann/opencompass/eval/cli/main.py", line 351, in main
    runner(tasks)
  File "/Users/lann/opencompass/opencompass/runners/base.py", line 38, in __call__
    status = self.launch(tasks)
             ^^^^^^^^^^^^^^^^^^
  File "/Users/lann/opencompass/opencompass/runners/local.py", line 131, in launch
    task.run()
  File "/Users/lann/opencompass/opencompass/tasks/openicl_eval.py", line 114, in run
    self._score()
  File "/Users/lann/opencompass/opencompass/tasks/openicl_eval.py", line 250, in _score
    result = icl_evaluator.score(**preds)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/lann/opencompass/opencompass/openicl/icl_evaluator/icl_hf_evaluator.py", line 70, in score
    if len(predictions) != len(references):
       ^^^^^^^^^^^^^^^
TypeError: object of type 'NoneType' has no len()
I inspected outputs/qwen2_7b/predictions/wikitext-103-raw-validation.json and found that it contains only the model's predictions, with no gold field. As a result, when the evaluator reaches if len(predictions) != len(references):, references is None, so calling len() on it raises the TypeError above and the evaluation aborts.
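As a temporary workaround, wikitext perplexity can be computed directly with transformers, bypassing the OpenCompass evaluator entirely. Below is a minimal sliding-window sketch in the style of the Hugging Face perplexity guide; the max_length and stride values are illustrative assumptions, not OpenCompass defaults.

# Standalone perplexity sketch (sliding window, per the HF perplexity guide).
# max_length/stride are illustrative choices, not OpenCompass settings.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = 'Qwen/Qwen-7B'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.float16, device_map='auto'
).eval()

# Concatenate the validation split into one long string and tokenize once.
text = '\n\n'.join(load_dataset('wikitext', 'wikitext-103-raw-v1', split='validation')['text'])
encodings = tokenizer(text, return_tensors='pt')

max_length, stride = 1024, 512
seq_len = encodings.input_ids.size(1)
nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end  # number of tokens scored in this window
    input_ids = encodings.input_ids[:, begin:end].to(model.device)
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100  # mask context tokens out of the loss
    with torch.no_grad():
        nlls.append(model(input_ids, labels=target_ids).loss)
    prev_end = end
    if end == seq_len:
        break

print('ppl:', torch.exp(torch.stack(nlls).mean()).item())

Averaging the per-window losses as above only approximates a token-weighted mean when the final window is shorter, but it matches the commonly cited recipe.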
Other information
No response
Hi, have you managed to solve this problem?
No, I didn't solve it; I later switched to lm-evaluation-harness for evaluation.
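For reference, a minimal lm-evaluation-harness run for wikitext perplexity looks roughly like this via its Python API (a sketch assuming lm-eval v0.4+; the model path and batch size here are just examples):

import lm_eval

# Sketch: evaluate wikitext perplexity with lm-eval's simple_evaluate.
results = lm_eval.simple_evaluate(
    model='hf',
    model_args='pretrained=Qwen/Qwen-7B,trust_remote_code=True',
    tasks=['wikitext'],
    batch_size=1,
)
print(results['results']['wikitext'])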
Hi, I've also seen people say that lm-evaluation-harness seems to give inaccurate wikitext results relative to the baseline. Have you solved that?
Also, when I use it I hit the problem below, even though my model is only 0.5B and the batch size is 1. Have you run into this? Thanks.
/miniconda3/lib/python3.12/site-packages/torch/cuda/memory.py", line 738, in mem_get_info
    return torch.cuda.cudart().cudaMemGetInfo(device)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
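This error usually means another process is already holding the GPU, or CUDA_VISIBLE_DEVICES points at a device the process cannot use. A minimal diagnostic sketch (plain PyTorch, nothing lm-eval-specific) to run before launching:

import os
import torch

# Show what this process can actually see; 'busy or unavailable' often means
# another process owns the GPU or the visible-device mask is wrong.
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES'))
print('cuda available:', torch.cuda.is_available())
for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # bytes: (free, total)
    print(f'device {i}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB')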