Accuracy differs across repeated gsm8k evaluations with the same parameters
Self-check list
Before submitting an issue, please make sure you have completed the following steps:
Problem description
When running inference evaluation of Qwen3-8B with the SGLang backend + EvalScope, repeated runs of gsm8k produce slightly different accuracy results. (Note: per the configs below, the two runs actually used eval_batch_size 512 and 64 respectively.) First run:
2025-08-29 07:48:40,750 - evalscope - INFO - Args: Task config is provided with CommandLine type.
2025-08-29 07:48:42,247 - evalscope - WARNING - Output type generation is not supported for service evaluation. Using server model adapter instead.
2025-08-29 07:48:42,349 - evalscope - INFO - Dump task config to ./outputs/20250829_074840/configs/task_config_4aafbd.yaml
2025-08-29 07:48:42,352 - evalscope - INFO - {
"model": "Qwen3-8B",
"model_id": "Qwen3-8B",
"model_args": {},
"model_task": "text_generation",
"template_type": null,
"chat_template": null,
"datasets": [
"gsm8k"
],
"dataset_args": {
"gsm8k": {
"name": "gsm8k",
"dataset_id": "modelscope/gsm8k",
"model_adapter": "generation",
"output_types": [
"generation"
],
"subset_list": [
"main"
],
"metric_list": [
"AverageAccuracy"
],
"few_shot_num": 4,
"few_shot_random": false,
"train_split": null,
"eval_split": "test",
"prompt_template": "Question: {query}\nLet's think step by step\nAnswer:",
"system_prompt": null,
"query_template": null,
"pretty_name": "GSM8K",
"description": "GSM8K (Grade School Math 8K) is a dataset of grade school math problems, designed to evaluate the mathematical reasoning abilities of AI models.",
"tags": [
"Mathematics"
],
"filters": null,
"extra_params": {}
}
},
"dataset_dir": "/root/.cache/modelscope/hub/datasets",
"dataset_hub": "modelscope",
"generation_config": {
"max_tokens": 2048,
"temperature": 0.0
},
"eval_type": "service",
"eval_backend": "Native",
"eval_config": null,
"stage": "all",
"limit": null,
"eval_batch_size": 512,
"mem_cache": false,
"use_cache": null,
"work_dir": "./outputs/20250829_074840",
"outputs": null,
"ignore_errors": false,
"debug": false,
"dry_run": false,
"seed": 42,
"api_url": "http://127.0.0.1:30000/v1",
"api_key": "EMPTY",
"timeout": null,
"stream": false,
"judge_strategy": "auto",
"judge_worker_num": 1,
"judge_model_args": {},
"analysis_report": false
}
2025-08-29 07:48:42,352 - evalscope - INFO - Start evaluating on dataset modelscope/gsm8k
2025-08-29 07:48:42,352 - evalscope - INFO - Loading dataset from hub: modelscope/gsm8k
2025-08-29 07:48:42,537 - evalscope - INFO - Loading dataset: dataset_name: modelscope/gsm8k > subsets: ['main']
2025-08-29 07:48:49,503 - evalscope - INFO - Use settings: > few_shot_num: 4, > few_shot_split: None, > target_eval_split: test
Predicting(main): 100%|███████████████████████████████████| 1319/1319 [05:15<00:00, 4.17it/s]
2025-08-29 07:54:05,607 - evalscope - INFO - Dump predictions to ./outputs/20250829_074840/predictions/Qwen3-8B/gsm8k_main.jsonl.
Reviewing(main): 100%|██████████████████████████████████| 1319/1319 [00:00<00:00, 2413.41it/s]
2025-08-29 07:54:06,238 - evalscope - INFO -
modelscope/gsm8k report table:
+----------+-----------+-----------------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k | AverageAccuracy | main | 1319 | 0.906 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+
2025-08-29 07:54:06,238 - evalscope - INFO - Skipping report analysis (`analysis_report=False`).
2025-08-29 07:54:06,239 - evalscope - INFO - Dump report to: ./outputs/20250829_074840/reports/Qwen3-8B/gsm8k.json
2025-08-29 07:54:06,239 - evalscope - INFO - Evaluation finished on modelscope/gsm8k
2025-08-29 07:54:06,247 - evalscope - INFO - Overall report table:
+----------+-----------+-----------------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k | AverageAccuracy | main | 1319 | 0.906 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+
Second run:
root@yy-gpu-prefill-0:/sglang-workspace# evalscope eval --model Qwen3-8B --api-url http://127.0.0.1:30000/v1 --api-key EMPTY --eval-type service --datasets gsm8k --eval-batch-size 64
2025-08-28 13:50:47,494 - evalscope - INFO - Args: Task config is provided with CommandLine type.
2025-08-28 13:50:49,097 - evalscope - WARNING - Output type generation is not supported for service evaluation. Using server model adapter instead.
2025-08-28 13:50:49,203 - evalscope - INFO - Dump task config to ./outputs/20250828_135047/configs/task_config_311f41.yaml
2025-08-28 13:50:49,206 - evalscope - INFO - {
"model": "Qwen3-8B",
"model_id": "Qwen3-8B",
"model_args": {},
"model_task": "text_generation",
"template_type": null,
"chat_template": null,
"datasets": [
"gsm8k"
],
"dataset_args": {
"gsm8k": {
"name": "gsm8k",
"dataset_id": "modelscope/gsm8k",
"model_adapter": "generation",
"output_types": [
"generation"
],
"subset_list": [
"main"
],
"metric_list": [
"AverageAccuracy"
],
"few_shot_num": 4,
"few_shot_random": false,
"train_split": null,
"eval_split": "test",
"prompt_template": "Question: {query}\nLet's think step by step\nAnswer:",
"system_prompt": null,
"query_template": null,
"pretty_name": "GSM8K",
"description": "GSM8K (Grade School Math 8K) is a dataset of grade school math problems, designed to evaluate the mathematical reasoning abilities of AI models.",
"tags": [
"Mathematics"
],
"filters": null,
"extra_params": {}
}
},
"dataset_dir": "/root/.cache/modelscope/hub/datasets",
"dataset_hub": "modelscope",
"generation_config": {
"max_tokens": 2048,
"temperature": 0.0
},
"eval_type": "service",
"eval_backend": "Native",
"eval_config": null,
"stage": "all",
"limit": null,
"eval_batch_size": 64,
"mem_cache": false,
"use_cache": null,
"work_dir": "./outputs/20250828_135047",
"outputs": null,
"ignore_errors": false,
"debug": false,
"dry_run": false,
"seed": 42,
"api_url": "http://127.0.0.1:30000/v1",
"api_key": "EMPTY",
"timeout": null,
"stream": false,
"judge_strategy": "auto",
"judge_worker_num": 1,
"judge_model_args": {},
"analysis_report": false
}
2025-08-28 13:50:49,206 - evalscope - INFO - Start evaluating on dataset modelscope/gsm8k
2025-08-28 13:50:49,206 - evalscope - INFO - Loading dataset from hub: modelscope/gsm8k
2025-08-28 13:50:49,419 - evalscope - INFO - Loading dataset: dataset_name: modelscope/gsm8k > subsets: ['main']
Downloading [README.md]: 100%|███████████████████████████████| 406/406 [00:00<00:00, 1.40MB/s]
Downloading [README.md]: 100%|███████████████████████████| 4.14k/4.14k [00:00<00:00, 8.00MB/s]
Downloading data: 100%|██████████████████████████████████| 4.17M/4.17M [00:00<00:00, 8.73MB/s]
Downloading data: 100%|████████████████████████████████████| 750k/750k [00:00<00:00, 2.29MB/s]
Generating train split: 7473 examples [00:00, 36210.59 examples/s]
Generating test split: 1319 examples [00:00, 56779.26 examples/s]
2025-08-28 13:51:01,950 - evalscope - INFO - Use settings: > few_shot_num: 4, > few_shot_split: None, > target_eval_split: test
Predicting(main): 100%|███████████████████████████████████| 1319/1319 [08:56<00:00, 2.46it/s]
2025-08-28 13:59:58,430 - evalscope - INFO - Dump predictions to ./outputs/20250828_135047/predictions/Qwen3-8B/gsm8k_main.jsonl.
Reviewing(main): 100%|██████████████████████████████████| 1319/1319 [00:00<00:00, 2007.85it/s]
2025-08-28 13:59:59,163 - evalscope - INFO -
modelscope/gsm8k report table:
+----------+-----------+-----------------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k | AverageAccuracy | main | 1319 | 0.9075 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+
2025-08-28 13:59:59,163 - evalscope - INFO - Skipping report analysis (`analysis_report=False`).
2025-08-28 13:59:59,164 - evalscope - INFO - Dump report to: ./outputs/20250828_135047/reports/Qwen3-8B/gsm8k.json
2025-08-28 13:59:59,164 - evalscope - INFO - Evaluation finished on modelscope/gsm8k
2025-08-28 13:59:59,168 - evalscope - INFO - Overall report table:
+----------+-----------+-----------------+----------+-------+---------+---------+
| Model | Dataset | Metric | Subset | Num | Score | Cat.0 |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k | AverageAccuracy | main | 1319 | 0.9075 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+
EvalScope version (required)
v0.17.1
Tools used
- [x] Native framework
- [ ] Opencompass backend
- [ ] VLMEvalKit backend
- [ ] RAGEval backend
- [ ] Perf / model inference stress-testing tool
- [ ] Arena mode
Code or command executed
evalscope eval --model Qwen3-8B --api-url http://127.0.0.1:30000/v1 --api-key EMPTY --eval-type service --datasets gsm8k --eval-batch-size 64
Runtime environment
- OS: Ubuntu 22.04
- Python version: 3.10.12
You can inspect the results of the multiple runs and check whether the model generated the same content for the same question — even with temperature set to 0, the output can still differ between runs. Besides, the gap across runs is about 0.1% (0.906 vs. 0.9075), which I think is acceptable.
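To check whether the model actually produced different text for the same questions, the two prediction dumps (e.g. `./outputs/20250829_074840/predictions/Qwen3-8B/gsm8k_main.jsonl` and its counterpart from the second run) can be paired by sample index and diffed. A minimal sketch — the `index` and `prediction` field names are assumptions about the jsonl schema, so check one line of your dump and adjust them:

```python
import json

def diff_runs(jsonl_a, jsonl_b, index_key="index", text_key="prediction"):
    """Pair predictions from two runs by sample index; return indices
    whose generated text differs. Field names are assumptions -- inspect
    one line of your gsm8k_main.jsonl and adjust index_key/text_key.
    """
    def by_index(lines):
        records = (json.loads(line) for line in lines if line.strip())
        return {r[index_key]: r[text_key] for r in records}

    a, b = by_index(jsonl_a), by_index(jsonl_b)
    # Only indices present in both runs are comparable.
    return sorted(i for i in a.keys() & b.keys() if a[i] != b[i])

# Tiny demo with synthetic records; with real data, pass
# open(".../gsm8k_main.jsonl") for each run instead.
run1 = ['{"index": 0, "prediction": "Answer: 42"}',
        '{"index": 1, "prediction": "Answer: 7"}']
run2 = ['{"index": 1, "prediction": "Answer: 8"}',   # text differs
        '{"index": 0, "prediction": "Answer: 42"}']  # order differs, same text
print(diff_runs(run1, run2))  # -> [1]
```

If the list of differing indices is non-empty even at temperature 0, the variance comes from the serving side (e.g. batching/kernel nondeterminism), not from the scorer.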
OK, I'll check the reviewer information under outputs first. One more question: does the evaluation currently proceed in dataset index order? I see the reviewer results of the two runs are in different index orders — is that caused by batch_size?
Evaluation is done by index, but concurrent requests return in a different order, so the results only look shuffled.
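In other words, the shuffling is only in the order records get written out; sorting by index restores dataset order, and the aggregate accuracy is order-independent anyway. A quick illustration with synthetic records (not EvalScope's actual internals):

```python
import random

# Simulate concurrent completion: requests are issued in index order
# but finish in an arbitrary order, so the dump looks shuffled.
records = [{"index": i, "score": 1.0} for i in range(8)]
random.shuffle(records)  # out-of-order completion

# Re-sorting by index recovers dataset order.
ordered = sorted(records, key=lambda r: r["index"])

# The aggregate metric does not depend on record order.
acc_shuffled = sum(r["score"] for r in records) / len(records)
acc_ordered = sum(r["score"] for r in ordered) / len(ordered)
print(acc_shuffled == acc_ordered)  # -> True
```

So a reordered jsonl by itself cannot explain the score gap; only per-index differences in the generated text can.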