
Accuracy differs between gsm8k evaluation runs with identical parameters

Open Prayer3th opened this issue 5 months ago • 3 comments

Self-check list

Before submitting this issue, please make sure you have completed the following steps:

Problem description

When evaluating Qwen3-8B inference with the SGLang backend + EvalScope, the gsm8k accuracy differs across repeated runs. First run:

2025-08-29 07:48:40,750 - evalscope - INFO - Args: Task config is provided with CommandLine type.
2025-08-29 07:48:42,247 - evalscope - WARNING - Output type generation is not supported for service evaluation. Using server model adapter instead.
2025-08-29 07:48:42,349 - evalscope - INFO - Dump task config to ./outputs/20250829_074840/configs/task_config_4aafbd.yaml
2025-08-29 07:48:42,352 - evalscope - INFO - {
    "model": "Qwen3-8B",
    "model_id": "Qwen3-8B",
    "model_args": {},
    "model_task": "text_generation",
    "template_type": null,
    "chat_template": null,
    "datasets": [
        "gsm8k"
    ],
    "dataset_args": {
        "gsm8k": {
            "name": "gsm8k",
            "dataset_id": "modelscope/gsm8k",
            "model_adapter": "generation",
            "output_types": [
                "generation"
            ],
            "subset_list": [
                "main"
            ],
            "metric_list": [
                "AverageAccuracy"
            ],
            "few_shot_num": 4,
            "few_shot_random": false,
            "train_split": null,
            "eval_split": "test",
            "prompt_template": "Question: {query}\nLet's think step by step\nAnswer:",
            "system_prompt": null,
            "query_template": null,
            "pretty_name": "GSM8K",
            "description": "GSM8K (Grade School Math 8K) is a dataset of grade school math problems, designed to evaluate the mathematical reasoning abilities of AI models.",
            "tags": [
                "Mathematics"
            ],
            "filters": null,
            "extra_params": {}
        }
    },
    "dataset_dir": "/root/.cache/modelscope/hub/datasets",
    "dataset_hub": "modelscope",
    "generation_config": {
        "max_tokens": 2048,
        "temperature": 0.0
    },
    "eval_type": "service",
    "eval_backend": "Native",
    "eval_config": null,
    "stage": "all",
    "limit": null,
    "eval_batch_size": 512,
    "mem_cache": false,
    "use_cache": null,
    "work_dir": "./outputs/20250829_074840",
    "outputs": null,
    "ignore_errors": false,
    "debug": false,
    "dry_run": false,
    "seed": 42,
    "api_url": "http://127.0.0.1:30000/v1",
    "api_key": "EMPTY",
    "timeout": null,
    "stream": false,
    "judge_strategy": "auto",
    "judge_worker_num": 1,
    "judge_model_args": {},
    "analysis_report": false
}
2025-08-29 07:48:42,352 - evalscope - INFO - Start evaluating on dataset modelscope/gsm8k
2025-08-29 07:48:42,352 - evalscope - INFO - Loading dataset from hub: modelscope/gsm8k
2025-08-29 07:48:42,537 - evalscope - INFO - Loading dataset: dataset_name: modelscope/gsm8k > subsets: ['main']
2025-08-29 07:48:49,503 - evalscope - INFO - Use settings: > few_shot_num: 4, > few_shot_split: None, > target_eval_split: test
Predicting(main): 100%|███████████████████████████████████| 1319/1319 [05:15<00:00,  4.17it/s]
2025-08-29 07:54:05,607 - evalscope - INFO - Dump predictions to ./outputs/20250829_074840/predictions/Qwen3-8B/gsm8k_main.jsonl.
Reviewing(main): 100%|██████████████████████████████████| 1319/1319 [00:00<00:00, 2413.41it/s]
2025-08-29 07:54:06,238 - evalscope - INFO -
modelscope/gsm8k report table:
+----------+-----------+-----------------+----------+-------+---------+---------+
| Model    | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k     | AverageAccuracy | main     |  1319 |   0.906 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+

2025-08-29 07:54:06,238 - evalscope - INFO - Skipping report analysis (`analysis_report=False`).
2025-08-29 07:54:06,239 - evalscope - INFO - Dump report to: ./outputs/20250829_074840/reports/Qwen3-8B/gsm8k.json

2025-08-29 07:54:06,239 - evalscope - INFO - Evaluation finished on modelscope/gsm8k
2025-08-29 07:54:06,247 - evalscope - INFO - Overall report table:
+----------+-----------+-----------------+----------+-------+---------+---------+
| Model    | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k     | AverageAccuracy | main     |  1319 |   0.906 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+

Second run

root@yy-gpu-prefill-0:/sglang-workspace# evalscope eval  --model Qwen3-8B --api-url http://127.0.0.1:30000/v1 --api-key EMPTY --eval-type service --datasets gsm8k --eval-batch-size 64
2025-08-28 13:50:47,494 - evalscope - INFO - Args: Task config is provided with CommandLine type.
2025-08-28 13:50:49,097 - evalscope - WARNING - Output type generation is not supported for service evaluation. Using server model adapter instead.
2025-08-28 13:50:49,203 - evalscope - INFO - Dump task config to ./outputs/20250828_135047/configs/task_config_311f41.yaml
2025-08-28 13:50:49,206 - evalscope - INFO - {
    "model": "Qwen3-8B",
    "model_id": "Qwen3-8B",
    "model_args": {},
    "model_task": "text_generation",
    "template_type": null,
    "chat_template": null,
    "datasets": [
        "gsm8k"
    ],
    "dataset_args": {
        "gsm8k": {
            "name": "gsm8k",
            "dataset_id": "modelscope/gsm8k",
            "model_adapter": "generation",
            "output_types": [
                "generation"
            ],
            "subset_list": [
                "main"
            ],
            "metric_list": [
                "AverageAccuracy"
            ],
            "few_shot_num": 4,
            "few_shot_random": false,
            "train_split": null,
            "eval_split": "test",
            "prompt_template": "Question: {query}\nLet's think step by step\nAnswer:",
            "system_prompt": null,
            "query_template": null,
            "pretty_name": "GSM8K",
            "description": "GSM8K (Grade School Math 8K) is a dataset of grade school math problems, designed to evaluate the mathematical reasoning abilities of AI models.",
            "tags": [
                "Mathematics"
            ],
            "filters": null,
            "extra_params": {}
        }
    },
    "dataset_dir": "/root/.cache/modelscope/hub/datasets",
    "dataset_hub": "modelscope",
    "generation_config": {
        "max_tokens": 2048,
        "temperature": 0.0
    },
    "eval_type": "service",
    "eval_backend": "Native",
    "eval_config": null,
    "stage": "all",
    "limit": null,
    "eval_batch_size": 64,
    "mem_cache": false,
    "use_cache": null,
    "work_dir": "./outputs/20250828_135047",
    "outputs": null,
    "ignore_errors": false,
    "debug": false,
    "dry_run": false,
    "seed": 42,
    "api_url": "http://127.0.0.1:30000/v1",
    "api_key": "EMPTY",
    "timeout": null,
    "stream": false,
    "judge_strategy": "auto",
    "judge_worker_num": 1,
    "judge_model_args": {},
    "analysis_report": false
}
2025-08-28 13:50:49,206 - evalscope - INFO - Start evaluating on dataset modelscope/gsm8k
2025-08-28 13:50:49,206 - evalscope - INFO - Loading dataset from hub: modelscope/gsm8k
2025-08-28 13:50:49,419 - evalscope - INFO - Loading dataset: dataset_name: modelscope/gsm8k > subsets: ['main']
Downloading [README.md]: 100%|███████████████████████████████| 406/406 [00:00<00:00, 1.40MB/s]
Downloading [README.md]: 100%|███████████████████████████| 4.14k/4.14k [00:00<00:00, 8.00MB/s]
Downloading data: 100%|██████████████████████████████████| 4.17M/4.17M [00:00<00:00, 8.73MB/s]
Downloading data: 100%|████████████████████████████████████| 750k/750k [00:00<00:00, 2.29MB/s]
Generating train split: 7473 examples [00:00, 36210.59 examples/s]
Generating test split: 1319 examples [00:00, 56779.26 examples/s]
2025-08-28 13:51:01,950 - evalscope - INFO - Use settings: > few_shot_num: 4, > few_shot_split: None, > target_eval_split: test
Predicting(main): 100%|███████████████████████████████████| 1319/1319 [08:56<00:00,  2.46it/s]
2025-08-28 13:59:58,430 - evalscope - INFO - Dump predictions to ./outputs/20250828_135047/predictions/Qwen3-8B/gsm8k_main.jsonl.
Reviewing(main): 100%|██████████████████████████████████| 1319/1319 [00:00<00:00, 2007.85it/s]
2025-08-28 13:59:59,163 - evalscope - INFO -
modelscope/gsm8k report table:
+----------+-----------+-----------------+----------+-------+---------+---------+
| Model    | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k     | AverageAccuracy | main     |  1319 |  0.9075 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+

2025-08-28 13:59:59,163 - evalscope - INFO - Skipping report analysis (`analysis_report=False`).
2025-08-28 13:59:59,164 - evalscope - INFO - Dump report to: ./outputs/20250828_135047/reports/Qwen3-8B/gsm8k.json

2025-08-28 13:59:59,164 - evalscope - INFO - Evaluation finished on modelscope/gsm8k
2025-08-28 13:59:59,168 - evalscope - INFO - Overall report table:
+----------+-----------+-----------------+----------+-------+---------+---------+
| Model    | Dataset   | Metric          | Subset   |   Num |   Score | Cat.0   |
+==========+===========+=================+==========+=======+=========+=========+
| Qwen3-8B | gsm8k     | AverageAccuracy | main     |  1319 |  0.9075 | default |
+----------+-----------+-----------------+----------+-------+---------+---------+

EvalScope version (required)

v0.17.1

Tools used

  • [x] Native framework
  • [ ] Opencompass backend
  • [ ] VLMEvalKit backend
  • [ ] RAGEval backend
  • [ ] Perf / model inference stress-testing tool
  • [ ] Arena mode

Code or command executed

evalscope eval --model Qwen3-8B --api-url http://127.0.0.1:30000/v1 --api-key EMPTY --eval-type service --datasets gsm8k --eval-batch-size 64

Runtime environment

  • OS: Ubuntu 22.04
  • Python version: 3.10.12

Prayer3th avatar Aug 29 '25 08:08 Prayer3th

You can compare the results across runs and check whether the model's generated content for the same question is actually identical. Even with temperature set to 0, the output can still differ. Also, the gap between your runs is about 0.1%, which I think is acceptable.
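A quick way to do that comparison is to diff the two dumped prediction files record by record. A minimal sketch, assuming each line of `gsm8k_main.jsonl` is a JSON object with an `index` field (inspect one line of your file to confirm the actual schema before relying on it):

```python
import json
import sys

def load_by_index(path):
    # Load a predictions .jsonl keyed by its "index" field.
    # NOTE: the "index" field name is an assumption; check one line of
    # your gsm8k_main.jsonl to confirm the actual schema.
    rows = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            rec = json.loads(line)
            rows[rec["index"]] = rec
    return rows

def diff_runs(path_a, path_b):
    # Return the indices whose full prediction record differs
    # between the two runs.
    a, b = load_by_index(path_a), load_by_index(path_b)
    return sorted(i for i in a if a[i] != b.get(i))

if __name__ == "__main__" and len(sys.argv) == 3:
    # e.g. pass the two predictions/Qwen3-8B/gsm8k_main.jsonl paths
    changed = diff_runs(sys.argv[1], sys.argv[2])
    print(f"{len(changed)} records differ; first few: {changed[:5]}")
```

For scale: the 0.9075 vs 0.906 gap is 0.0015, and 0.0015 × 1319 ≈ 2, so only about two questions flipped between the runs.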

Yunnglin avatar Aug 29 '25 09:08 Yunnglin

OK, I'll check the reviewer info under outputs first. One more question: does the evaluation currently proceed in dataset index order? The index order of the reviewer results differs between my two runs; is that caused by batch_size?

Prayer3th avatar Aug 29 '25 09:08 Prayer3th

OK, I'll check the reviewer info under outputs first. One more question: does the evaluation currently proceed in dataset index order? The index order of the reviewer results differs between my two runs; is that caused by batch_size?

It is evaluated by index, but concurrent requests return in a different order, which makes the results look shuffled.
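If you want to line the two runs up side by side despite the concurrency-scrambled order, you can sort each .jsonl by index before diffing. A small sketch, again assuming the records carry an `index` field:

```python
import json

def sort_jsonl_by_index(in_path, out_path, key="index"):
    # Rewrite a review/prediction .jsonl in ascending index order so two
    # runs can be diffed line by line. The "index" key is an assumption;
    # adjust it to whatever field your files actually carry.
    with open(in_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f]
    rows.sort(key=lambda r: r[key])
    with open(out_path, "w", encoding="utf-8") as f:
        for r in rows:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")
```

After sorting both files the same way, a plain `diff` of the outputs shows exactly which questions changed.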

Yunnglin avatar Aug 29 '25 10:08 Yunnglin