[General]: How do I run all of the GAIA test cases?

Open HuangDuoYan opened this issue 6 months ago • 1 comments

The "Question" section is at the end.

Environment

  • Fedora 42 Workstation
  • Conda, Python v3.10

.env file contents:

MODEL_NAME="gpt-4o-mini"
MODEL_TYPE="OpenAI" # "OpenAI" for the GPT series, "LLAMA" for the Llama series
OPENAI_API_KEY="key" # yourapi.cn
OPENAI_ORGANIZATION="org"
API_BASE_URL="http://127.0.0.1:8079" # turn it on if using a VPN
OPENAI_BASE_URL="https://api.yourapi.cn/v1"
BING_SUBSCRIPTION_KEY="123" # the service has been discontinued; this is only a placeholder
BING_SEARCH_URL="https://api.bing.microsoft.com/v7.0/search"
WOLFRAMALPHA_APP_ID="id"
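Before a long benchmark run it can help to confirm that the .env file actually defines the keys you expect. The following is a minimal sketch, not part of OS-Copilot: `parse_env` and `missing_keys` are hypothetical helpers, the parser only understands the `KEY="value" # comment` layout shown above, and which keys count as required is an assumption.

```python
import re

# Matches lines such as: KEY="value"  # optional trailing comment
ENV_LINE = re.compile(r'^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*"([^"]*)"')


def parse_env(text: str) -> dict[str, str]:
    """Collect KEY="value" pairs; blank lines and other lines are ignored."""
    values = {}
    for line in text.splitlines():
        match = ENV_LINE.match(line)
        if match:
            values[match.group(1)] = match.group(2)
    return values


def missing_keys(values: dict[str, str], required: list[str]) -> list[str]:
    """Return the required keys that are absent or set to an empty string."""
    return [key for key in required if not values.get(key)]


# A shortened copy of the file above, used only for illustration.
SAMPLE = '''MODEL_NAME="gpt-4o-mini"
OPENAI_API_KEY="key" # yourapi.cn
OPENAI_BASE_URL="https://api.yourapi.cn/v1"
'''

env = parse_env(SAMPLE)
# Assumption: these are the keys worth checking before a run.
problems = missing_keys(env, ["MODEL_NAME", "OPENAI_API_KEY", "BING_SUBSCRIPTION_KEY"])
```

To check a real file, `parse_env(open(".env", encoding="utf-8").read())` could be used in place of `SAMPLE`.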

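Steps

First configure the proxy (via the terminal's Preferences -> Profiles), activate the conda environment (having run pip install -e . beforehand, of course), and switch to the repository root. Then you must first run: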

touch gaia_gpt4-turbo_validation_level1_results.jsonl

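I don't know why, but if this command is skipped, a "file not found" error appears later. For example, if I clone the repository again and run the script directly, this is what happens: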

Image

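Next, modify the code as described in https://github.com/OS-Copilot/OS-Copilot/issues/69 , otherwise the error No JSON data found in the string. appears. I would also like to ask a question here: why does this problem not occur on the experiment team's machines?

Then run: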
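One way to pull JSON out of a model reply without depending on where whitespace or line breaks fall around the braces is to let the standard library's `json.JSONDecoder.raw_decode` do the matching. This is only a sketch under that assumption, not the fix proposed in issue 69 and not OS-Copilot's own code; `extract_first_json` is a hypothetical helper name. It returns `None` when the reply contains no JSON at all, so the caller can decide whether to retry.

```python
import json
from typing import Any, Optional


def extract_first_json(text: str) -> Optional[Any]:
    """Return the first JSON object embedded in text, or None if there is none."""
    decoder = json.JSONDecoder()
    start = text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever follows it in the string.
            value, _end = decoder.raw_decode(text, start)
            return value
        except json.JSONDecodeError:
            # This "{" did not open valid JSON; try the next one.
            start = text.find("{", start + 1)
    return None


single_line = extract_first_json('Plan: {"task": {"name": "search"}} done')
no_json = extract_first_json("The answer is 42.")
```

Because the decoder tracks nesting itself, nested objects written on a single line are handled the same way as pretty-printed ones.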

python examples/GAIA/run_GAIA.py

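During the run, No JSON data found in the string. may still appear, but for a different reason than in https://github.com/OS-Copilot/OS-Copilot/issues/69 : here the AI considers the question too simple and either "forgets" to generate JSON or finds it "unnecessary". There is one more case, shown in the image below:

Image

In this case, both regular expressions mentioned in https://github.com/OS-Copilot/OS-Copilot/issues/69#issuecomment-3157481231 are wrong, because both assume that there is always a line break between the two braces.

What I did was rerun the script repeatedly until this result appeared: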

			 skip current run: 0
			 skip current run: 1
			 skip current run: 2
			 skip current run: 3
			 skip current run: 4
			 skip current run: 5
			 skip current run: 6
			 skip current run: 7
			 skip current run: 8
			 skip current run: 9
			 skip current run: 10
			 skip current run: 11
			 skip current run: 12
			 skip current run: 13
			 skip current run: 14
			 skip current run: 15
			 skip current run: 16
			 skip current run: 17
			 skip current run: 18
			 skip current run: 19
			 skip current run: 20
			 skip current run: 21
			 skip current run: 22
			 skip current run: 23
			 skip current run: 24
			 skip current run: 25
			 skip current run: 26
			 skip current run: 27
			 skip current run: 28
			 skip current run: 29
			 skip current run: 30
			 skip current run: 31
			 skip current run: 32
			 skip current run: 33
			 skip current run: 34
			 skip current run: 35
			 skip current run: 36
			 skip current run: 37
			 skip current run: 38
			 skip current run: 39
			 skip current run: 40
			 skip current run: 41
			 skip current run: 42
			 skip current run: 43
			 skip current run: 44
			 skip current run: 45
			 skip current run: 46
			 skip current run: 47
			 skip current run: 48
			 skip current run: 49
			 skip current run: 50
			 skip current run: 51
			 skip current run: 52
accuracy: 0.05660377358490566
incomplete: 0.5471698113207547
correct incomplete total, 3 29 53
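The reported ratios follow directly from the three counts on the last line, which can be checked with a few lines of arithmetic. This is a sketch, not the scoring code from run_GAIA.py; `summarize` is a hypothetical helper, and the extra `accuracy_on_completed` figure assumes that "incomplete" counts tasks that never produced an answer.

```python
def summarize(correct: int, incomplete: int, total: int) -> dict[str, float]:
    """Recompute the ratios printed at the end of the run from the raw counts."""
    completed = total - incomplete
    return {
        "accuracy": correct / total,
        "incomplete": incomplete / total,
        # Assumption: incomplete tasks produced no answer, so this is the
        # share of correct answers among the tasks that did finish.
        "accuracy_on_completed": correct / completed if completed else 0.0,
    }


stats = summarize(correct=3, incomplete=29, total=53)
```

With these counts, 3 of 53 tasks are correct overall, and 3 of the 24 tasks that finished are correct.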

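Reaching this point means the run has produced results, including an accuracy figure. But something is clearly off: https://huggingface.co/datasets/gaia-benchmark/GAIA says GAIA has 450+ items, and only 53 were run here. Let's set aside the low number of correct answers for now: the Bing Search API has been discontinued, so the search feature does not work and accuracy is bound to suffer, unless I learn how to use the Baidu search API (https://cloud.baidu.com/doc/AppBuilder/s/pmaxd1hvy ).

Question

So my question is: how do I run all 450+ test cases?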

HuangDuoYan, Aug 06 '25 06:08

It turns out there is no way to run them all in one go.

First run:

python examples/GAIA/run_GAIA.py --help

The output includes:

  --dataset_type DATASET_TYPE
                        Defines the type of dataset to use, either `validation` for development or `test` for testing purposes

I found that https://github.com/OS-Copilot/OS-Copilot/blob/f720af8807e49a92dda64572d2c6bc6c0ac7ee7e/examples/GAIA/run_GAIA.py#L9 restricts the script to the public validation dataset. Consider commenting that line out.

For the public dataset:

python examples/GAIA/run_GAIA.py --dataset_type validation --level 1

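The level must be run for each value from 1 to 3, and of course the touch command has to be run three times first, once per level.

For the private dataset, which I have not tried yet, the command is probably: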

python examples/GAIA/run_GAIA.py --dataset_type test --level 1
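The per-level procedure above (touch the results file, then run the script with a given level) can be sketched as a small helper. This is an illustration only: `result_file` and `prepare_runs` are hypothetical names, and the file name pattern for levels 2 and 3 and for the test dataset is an assumption extrapolated from the level 1 file name used earlier in this issue.

```python
import sys
from pathlib import Path

LEVELS = (1, 2, 3)


def result_file(dataset_type: str, level: int) -> str:
    # Assumption: every level follows the naming pattern of the level 1 file
    # gaia_gpt4-turbo_validation_level1_results.jsonl mentioned in this issue.
    return f"gaia_gpt4-turbo_{dataset_type}_level{level}_results.jsonl"


def prepare_runs(repo_root: Path, dataset_type: str = "validation") -> list[list[str]]:
    """Create the empty results files and return one command per level."""
    commands = []
    for level in LEVELS:
        # Equivalent of `touch <file>` in the repository root.
        (repo_root / result_file(dataset_type, level)).touch(exist_ok=True)
        commands.append([
            sys.executable, "examples/GAIA/run_GAIA.py",
            "--dataset_type", dataset_type,
            "--level", str(level),
        ])
    return commands
```

Each returned command could then be executed from the repository root, for example with `subprocess.run(cmd, cwd=repo_root, check=True)`. Whether `--dataset_type test` works this way is untested, as noted above.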

HuangDuoYan, Aug 06 '25 14:08