[General]: How do I run all of the GAIA test cases?
See the "Question" section at the end.
Environment
- Fedora 42 Workstation
- Conda, Python v3.10
.env file contents:
MODEL_NAME="gpt-4o-mini"
MODEL_TYPE="OpenAI" # "OpenAI" for the GPT series, "LLAMA" for the Llama series
OPENAI_API_KEY="key" # yourapi.cn
OPENAI_ORGANIZATION="org"
API_BASE_URL="http://127.0.0.1:8079" # turn it on if using a VPN
OPENAI_BASE_URL="https://api.yourapi.cn/v1"
BING_SUBSCRIPTION_KEY="123" # service discontinued; this is only a placeholder
BING_SEARCH_URL="https://api.bing.microsoft.com/v7.0/search"
WOLFRAMALPHA_APP_ID="id"
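Steps
First, configure the proxy (via the terminal's Preferences -> Profiles), activate the conda environment (having run pip install -e . beforehand, of course), and switch to the repository root. Before anything else, you must run: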
touch gaia_gpt4-turbo_validation_level1_results.jsonl
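I do not know why, but if this step is skipped, a "file not found" error appears later. For example, I hit that error after freshly cloning the repository and running the script directly.
Then modify the code as described in https://github.com/OS-Copilot/OS-Copilot/issues/69 ; otherwise the error No JSON data found in the string. appears. I also have a question here: why does this problem not occur on the experiment team's machines?
Then run: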
python examples/GAIA/run_GAIA.py
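During the run, No JSON data found in the string. may still appear, but for a different reason than in https://github.com/OS-Copilot/OS-Copilot/issues/69 : here the model considers the question too simple and either "forgets" to generate JSON or deems it "unnecessary".
There is one more case, though: the model does return JSON, but with no newline between the two braces. In that case, both regular expressions mentioned in https://github.com/OS-Copilot/OS-Copilot/issues/69#issuecomment-3157481231 are wrong, because both assume there is always a newline between the two braces.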
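One way to avoid depending on newlines at all is to skip the regular expression and let the JSON parser find the object. The following is only a minimal sketch under that assumption: the helper name extract_first_json is hypothetical and is not part of OS-Copilot, and it relies solely on the standard library's json.JSONDecoder.raw_decode.

```python
import json


def extract_first_json(text):
    """Return the first JSON object or array embedded in text, or None.

    Hypothetical helper for illustration; not OS-Copilot's own parser.
    It tries to decode at every '{' or '[' and returns the first success,
    so it works whether or not the braces are separated by newlines.
    """
    decoder = json.JSONDecoder()
    for start, ch in enumerate(text):
        if ch not in "{[":
            continue
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores any trailing text after it.
            value, _end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        return value
    return None


if __name__ == "__main__":
    reply = 'Here is the plan: {"task": {"name": "search", "args": []}} Done.'
    print(extract_first_json(reply))
```

A reply that contains no JSON at all still yields None, so the "model forgot to emit JSON" case has to be handled separately, for example by retrying.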
What I did was rerun it repeatedly until I got this result:
skip current run: 0
skip current run: 1
skip current run: 2
skip current run: 3
skip current run: 4
skip current run: 5
skip current run: 6
skip current run: 7
skip current run: 8
skip current run: 9
skip current run: 10
skip current run: 11
skip current run: 12
skip current run: 13
skip current run: 14
skip current run: 15
skip current run: 16
skip current run: 17
skip current run: 18
skip current run: 19
skip current run: 20
skip current run: 21
skip current run: 22
skip current run: 23
skip current run: 24
skip current run: 25
skip current run: 26
skip current run: 27
skip current run: 28
skip current run: 29
skip current run: 30
skip current run: 31
skip current run: 32
skip current run: 33
skip current run: 34
skip current run: 35
skip current run: 36
skip current run: 37
skip current run: 38
skip current run: 39
skip current run: 40
skip current run: 41
skip current run: 42
skip current run: 43
skip current run: 44
skip current run: 45
skip current run: 46
skip current run: 47
skip current run: 48
skip current run: 49
skip current run: 50
skip current run: 51
skip current run: 52
accuracy: 0.05660377358490566
incomplete: 0.5471698113207547
correct incomplete total, 3 29 53
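The two ratios above appear to be plain fractions of the total, which can be checked with a few lines. This is only a sanity check on the printed numbers; the variable names are mine, and the "accuracy among completed runs" figure is an extra I derived, not something the script reports.

```python
# Counts taken from the last line of the output above.
correct, incomplete, total = 3, 29, 53

accuracy = correct / total            # fraction of all 53 cases answered correctly
incomplete_rate = incomplete / total  # fraction of all 53 cases that did not finish

# Derived figure (not printed by the script): accuracy over completed cases only.
completed = total - incomplete
accuracy_completed = correct / completed

if __name__ == "__main__":
    print("accuracy:", accuracy)
    print("incomplete:", incomplete_rate)
    print("completed:", completed, "accuracy among completed:", accuracy_completed)
```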
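OK, reaching this point means the results are in, accuracy included. But something is clearly off: https://huggingface.co/datasets/gaia-benchmark/GAIA says GAIA has 450+ questions, yet only 53 were run here. Let us set aside the low number of correct answers for now: the Bing Search API has been discontinued, so the search feature is unusable and accuracy obviously suffers, unless I learn how to use the Baidu Search API (https://cloud.baidu.com/doc/AppBuilder/s/pmaxd1hvy ).
Question
So my question is: how do I run all 450+ cases?
I found that there is no way to run them all in one go.
First, run: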
python examples/GAIA/run_GAIA.py --help
The output includes:
--dataset_type DATASET_TYPE
Defines the type of dataset to use, either `validation` for development or `test` for testing purposes
I found that https://github.com/OS-Copilot/OS-Copilot/blob/f720af8807e49a92dda64572d2c6bc6c0ac7ee7e/examples/GAIA/run_GAIA.py#L9 restricts this to the public validation dataset. I am considering commenting it out.
For the public dataset:
python examples/GAIA/run_GAIA.py --dataset_type validation --level 1
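The level must be between 1 and 3, and of course you must run touch three times first, once for each level's results file.
For the private dataset, which I have not tried yet, the command is presumably: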
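Creating the three results files can be scripted. The sketch below assumes the file names for levels 2 and 3 follow the same pattern as the level 1 file above (gaia_gpt4-turbo_validation_level1_results.jsonl); I have not verified that pattern against the script, and the function names are hypothetical.

```python
from pathlib import Path


def result_file(level, dataset_type="validation", model_tag="gpt4-turbo"):
    """Build the results file name, assuming the level 1 pattern generalizes."""
    return Path(f"gaia_{model_tag}_{dataset_type}_level{level}_results.jsonl")


def ensure_result_files(root=".", levels=(1, 2, 3), dataset_type="validation"):
    """Create an empty results file per level (same effect as `touch`)."""
    created = []
    for level in levels:
        path = Path(root) / result_file(level, dataset_type)
        path.touch(exist_ok=True)  # leaves existing files and their content intact
        created.append(path)
    return created


if __name__ == "__main__":
    # Run from the repository root.
    for path in ensure_result_files():
        print("ready:", path)
```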
python examples/GAIA/run_GAIA.py --dataset_type test --level 1
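Since each invocation covers only one dataset type and one level, a small driver could loop over the combinations. This is a rough sketch, not a tested workflow: it uses only the --dataset_type and --level flags shown above, the function name gaia_commands is hypothetical, and whether the test split runs at all is still unverified.

```python
import subprocess
import sys


def gaia_commands(dataset_types=("validation",), levels=(1, 2, 3),
                  script="examples/GAIA/run_GAIA.py"):
    """Build one command line per (dataset_type, level) combination."""
    return [
        [sys.executable, script, "--dataset_type", dataset_type, "--level", str(level)]
        for dataset_type in dataset_types
        for level in levels
    ]


if __name__ == "__main__":
    # Run from the repository root, after creating the results files.
    for cmd in gaia_commands():
        print("running:", " ".join(cmd))
        # check=False: a failure at one level does not abort the remaining levels.
        subprocess.run(cmd, check=False)
```

Passing dataset_types=("validation", "test") would queue both splits, six runs in total.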