Benchmark results cannot be reproduced
I tested TransNormerLLM-385M with lm-evaluation-harness on the BoolQ benchmark, but my result does not match the one you reported. Beyond BoolQ and the 385M model, other benchmarks and models also fail to reproduce and show significantly lower results. I ran the evaluation with harness v0.4.0 (the command I used is shown below the output). Could I have made a mistake in how I measured the benchmark? Could you please share the script you used for the reported numbers?
hf (pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 4
|Tasks|Version|Filter|n-shot|Metric|Value | |Stderr|
|-----|-------|------|-----:|------|-----:|---|-----:|
|boolq|Yaml |none | 0|acc |0.4859|± |0.0087|
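For reference, the command I ran was roughly the following (task, model args, and batch size as shown in the output above):

```bash
# lm-evaluation-harness v0.4.0 CLI; model args match the header line of the output above
lm_eval --model hf \
    --model_args pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True \
    --tasks boolq \
    --batch_size 4
```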
Hello, there are some minor bugs in the current model's evaluation code, and we are currently fixing them. For now, you can work around the issue by setting the following environment variables before running the evaluation.
export do_eval=True
export use_triton=False
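For example, with the lm-evaluation-harness CLI the full run would then look roughly like this (the environment variables must be set in the same shell before launching the evaluation; model and task taken from your output above):

```bash
# workaround: force eval mode and disable the Triton kernels
export do_eval=True
export use_triton=False

# re-run the benchmark with the same settings as before
lm_eval --model hf \
    --model_args pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True \
    --tasks boolq \
    --batch_size 4
```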
If this still does not resolve the issue, please feel free to ask at any time.