TransnormerLLM

Benchmark results cannot be reproduced

Open: waneon opened this issue 2 years ago • 1 comment

I tested transnormerllm-385m on the boolq benchmark with lm-evaluation-harness (v0.4.0), but the result does not match the one you reported. This is not limited to boolq or the 385M model: other benchmarks and models also cannot be reproduced and score significantly lower. Could I have made a mistake in my measurement? Could you please share the script you used to run the benchmarks?

hf (pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True), gen_kwargs: (), limit: None, num_fewshot: None, batch_size: 4
|Tasks|Version|Filter|n-shot|Metric|Value |   |Stderr|
|-----|-------|------|-----:|------|-----:|---|-----:|
|boolq|Yaml   |none  |     0|acc   |0.4859|±  |0.0087|

waneon, Jan 24 '24 08:01

Hello, there are some minor bugs in the current model's evaluation code, and we are fixing them. For now, you can work around the issue by setting the following environment variables before running the evaluation.

export do_eval=True
export use_triton=False

If this does not resolve the issue, please feel free to ask again.
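For reference, here is a minimal sketch of how the workaround might be combined with the run reported above. It assumes the harness is installed as the `lm_eval` command (the v0.4.0 CLI) and reuses the model arguments, task, and batch size from the reported output. The `MODEL_ARGS` variable name is only illustrative, and this is not necessarily the script used to produce the published numbers.

```shell
# Workaround from this thread: set both variables before the model is loaded.
export do_eval=True
export use_triton=False

# Assumption: same model arguments as in the reported run (variable name is illustrative).
MODEL_ARGS="pretrained=OpenNLPLab/TransNormerLLM-385M,trust_remote_code=True"

# Run the harness only if the lm_eval CLI is available on this machine.
if command -v lm_eval >/dev/null 2>&1; then
    lm_eval --model hf \
        --model_args "$MODEL_ARGS" \
        --tasks boolq \
        --batch_size 4
else
    echo "lm_eval not found; install lm-evaluation-harness v0.4.0 first" >&2
fi
```

The variables are exported in the same shell session that launches the evaluation, so the harness process inherits them.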

Doraemonzzz, Jan 24 '24 08:01