Cannot reproduce the evaluation score of HellaSwag, WiC

Open rycont opened this issue 2 years ago • 0 comments

I evaluated the polyglot-ko-1.3b model on HellaSwag and WiC from KoBEST, and my results differ from those reported in the paper and in the model card on Hugging Face.

For WiC, the paper reports a 5-shot score of 0.486, but I got a macro_f1 of only 0.4541.

params	0-shot	5-shot	10-shot	50-shot
1.3B	0.489	0.486	0.506	0.487

hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8

Task	Version	Metric	Value		Stderr
kobest_wic	0	acc	0.4952	±	0.0141
		macro_f1	0.4541	±	0.0138

For HellaSwag, the paper reports a 5-shot score of 0.526, but I got a macro_f1 of only 0.3984.

params	0-shot	5-shot	10-shot	50-shot
1.3B	0.525	0.526	0.528	0.543

hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8

Task	Version	Metric	Value		Stderr
kobest_hellaswag	0	acc	0.4020	±	0.0219
		acc_norm	0.5280	±	0.0223
		macro_f1	0.3984	±	0.0218
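
Both runs above report acc and macro_f1 side by side, and the two can diverge noticeably when predictions lean toward one class. The following is a minimal sketch of that effect on made-up labels. It assumes the usual definition of macro F1 (unweighted mean of per-class F1); it is not the harness's own implementation, and the names `macro_f1` and `accuracy` are illustrative.

```python
def accuracy(gold, pred):
    """Fraction of positions where the prediction equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Made-up binary example: balanced gold labels, predictions skewed toward class 1.
gold = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
pred = [1, 1, 1, 1, 0, 1, 1, 1, 1, 1]

acc = accuracy(gold, pred)
f1 = macro_f1(gold, pred)
```

On this toy input the accuracy is 0.6 while macro F1 is lower, so it matters which of the two metrics a published table reports.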

I also found a Wandb report, Polyglot-Ko: Open-Source Korean Autoregressive Language Model, which lists a 5-shot HellaSwag score of 0.3984, the same as my result.

params	n=0	n=5	n=10	n=50
1.3B	0.4013	0.3984	0.417	0.4416

I see similar discrepancies for kakaobrain/kogpt and skt/ko-gpt-trinity-1.2B-v0.5.

kakaobrain/kogpt (note that I tested it as an Int8-quantized model):

	In the paper	In my test	In the Wandb Report
WiC	0.4313	0.3953	-
HellaSwag	0.5272	0.400	0.4
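
To make the size of the discrepancies easier to compare, here is a small sketch that tabulates the numbers quoted in this issue. All values are copied from the tables above; the dictionary layout and the helper name `gaps` are just illustrative, and the few-shot setting for kakaobrain/kogpt is not stated here, so it is left unspecified.

```python
# (model, task) -> (paper score, my macro_f1, Wandb report score or None)
# Values are copied from this issue; the polyglot-ko-1.3b rows are 5-shot.
# The few-shot setting of the kakaobrain/kogpt rows is not stated in the issue.
REPORTED = {
    ("EleutherAI/polyglot-ko-1.3b", "WiC"): (0.486, 0.4541, None),
    ("EleutherAI/polyglot-ko-1.3b", "HellaSwag"): (0.526, 0.3984, 0.3984),
    ("kakaobrain/kogpt", "WiC"): (0.4313, 0.3953, None),
    ("kakaobrain/kogpt", "HellaSwag"): (0.5272, 0.400, 0.4),
}


def gaps(reported):
    """Return one row per (model, task) with the paper-vs-reproduced gap."""
    rows = []
    for (model, task), (paper, mine, wandb) in reported.items():
        rows.append({
            "model": model,
            "task": task,
            "paper_minus_mine": round(paper - mine, 4),
            # True only when a Wandb value exists and agrees with my run.
            "matches_wandb": wandb is not None and abs(mine - wandb) < 5e-4,
        })
    return rows


rows = gaps(REPORTED)
for row in rows:
    print(row)
```

The HellaSwag gap is roughly 0.13 for both models, while the WiC gap is roughly 0.03 to 0.04, and in both HellaSwag cases my number agrees with the Wandb report rather than the paper.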

Nov 29 '23 11:11 rycont