
Cannot reproduce the evaluation scores for HellaSwag and WiC

Open rycont opened this issue 2 years ago • 0 comments

I evaluated the polyglot-ko-1.3b model on HellaSwag and WiC from KoBEST, and my results differ from both the paper and the model card on Hugging Face.

Environment

  • Few-shot examples: 5
  • Model: EleutherAI/polyglot-ko-1.3b
  • Metric: macro F1 score
  • Compute: Colab GPU instance (T4)

Here is the notebook I used for testing: https://colab.research.google.com/drive/1lyQQisuB5JzuGk72haSdxXfXP20q4YGr?usp=sharing

1. WiC

The paper reports a 5-shot score of 0.486, but I got only 0.4541.

  • The paper
| params | 0-shot | 5-shot | 10-shot | 50-shot |
|---|---|---|---|---|
| 1.3B | 0.489 | 0.486 | 0.506 | 0.487 |
  • In my test

hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| kobest_wic | 0 | acc | 0.4952 | ± 0.0141 |
| | | macro_f1 | 0.4541 | ± 0.0138 |
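
For reference, accuracy and macro F1 can diverge on the same predictions, as they do in the output above. The following is a minimal sketch of the standard macro F1 definition (unweighted mean of per-class F1) in plain Python. The toy labels are made up for illustration and are not KoBEST data, and this is not necessarily how the harness computes the metric internally.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: balanced gold labels, predictions skewed toward class 1.
gold = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
pred = [0, 1, 1, 1, 1, 1, 1, 1, 1, 0]

print(f"acc      = {accuracy(gold, pred):.4f}")
print(f"macro_f1 = {macro_f1(gold, pred):.4f}")
```

When the predictions lean toward one class, the F1 of the under-predicted class drops and pulls macro F1 below accuracy.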

2. HellaSwag

The paper reports a 5-shot score of 0.526, but I got only 0.3984.

  • In the paper
| params | 0-shot | 5-shot | 10-shot | 50-shot |
|---|---|---|---|---|
| 1.3B | 0.525 | 0.526 | 0.528 | 0.543 |
  • In my test

hf-causal-experimental (pretrained=EleutherAI/polyglot-ko-1.3b), limit: None, provide_description: False, num_fewshot: 5, batch_size: 8

| Task | Version | Metric | Value | Stderr |
|---|---|---|---|---|
| kobest_hellaswag | 0 | acc | 0.4020 | ± 0.0219 |
| | | acc_norm | 0.5280 | ± 0.0223 |
| | | macro_f1 | 0.3984 | ± 0.0218 |
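
The gap between acc and acc_norm in this output is large, so it may help to recall how the two usually differ in multiple-choice scoring. The sketch below assumes the common convention that acc picks the choice with the highest raw log-likelihood, while acc_norm first divides each log-likelihood by the character length of the choice. Whether kobest_hellaswag does exactly this is an assumption, and the choices and log-likelihood values are invented for illustration.

```python
def pick_choice(loglikelihoods, choices, normalize=False):
    """Return the index of the highest-scoring choice.

    With normalize=True, each log-likelihood is divided by the character
    length of its choice, so long endings are not penalized for length alone.
    """
    scores = [
        ll / len(choice) if normalize else ll
        for ll, choice in zip(loglikelihoods, choices)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical example: the longer ending has a lower total log-likelihood
# but a higher per-character log-likelihood.
choices = ["short ending", "a much longer and more detailed ending"]
loglikelihoods = [-12.0, -19.0]

raw_pick = pick_choice(loglikelihoods, choices)
norm_pick = pick_choice(loglikelihoods, choices, normalize=True)
print("raw pick:", raw_pick, "| length-normalized pick:", norm_pick)
```

Because the two rules can select different choices for the same item, any metric derived from the predictions depends on which rule produced them.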

I also found a Wandb report, "Polyglot-Ko: Open-Source Korean Autoregressive Language Model", which lists a 5-shot HellaSwag score of 0.3984, the same as in my test.

| params | n=0 | n=5 | n=10 | n=50 |
|---|---|---|---|---|
| 1.3B | 0.4013 | 0.3984 | 0.417 | 0.4416 |

Other models

The scores for kakaobrain/kogpt and skt/ko-gpt-trinity-1.2B-v0.5 also differ from the paper.

  • kakaobrain/kogpt (note that I tested it with an Int8-quantized model)
| Task | In the paper (FP16) | In my test (Int8) | In the Wandb Report |
|---|---|---|---|
| CoPA | 0.7287 | 0.7277 (↓0.14%) | 0.7287 |
| HellaSwag | 0.5833 | 0.4560 (↓21.82%) | 0.456 |
| BoolQ | 0.5981 | 0.6015 (↑0.56%) | - |
| WiC | 0.4775 | 0.3706 (↓22.38%) | - |
  • skt/ko-gpt-trinity-1.2B-v0.5
| Task | In the paper | In my test | In the Wandb Report |
|---|---|---|---|
| WiC | 0.4313 | 0.3953 | - |
| HellaSwag | 0.5272 | 0.400 | 0.4 |
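
The percentages in the kakaobrain/kogpt comparison are relative changes with respect to the paper's score. A minimal sketch of that arithmetic, using the scores listed above (the function name `relative_change_pct` is only illustrative):

```python
def relative_change_pct(reference, measured):
    """Relative change of `measured` with respect to `reference`, in percent."""
    return (measured - reference) / reference * 100.0


# (paper FP16 score, Int8 test score) for kakaobrain/kogpt, from the comparison above.
kogpt_scores = {
    "CoPA": (0.7287, 0.7277),
    "HellaSwag": (0.5833, 0.4560),
    "BoolQ": (0.5981, 0.6015),
    "WiC": (0.4775, 0.3706),
}

for task, (paper, mine) in kogpt_scores.items():
    print(f"{task}: {relative_change_pct(paper, mine):+.2f}%")
```

CoPA and BoolQ move by well under one percent, while HellaSwag and WiC drop by more than a fifth.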

rycont, Nov 29 '23 11:11