Show individual system scores in compare

Open ZJaume opened this issue 3 years ago • 0 comments

🚀 Feature

When running comet-compare, show the actual COMET score for each system. Is this possible?

Motivation

I usually run comet-compare to find out whether one system is better than the others with statistical significance, but I also want the COMET score of each system.

Alternatives

Show the COMET score for each system

Additional context

Right now I have to run the evaluation twice: once with comet-score and once with comet-compare.

comet-score --quiet -s data/fren/wmt14.en -r data/fren/wmt14.en -t hyps/wmt14.base-*.en
hyps/wmt14.base-bergamot.en     score: 0.7531
hyps/wmt14.base-wmt-opus.en     score: 0.7234
hyps/wmt14.base-wmt.en  score: 0.7055
comet-compare -s data/fren/wmt14.en -r data/fren/wmt14.en -t hyps/wmt14.base-*.en
==========================
x_name: hyps/wmt14.base-bergamot.en
y_name: hyps/wmt14.base-wmt-opus.en

Bootstrap Resampling Results:
x-mean: 0.7538
y-mean: 0.7241
ties (%):       0.0000
x_wins (%):     1.0000
y_wins (%):     0.0000

Paired T-Test Results:
statistic:      8.2661
p_value:        0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
hyps/wmt14.base-bergamot.en outperforms hyps/wmt14.base-wmt-opus.en.
==========================
x_name: hyps/wmt14.base-bergamot.en
y_name: hyps/wmt14.base-wmt.en

Bootstrap Resampling Results:
x-mean: 0.7538
y-mean: 0.7066
ties (%):       0.0000
x_wins (%):     1.0000
y_wins (%):     0.0000

Paired T-Test Results:
statistic:      12.2013
p_value:        0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
hyps/wmt14.base-bergamot.en outperforms hyps/wmt14.base-wmt.en.
==========================
x_name: hyps/wmt14.base-wmt-opus.en
y_name: hyps/wmt14.base-wmt.en

Bootstrap Resampling Results:
x-mean: 0.7241
y-mean: 0.7066
ties (%):       0.0000
x_wins (%):     1.0000
y_wins (%):     0.0000

Paired T-Test Results:
statistic:      6.6171
p_value:        0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
hyps/wmt14.base-wmt-opus.en outperforms hyps/wmt14.base-wmt.en.

Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y          hyps/wmt14.base-bergamot.en    hyps/wmt14.base-wmt-opus.en    hyps/wmt14.base-wmt.en
---------------------------  -----------------------------  -----------------------------  ------------------------
hyps/wmt14.base-bergamot.en                                 True                           True
hyps/wmt14.base-wmt-opus.en  False                                                         True
hyps/wmt14.base-wmt.en       False                          False
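
To make the request concrete, here is a minimal sketch of what a combined report could compute. It assumes segment-level scores for each system are already available as plain lists of floats. The names `system_score` and `paired_bootstrap` and the default values of `num_splits` and `sample_ratio` are hypothetical, chosen for illustration. This is not COMET's API and not necessarily how comet-compare is implemented.

```python
import random
from statistics import mean


def system_score(seg_scores):
    """System-level score as the plain average of segment-level scores."""
    return mean(seg_scores)


def paired_bootstrap(x_scores, y_scores, num_splits=300, sample_ratio=0.4, seed=0):
    """Paired bootstrap resampling over two aligned lists of segment scores.

    Returns the fraction of resamples won by x, won by y, and tied.
    All parameter names and defaults are illustrative assumptions.
    """
    if len(x_scores) != len(y_scores):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(x_scores)
    k = max(1, int(n * sample_ratio))  # size of each resample
    x_wins = y_wins = ties = 0
    for _ in range(num_splits):
        # Draw the same segment indices (with replacement) for both systems.
        idx = [rng.randrange(n) for _ in range(k)]
        x_mean = mean(x_scores[i] for i in idx)
        y_mean = mean(y_scores[i] for i in idx)
        if x_mean > y_mean:
            x_wins += 1
        elif y_mean > x_mean:
            y_wins += 1
        else:
            ties += 1
    return {
        "x_wins": x_wins / num_splits,
        "y_wins": y_wins / num_splits,
        "ties": ties / num_splits,
    }


if __name__ == "__main__":
    # Toy segment scores, not real COMET output.
    x = [0.81, 0.74, 0.69, 0.77, 0.80, 0.72]
    y = [0.78, 0.70, 0.71, 0.73, 0.75, 0.70]
    print("x score:", round(system_score(x), 4))
    print("y score:", round(system_score(y), 4))
    print(paired_bootstrap(x, y))
```

Printing `system_score` for every system next to the pairwise results would give both pieces of information in a single run.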

ZJaume · Aug 18 '22 15:08