COMET
Show individual system scores in compare
🚀 Feature
When running comet-compare, also show the actual COMET score for each system. Is this possible?
Motivation
I usually run comet-compare to find out whether one system is better than the others by a statistically significant margin, but I also want the COMET score for each system.
Alternatives
Show the COMET score for each system
Additional context
Right now I have to run the evaluation twice: once with comet-score and once with comet-compare.
comet-score --quiet -s data/fren/wmt14.en -r data/fren/wmt14.en -t hyps/wmt14.base-*.en
hyps/wmt14.base-bergamot.en score: 0.7531
hyps/wmt14.base-wmt-opus.en score: 0.7234
hyps/wmt14.base-wmt.en score: 0.7055
comet-compare -s data/fren/wmt14.en -r data/fren/wmt14.en -t hyps/wmt14.base-*.en
==========================
x_name: hyps/wmt14.base-bergamot.en
y_name: hyps/wmt14.base-wmt-opus.en
Bootstrap Resampling Results:
x-mean: 0.7538
y-mean: 0.7241
ties (%): 0.0000
x_wins (%): 1.0000
y_wins (%): 0.0000
Paired T-Test Results:
statistic: 8.2661
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
hyps/wmt14.base-bergamot.en outperforms hyps/wmt14.base-wmt-opus.en.
==========================
x_name: hyps/wmt14.base-bergamot.en
y_name: hyps/wmt14.base-wmt.en
Bootstrap Resampling Results:
x-mean: 0.7538
y-mean: 0.7066
ties (%): 0.0000
x_wins (%): 1.0000
y_wins (%): 0.0000
Paired T-Test Results:
statistic: 12.2013
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
hyps/wmt14.base-bergamot.en outperforms hyps/wmt14.base-wmt.en.
==========================
x_name: hyps/wmt14.base-wmt-opus.en
y_name: hyps/wmt14.base-wmt.en
Bootstrap Resampling Results:
x-mean: 0.7241
y-mean: 0.7066
ties (%): 0.0000
x_wins (%): 1.0000
y_wins (%): 0.0000
Paired T-Test Results:
statistic: 6.6171
p_value: 0.0000
Null hypothesis rejected according to t-test.
Scores differ significantly across samples.
hyps/wmt14.base-wmt-opus.en outperforms hyps/wmt14.base-wmt.en.
Summary
If system_x is better than system_y then:
Null hypothesis rejected according to t-test with p_value=0.05.
Scores differ significantly across samples.
system_x \ system_y          hyps/wmt14.base-bergamot.en    hyps/wmt14.base-wmt-opus.en    hyps/wmt14.base-wmt.en
---------------------------  -----------------------------  -----------------------------  ------------------------
hyps/wmt14.base-bergamot.en                                 True                           True
hyps/wmt14.base-wmt-opus.en  False                                                         True
hyps/wmt14.base-wmt.en       False                          False
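
To make the request concrete, here is a rough sketch of the kind of combined report I have in mind: one score line per system, followed by the pairwise comparison. It is not how comet-compare is implemented. It uses only the Python standard library, the names system_means, paired_bootstrap and report are hypothetical, the segment-level scores are made-up toy values, and the default of 300 resamples is an arbitrary choice.

```python
import random
from itertools import combinations
from statistics import mean


def system_means(seg_scores):
    """Return the mean segment-level score for each system."""
    return {name: mean(scores) for name, scores in seg_scores.items()}


def paired_bootstrap(x, y, num_samples=300, seed=12):
    """Paired bootstrap resampling over segments (illustrative only).

    Returns the fractions of resamples in which x wins, y wins, or they tie.
    """
    assert len(x) == len(y), "both systems must score the same segments"
    rng = random.Random(seed)
    n = len(x)
    x_wins = y_wins = ties = 0
    for _ in range(num_samples):
        # Draw the same segment indices for both systems (paired resampling).
        idx = [rng.randrange(n) for _ in range(n)]
        mean_x = mean(x[i] for i in idx)
        mean_y = mean(y[i] for i in idx)
        if mean_x > mean_y:
            x_wins += 1
        elif mean_y > mean_x:
            y_wins += 1
        else:
            ties += 1
    return x_wins / num_samples, y_wins / num_samples, ties / num_samples


def report(seg_scores):
    """Per-system score lines first, then one line per system pair."""
    lines = [
        f"{name} score: {score:.4f}"
        for name, score in system_means(seg_scores).items()
    ]
    for x_name, y_name in combinations(seg_scores, 2):
        xw, yw, t = paired_bootstrap(seg_scores[x_name], seg_scores[y_name])
        lines.append(
            f"{x_name} vs {y_name}: x_wins={xw:.4f} y_wins={yw:.4f} ties={t:.4f}"
        )
    return lines


# Hypothetical toy segment scores, not real COMET output.
toy = {
    "sys_a": [0.80, 0.75, 0.90, 0.70],
    "sys_b": [0.60, 0.55, 0.70, 0.50],
}
for line in report(toy):
    print(line)
```

Note that in the output above the bootstrap x-mean (0.7538) is close to, but not the same as, the comet-score value (0.7531) for the same system, which is why I would like the plain system score printed as well.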