Problems with MULTI-GPU scoring
🐛 Bug
Hi there! I found some problems with multi-GPU usage in COMET. When using 4 GPUs, the segment-level scores come back out of order, making it impossible to map them back to the original segments.
To Reproduce
I was running the following command:
comet-score -s source.txt -t target.txt --model wmt20-comet-qe-da --gpus 4
where 'source.txt' is a .txt file containing the source segments, one per line ('\n'-separated), and 'target.txt' contains the target segments to be scored in the same format. Since I did not have references, I used the QE model.
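For reference, here is a minimal Python sketch of the same scoring run through the library API (this assumes the COMET 1.x Python API, where download_model / load_from_checkpoint load the model and predict returns segment-level scores plus a system-level score):

```python
# Minimal sketch of the same QE scoring run via the Python API
# (assumption: COMET 1.x, where predict() returns (segment_scores, system_score)).
from comet import download_model, load_from_checkpoint

model_path = download_model("wmt20-comet-qe-da")
model = load_from_checkpoint(model_path)

# One source/target pair per line, mirroring the CLI inputs above.
with open("source.txt") as src, open("target.txt") as tgt:
    data = [{"src": s.rstrip("\n"), "mt": t.rstrip("\n")} for s, t in zip(src, tgt)]

seg_scores, sys_score = model.predict(data, batch_size=8, gpus=4)
```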
I ran the scorer on a sample of source.txt on CPU, on 1 GPU, and on 4 GPUs. The CPU and 1-GPU results were exactly the same, but with 4 GPUs the segment scores came back in a different order (see below).
(Experiment on CPU)
tgt.en Segment 0 score: 0.0000
tgt.en Segment 1 score: 0.0000
tgt.en Segment 2 score: 0.3730
tgt.en Segment 3 score: 0.0002
tgt.en Segment 4 score: 0.2696
tgt.en Segment 5 score: 0.2466
tgt.en Segment 6 score: 0.2901
tgt.en Segment 7 score: 0.2161
tgt.en Segment 8 score: 0.2324
tgt.en Segment 9 score: 0.2536
tgt.en score: 0.1882
(Experiment on 1 GPU)
tgt.en Segment 0 score: 0.0000
tgt.en Segment 1 score: 0.0000
tgt.en Segment 2 score: 0.3730
tgt.en Segment 3 score: 0.0002
tgt.en Segment 4 score: 0.2696
tgt.en Segment 5 score: 0.2466
tgt.en Segment 6 score: 0.2901
tgt.en Segment 7 score: 0.2161
tgt.en Segment 8 score: 0.2324
tgt.en Segment 9 score: 0.2536
tgt.en score: 0.1882
(Experiment on 4 GPUs)
tgt.en Segment 0 score: 0.0000
tgt.en Segment 1 score: 0.2695
tgt.en Segment 2 score: 0.2323
tgt.en Segment 3 score: 0.0000
tgt.en Segment 4 score: 0.2469
tgt.en Segment 5 score: 0.2536
tgt.en Segment 6 score: 0.3732
tgt.en Segment 7 score: 0.2899
tgt.en Segment 8 score: 0.0002
tgt.en Segment 9 score: 0.2165
tgt.en score: 0.1882
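A quick check (segment scores copied from the logs above) suggests the 4-GPU run produces the same values as the 1-GPU run, just in a different order:

```python
# Segment scores copied from the logs above, as printed.
one_gpu  = [0.0000, 0.0000, 0.3730, 0.0002, 0.2696, 0.2466, 0.2901, 0.2161, 0.2324, 0.2536]
four_gpu = [0.0000, 0.2695, 0.2323, 0.0000, 0.2469, 0.2536, 0.3732, 0.2899, 0.0002, 0.2165]

# Sorted, the two lists agree up to small precision noise, so the 4-GPU run
# appears to compute the same scores but return them out of order.
for a, b in zip(sorted(one_gpu), sorted(four_gpu)):
    print(f"{a:.4f}  {b:.4f}  diff={abs(a - b):.4f}")
```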
Environment
Experiments were carried out on an AWS EC2 g4dn.xlarge instance with COMET version 1.0.1.
Thank you!
Just to add to this: the system-level scores are correct; the problem is that we currently lose the order of the segment-level scores.
This should not affect system comparisons, but it will affect any segment-by-segment analysis!
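For what it's worth, the 4-GPU output above appears to list the segments in the order 0, 4, 8, 1, 5, 9, 2, 6, 3, 7, which matches what you get by concatenating the per-rank shards of a round-robin distributed sampler rank by rank. Below is a generic illustration of that effect and of how order can be restored by carrying the original indices along with the scores; this is a sketch of the likely cause and a possible workaround, not the actual refactor merged upstream:

```python
# Illustration: round-robin sharding across 4 ranks for 10 segments
# (assumption: indices are split as rank, rank + world_size, ...).
world_size, n = 4, 10
shards = [list(range(rank, n, world_size)) for rank in range(world_size)]
# shards == [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]

# Naively concatenating per-rank outputs yields this segment order:
gathered_indices = [i for shard in shards for i in shard]
print(gathered_indices)  # [0, 4, 8, 1, 5, 9, 2, 6, 3, 7] -- matches the 4-GPU log

# If each gathered score keeps its original index, re-sorting restores input order:
def restore_order(indices, scores):
    return [score for _, score in sorted(zip(indices, scores))]
```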
I refactored the multi-GPU inference.
This issue is the same as #101.
The fix will be merged in the next release.