COMET
Force encodings to match across commands
🐛 Bug
Encoding of I/O operations is handled inconsistently. Scoring expects UTF-8, but MBR decoding does not (see lines 188 and 191), and compare does so only sometimes (see lines 439 and 446). Running these commands in succession can raise encoding errors.
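The suspected failure mode can be sketched in isolation. This is a minimal illustration, not COMET's actual code: the helper names `write_lines`, `read_lines`, and `roundtrip` are hypothetical, and latin-1 is assumed as the writer's encoding. A file written without UTF-8 and then read back as UTF-8 fails as soon as it contains a non-ASCII character.

```python
import os
import tempfile


def write_lines(path, lines, encoding=None):
    # encoding=None makes open() fall back to the platform default,
    # which is the behaviour suspected in the MBR output step.
    with open(path, "w", encoding=encoding) as fp:
        fp.writelines(line + "\n" for line in lines)


def read_lines(path, encoding="utf-8"):
    # Mirrors a reader that always expects UTF-8.
    with open(path, encoding=encoding) as fp:
        return [line.strip() for line in fp.readlines()]


def roundtrip(lines, write_encoding, read_encoding="utf-8"):
    # Write with one encoding, read with another (hypothetical helper).
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mbr_out.txt")
        write_lines(path, lines, encoding=write_encoding)
        return read_lines(path, encoding=read_encoding)
```

Calling `roundtrip(["caf\u00e9 au lait"], "utf-8")` returns the line unchanged, while `roundtrip(["caf\u00e9 au lait"], "latin-1")` raises `UnicodeDecodeError`. Passing the same explicit `encoding` to every `open()` call would make the commands agree regardless of the system locale.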
To Reproduce
comet-mbr -s "./nmt_adapt/translations/${TRANSLATIONS_FILE}/source.txt" -t "./nmt_adapt/translations/${TRANSLATIONS_FILE}/mbr_samples.txt" --num_samples 12 -o "./nmt_adapt/translations/${TRANSLATIONS_FILE}/mbr_comet.txt"
comet-score -s "./nmt_adapt/translations/${TRANSLATIONS_FILE}/source.txt" -t "./nmt_adapt/translations/${TRANSLATIONS_FILE}/mbr_comet.txt" -r "./nmt_adapt/translations/${TRANSLATIONS_FILE}/references.txt" --quiet
Here, all files except for "./nmt_adapt/translations/${TRANSLATIONS_FILE}/mbr_comet.txt" are explicitly UTF-8 encoded and cause no issues when used with comet-score.
Throws error:
Traceback (most recent call last):
  File "/home/ivov/.conda/envs/nmt_eval/bin/comet-score", line 8, in <module>
    sys.exit(score_command())
  File "/home/ivov/.conda/envs/nmt_eval/lib/python3.9/site-packages/comet/cli/score.py", line 198, in score_command
    translations.append([line.strip() for line in fp.readlines()])
  File "/home/ivov/.conda/envs/nmt_eval/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 5824: invalid continuation byte
I believe the system's default encoding is latin-1, but I can't be certain.
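To remove the uncertainty about the default encoding, a short check like the following could be run in the same environment. It is only a sketch: `describe_default_encodings` is a hypothetical helper name, and it relies on `open()` using `locale.getpreferredencoding(False)` when no `encoding` argument is given, as it does on Python 3.9.

```python
import locale
import sys


def describe_default_encodings():
    # "preferred" is what open() uses when no encoding is passed.
    return {
        "preferred": locale.getpreferredencoding(False),
        "filesystem": sys.getfilesystemencoding(),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_default_encodings().items():
        print(f"{name}: {value}")
```

If `preferred` reports something other than UTF-8, any file written without an explicit `encoding` would not be readable by a step that expects UTF-8.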
Expected behaviour
Scoring to continue as usual.
Environment
OS: Linux-4.19.0-20-amd64-x86_64-with-glibc2.28
Packaging: pip
Version: 1.1.0