Force encodings to match across commands

Open ioverho opened this issue 3 years ago • 0 comments

🐛 Bug

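File I/O uses inconsistent encodings across commands. Scoring reads its input files as UTF-8, but MBR decoding does not specify an encoding (see lines 188 and 191), and compare only does so in some places (see lines 439 and 446). Running these commands in succession can therefore raise encoding errors, because one command may write a file in the system's default encoding and the next may try to read it as UTF-8.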
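The mismatch can be illustrated with a minimal, self-contained sketch. This is not COMET's code: the helpers `write_lines`, `read_lines`, and `roundtrip_fails` are hypothetical names used only for illustration, and `latin-1` stands in for whatever the locale default happens to be. It shows that a file written in one encoding and read as UTF-8 can raise `UnicodeDecodeError`, while passing the same explicit `encoding` on both sides avoids it.

```python
import os
import tempfile


def write_lines(path, lines, encoding="utf-8"):
    """Write lines with an explicit encoding instead of the locale default."""
    with open(path, "w", encoding=encoding) as fp:
        fp.write("\n".join(lines) + "\n")


def read_lines(path, encoding="utf-8"):
    """Read stripped lines with an explicit encoding."""
    with open(path, encoding=encoding) as fp:
        return [line.strip() for line in fp.readlines()]


def roundtrip_fails(write_encoding, read_encoding, lines):
    """Return True if reading the file back raises UnicodeDecodeError."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mbr_output.txt")  # hypothetical file name
        write_lines(path, lines, encoding=write_encoding)
        try:
            read_lines(path, encoding=read_encoding)
        except UnicodeDecodeError:
            return True
        return False


sample = ["Das ist schön.", "C'est déjà prêt."]
# Writer uses latin-1 (as a locale default might), reader expects UTF-8.
print(roundtrip_fails("latin-1", "utf-8", sample))
# Both sides agree on UTF-8.
print(roundtrip_fails("utf-8", "utf-8", sample))
```

Under these assumptions, a possible fix would be to pass `encoding="utf-8"` to every `open()` call in all commands, so that output written by one command can always be read by the next.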

To Reproduce

comet-mbr -s "./nmt_adapt/translations/${TRANSLATIONS_FILE}/source.txt" -t "./nmt_adapt/translations/${TRANSLATIONS_FILE}/mbr_samples.txt" --num_samples 12 -o "./nmt_adapt/translations/${TRANSLATIONS_FILE}/mbr_comet.txt"

comet-score -s "./nmt_adapt/translations/${TRANSLATIONS_FILE}/source.txt" -t "./nmt_adapt/translations/${TRANSLATIONS_FILE}/mbr_comet.txt" -r "./nmt_adapt/translations/${TRANSLATIONS_FILE}/references.txt" --quiet

Here, all files except "./nmt_adapt/translations/${TRANSLATIONS_FILE}/mbr_comet.txt" are explicitly encoded in UTF-8 and cause no issues when used with comet-score.

The second command throws the following error:

Traceback (most recent call last):
  File "/home/ivov/.conda/envs/nmt_eval/bin/comet-score", line 8, in <module>
    sys.exit(score_command())
  File "/home/ivov/.conda/envs/nmt_eval/lib/python3.9/site-packages/comet/cli/score.py", line 198, in score_command
    translations.append([line.strip() for line in fp.readlines()])
  File "/home/ivov/.conda/envs/nmt_eval/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 5824: invalid continuation byte

The system's default encoding is probably latin-1, but I can't be certain.
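One way to confirm the default would be to ask Python which encoding `open()` falls back to when no `encoding` argument is given. The sketch below uses only the standard library; the function name `default_text_encoding` is hypothetical and chosen for illustration.

```python
import locale


def default_text_encoding():
    """Return the encoding open() uses when no `encoding` argument is given."""
    return locale.getpreferredencoding(False)


print("Default text encoding:", default_text_encoding())
```

Running this in the same environment as the failing commands should show whether files written without an explicit encoding end up in something other than UTF-8.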

Expected behaviour

Scoring should proceed as usual, without encoding errors, when the output of comet-mbr is passed to comet-score.

Environment

OS: Linux-4.19.0-20-amd64-x86_64-with-glibc2.28
Packaging: pip
Version: 1.1.0

Jun 10 '22 10:06 ioverho