prepare_vocab.py throwing UnicodeDecodeError

Open rpratesh opened this issue 6 years ago • 1 comments

If I run prepare_vocab.py on a German text corpus, I get the following error:

Traceback (most recent call last):
  File "prepare_vocab.py", line 41, in <module>
    for index, line in enumerate(source_file):
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 111: ordinal not in range(128)

The command I've run is:

python3 prepare_vocab.py /docker_files/german_ds/text_corpus/German_sentences_8mil_filtered_maryfied.txt /docker_files/german_ds/output/clean_vocab.txt

Feb 25 '19 08:02 rpratesh

Solved: In prepare_vocab.py, replace the line

    with open(args.source_path, 'r') as source_file, open(args.target_path, 'w') as target_file:

with

    with open(args.source_path, encoding='utf-8', mode='r') as source_file, open(args.target_path, encoding='utf-8', mode='w') as target_file:

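The error occurs because the corpus is UTF-8 encoded (0xc3 is the first byte of UTF-8 characters such as ä, ö, ü and ß), while open() was decoding it as ASCII. When no encoding is passed, Python 3's open() uses the locale's preferred encoding, which was evidently ASCII in this environment (the traceback goes through encodings/ascii.py). Passing encoding='utf-8' explicitly removes that dependency on the locale.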

Feb 25 '19 11:02 rpratesh
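
For anyone who wants to reproduce the problem outside of prepare_vocab.py, here is a minimal sketch. It does not use the script or the original corpus: the sample sentence and the temporary file name are made up for illustration. It writes one German line as UTF-8, shows that decoding it as ASCII fails in the same way as the traceback above, and that an explicit encoding='utf-8' reads it back correctly.

```python
import locale
import os
import tempfile

# Hypothetical sample line; the real corpus is not reproduced here.
sample = "Die Straße ist schön\n"

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, used only for this demonstration.
    path = os.path.join(tmp, "corpus.txt")

    # Write the text as UTF-8, the way the corpus is presumably stored.
    with open(path, mode="w", encoding="utf-8") as f:
        f.write(sample)

    # The encoding open() falls back to when none is passed explicitly.
    print("default encoding:", locale.getpreferredencoding(False))

    # Forcing ASCII mimics an environment whose locale default is ASCII.
    try:
        with open(path, mode="r", encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError as exc:
        ascii_failed = True
        print("ascii failed:", exc)

    # An explicit UTF-8 encoding decodes the same bytes without error.
    with open(path, mode="r", encoding="utf-8") as f:
        decoded = f.read()

print("utf-8 read ok:", decoded == sample)
```

If the first print shows an ASCII-based encoding (for example ANSI_X3.4-1968) in your container, that is the default open() was using, and it would explain why the script failed without the explicit encoding argument.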