prepare_vocab.py throwing UnicodeDecodeError
If I run prepare_vocab.py on a German text corpus, I get the following error:
Traceback (most recent call last):
  File "prepare_vocab.py", line 41, in
    for index, line in enumerate(source_file):
  File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 111: ordinal not in range(128)
The command I ran is:
python3 prepare_vocab.py /docker_files/german_ds/text_corpus/German_sentences_8mil_filtered_maryfied.txt /docker_files/german_ds/output/clean_vocab.txt
Solved:
In prepare_vocab.py, replace the line
with open(args.source_path, 'r') as source_file, open(args.target_path, 'w') as target_file:
with
with open(args.source_path, encoding='utf-8', mode='r') as source_file, open(args.target_path, encoding='utf-8', mode='w') as target_file:
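For anyone who wants to reproduce this outside the script, here is a minimal sketch. The helper copy_lines, the function demo and the sample sentence are made up for illustration and are not part of prepare_vocab.py; only the open(...) call mirrors the fix above. It writes one UTF-8 line to a temporary file, shows that reading it as ASCII fails, and that reading it as UTF-8 works.

```python
import os
import tempfile

# Hypothetical sample line; any text containing umlauts would do.
SAMPLE = "Grüße aus München\n"


def copy_lines(source_path, target_path, encoding):
    """Copy every line from source to target using the given encoding."""
    count = 0
    with open(source_path, encoding=encoding, mode='r') as source_file, \
            open(target_path, encoding=encoding, mode='w') as target_file:
        for index, line in enumerate(source_file):
            target_file.write(line)
            count = index + 1
    return count


def demo():
    """Return (ascii_failed, lines_copied_with_utf8)."""
    tmp_dir = tempfile.mkdtemp()
    source_path = os.path.join(tmp_dir, "corpus.txt")
    target_path = os.path.join(tmp_dir, "vocab.txt")

    # Write the raw UTF-8 bytes so the result does not depend on the locale.
    with open(source_path, "wb") as raw_file:
        raw_file.write(SAMPLE.encode("utf-8"))

    try:
        copy_lines(source_path, target_path, "ascii")
        ascii_failed = False
    except UnicodeDecodeError:
        # Same failure as in the traceback: 0xc3 is not valid ASCII.
        ascii_failed = True

    lines_copied = copy_lines(source_path, target_path, "utf-8")
    return ascii_failed, lines_copied
```

Passing encoding explicitly on both the input and the output file keeps the script independent of whatever locale the machine or container happens to have.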
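The error occurs because open() without an encoding argument uses the locale's preferred encoding, which in this environment was ASCII (common in a container with no locale configured), while the corpus is UTF-8 encoded. The byte 0xc3 is the first byte of characters such as ä, ö, ü and ß in UTF-8, which the 'ascii' codec cannot decode.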
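To check which encoding open() falls back to on a given machine, something like the following can be run. This is only a diagnostic sketch; the function name describe_default_encodings is my own, while the locale and sys calls are from the standard library.

```python
import locale
import sys


def describe_default_encodings():
    """Collect the encodings Python falls back to when none is given."""
    return {
        # Used by open() in text mode when no encoding argument is passed.
        "open() default": locale.getpreferredencoding(False),
        # Used for file names, not file contents.
        "filesystem": sys.getfilesystemencoding(),
        # Used when printing to the terminal; may be None if stdout is replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in describe_default_encodings().items():
        print(name, "->", value)
```

If the first entry reports an ASCII variant such as ANSI_X3.4-1968, the script will hit the error above on any non-ASCII input. Setting a UTF-8 locale in the environment (for example LANG=C.UTF-8, assuming the image provides it) should also change this default, but passing encoding='utf-8' in the code is the more portable fix.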