UnicodeDecodeError: 'cp949' codec can't decode bytes
I'm getting this error on some specific rtf files.
Stack trace:
  File "/Library/Python/2.7/site-packages/textract/parsers/__init__.py", line 57, in process
    return parser.process(filename, encoding, **kwargs)
  File "/Library/Python/2.7/site-packages/textract/parsers/utils.py", line 45, in process
    unicode_string = self.decode(byte_string)
  File "/Library/Python/2.7/site-packages/textract/parsers/utils.py", line 64, in decode
    return text.decode(result['encoding'])
Example: the attached RTF file (zipped), PARTNERSHIP INTEREST PURCHASE AGREEMENT.rtf.zip
Thank you for providing the example! I am pretty sure this is a chardet version problem. I was able to successfully extract the text from your file when I pip install chardet==2.1.1. I am going to pin chardet to that version until the issue is resolved; hopefully that fixes the issue for you!
Bummer. Rolling chardet back to 2.1.1 will work with py2 but it does not work with py3. I'm going to leave this open until https://github.com/chardet/chardet/issues/98 is resolved. This issue will serve as documentation of the workaround for py2 users in the meantime.
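For anyone who cannot pin chardet (e.g. on py3), here is a rough sketch of a more defensive decode step. This is not what textract does; decode_with_fallback and its fallbacks parameter are hypothetical names made up for illustration. It assumes you already have the raw bytes and whatever encoding name the detector guessed (which may be wrong, or None), and it falls back to lossy decoding instead of raising.

```python
def decode_with_fallback(byte_string, detected_encoding, fallbacks=("utf-8",)):
    """Decode bytes, tolerating a wrong guess from the encoding detector.

    Hypothetical helper, not part of textract or chardet.
    """
    # Try the detector's guess first (if it produced one), then the fallbacks.
    candidates = [detected_encoding] if detected_encoding else []
    candidates.extend(fallbacks)

    for encoding in candidates:
        try:
            return byte_string.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # UnicodeDecodeError: the bytes are not valid in this encoding.
            # LookupError: Python does not know this encoding name.
            continue

    # Last resort: keep what can be decoded and mark the rest with U+FFFD.
    return byte_string.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # A trailing 0xE9 is an incomplete sequence in both cp949 and UTF-8,
    # so this input ends up in the lossy last-resort branch.
    print(repr(decode_with_fallback(b"caf\xe9", "cp949")))
```

The trade-off is that undecodable bytes come back as replacement characters rather than an exception, which may or may not be acceptable for your documents.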
I was having the same error on Ubuntu. I just installed unoconv:
sudo apt install unoconv
and used it to convert doc to docx (with exception handling).
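A minimal sketch of what that conversion step might look like from Python, assuming unoconv is on the PATH. The function names are hypothetical, and the returned path assumes unoconv writes the converted file next to the input with the new extension, so check that on your setup.

```python
import os
import subprocess


def build_unoconv_command(path, target_format="docx"):
    """Build the command line; -f selects the output format."""
    return ["unoconv", "-f", target_format, path]


def convert_with_unoconv(path, target_format="docx"):
    """Return the expected path of the converted file, or None on failure."""
    try:
        subprocess.run(build_unoconv_command(path, target_format), check=True)
    except (OSError, subprocess.CalledProcessError) as exc:
        # OSError: unoconv is not installed or not executable.
        # CalledProcessError: unoconv ran but exited with an error.
        print("unoconv conversion failed: %s" % exc)
        return None
    # Assumption: output lands beside the input with the new extension.
    return os.path.splitext(path)[0] + "." + target_format
```

The converted file can then be passed to textract in place of the original.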