UnicodeDecodeError: 'cp949' codec can't decode bytes

Open askemottelson opened this issue 10 years ago • 3 comments

I'm getting this error on some specific rtf files.

Stack trace:

  File "/Library/Python/2.7/site-packages/textract/parsers/__init__.py", line 57, in process
    return parser.process(filename, encoding, **kwargs)
  File "/Library/Python/2.7/site-packages/textract/parsers/utils.py", line 45, in process
    unicode_string = self.decode(byte_string)
  File "/Library/Python/2.7/site-packages/textract/parsers/utils.py", line 64, in decode
    return text.decode(result['encoding'])

e.g. attached rtf-file (zipped) PARTNERSHIP INTEREST PURCHASE AGREEMENT.rtf.zip

askemottelson, Mar 23 '16 14:03
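
The last frame of the trace decodes the bytes with whatever encoding the detector reported, so a wrong guess (here cp949) raises immediately. The following is a minimal sketch of a more defensive decode, not textract's actual code: the function name `decode_with_fallback` and the fallback list are assumptions chosen for illustration, and the detected encoding is passed in as a plain string rather than taken from chardet.

```python
def decode_with_fallback(byte_string, detected_encoding,
                         fallbacks=("utf-8", "latin-1")):
    """Try the detected encoding first, then a fixed list of fallbacks.

    Hypothetical helper: neither the name nor the fallback order comes
    from textract or chardet.
    """
    candidates = [detected_encoding] if detected_encoding else []
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return byte_string.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate.
            continue
    # Last resort: keep what is decodable and mark the rest.
    return byte_string.decode("utf-8", errors="replace")
```

Because latin-1 maps every byte to a character, the loop never fails when it is in the list, but the result may be mojibake rather than the intended text. Whether that is acceptable depends on what the extracted text is used for.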

Thank you for providing the example! I am pretty sure this is a chardet version problem. I was able to successfully extract the text from your file after running pip install chardet==2.1.1. I am going to pin chardet to that version until the issue is resolved; hopefully that fixes the issue for you!

deanmalmgren, Mar 24 '17 12:03

Bummer. Rolling chardet back to 2.1.1 will work with py2 but it does not work with py3. I'm going to leave this open until https://github.com/chardet/chardet/issues/98 is resolved. This issue will serve as documentation of the workaround for py2 users in the meantime.

deanmalmgren, Mar 28 '17 11:03

I was having the same error on Ubuntu. I installed unoconv (sudo apt install unoconv) and used it to convert doc to docx. I also used exception handling.

mohammedyunus009, Oct 18 '18 09:10
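
The convert-then-retry workaround from the last comment could look roughly like the sketch below. It is an illustration under assumptions, not the commenter's code: the helper names are hypothetical, `extract` stands for any callable that takes a path (for example `textract.process`), and the sketch assumes unoconv writes its output next to the input file with the new extension. It uses `subprocess.run`, so it targets Python 3.

```python
import os
import subprocess


def build_unoconv_command(path, target_format="docx"):
    # unoconv's -f flag selects the output format.
    return ["unoconv", "-f", target_format, path]


def convert_with_unoconv(path, target_format="docx"):
    # Raises CalledProcessError if unoconv exits with a non-zero status.
    subprocess.run(build_unoconv_command(path, target_format), check=True)
    # Assumption: the converted file lands beside the original.
    return os.path.splitext(path)[0] + "." + target_format


def extract_with_fallback(path, extract, convert=convert_with_unoconv):
    """Try extracting directly; on a decode error, convert and retry once."""
    try:
        return extract(path)
    except UnicodeDecodeError:
        converted_path = convert(path)
        return extract(converted_path)
```

A call might look like `extract_with_fallback("contract.doc", textract.process)`. Passing `extract` and `convert` as parameters keeps the retry logic independent of textract and unoconv, which also makes it easy to exercise without either installed.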