Crashes on every file I tested (more than 100) with UnicodeEncodeError.

Open ruslankiskinov opened this issue 1 year ago • 3 comments

The tool crashes with a UnicodeEncodeError on every PDF file I tested, and each file fails on a different character. It doesn't even try to skip the offending character; it just crashes and the output is empty, which makes the tool useless. I tested files in Cyrillic, French, and German, and some in English too. It only manages to convert a file if the file is extremely simple.

Unfortunately, I can't share these examples here.

Environment: Windows / Python 3.12

Error: UnicodeEncodeError: 'charmap' codec can't encode character '\xfc' in position 1809: character maps to <undefined>

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Dev\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "D:\Dev\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
  File "D:\Dev\Python\Python312\Lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\xfc' in position 1809: character maps to <undefined>

OR for an Excel file:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Dev\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "D:\Dev\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
  File "D:\Dev\Python\Python312\Lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u0144' in position 296: character maps to <undefined>

I can provide an example with the manual of my SONY headphones: https://www.sony.com/electronics/support/res/manuals/4559/45598331M.pdf

Here is the error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Dev\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "D:\Dev\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
  File "D:\Dev\Python\Python312\Lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\xe7' in position 23: character maps to <undefined>

I don't know why it tries to use the CP1251 code page, as the file is a PDF with no Cyrillic content in it.
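One likely explanation, offered as a sketch rather than a confirmed diagnosis: the `cp1251.py` frame in the traceback is reached from `print(result.text_content)`, so the codec appears to be the one Python picked for standard output, not anything derived from the PDF itself. CP1251 has no slot for the three characters reported above (`ü`, `ń`, `ç`), and the snippet below reproduces that without markitdown. The helper name `try_encode` is made up for illustration.

```python
# Characters taken from the three tracebacks: U+00FC, U+0144, U+00E7.
SAMPLES = ["\xfc", "\u0144", "\xe7"]


def try_encode(text, encoding="cp1251", errors="strict"):
    """Return the encoded bytes, or None if the codec cannot represent the text."""
    try:
        return text.encode(encoding, errors)
    except UnicodeEncodeError:
        return None


for ch in SAMPLES:
    # ascii() keeps this loop printable on any console code page.
    print(ascii(ch), "cp1251:", try_encode(ch), "utf-8:", try_encode(ch, "utf-8"))
```

With `errors="replace"` the same call returns `?` instead of raising, which is roughly the "skip the character" behaviour asked for above, at the cost of losing the character.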

Jan 17 '25 15:01 ruslankiskinov

Try this as a workaround:

>chcp 65001
Active code page: 65001

>set PYTHONIOENCODING=utf-8

>markitdown my_document.pdf > my_document.md
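The workaround above only helps if the Python process actually sees `PYTHONIOENCODING`. According to the Python documentation for `sys.stdout`, on Windows the console device uses UTF-8, while output redirected to a file or pipe uses the system locale encoding (the ANSI code page), which `chcp` does not change. Note also that `set NAME=value` is cmd.exe syntax; in PowerShell the environment variable would be set with `$env:PYTHONIOENCODING = "utf-8"`. The following is a small diagnostic sketch for checking what Python will use; `describe_output_encoding` is a hypothetical helper name.

```python
import locale
import os
import sys


def describe_output_encoding(stream=None):
    """Collect the settings that decide how text written to the stream is encoded."""
    stream = sys.stdout if stream is None else stream
    return {
        "stream_encoding": getattr(stream, "encoding", None),
        "isatty": stream.isatty() if hasattr(stream, "isatty") else False,
        "locale_encoding": locale.getpreferredencoding(False),
        "PYTHONIOENCODING": os.environ.get("PYTHONIOENCODING"),
        "PYTHONUTF8": os.environ.get("PYTHONUTF8"),
    }


if __name__ == "__main__":
    for key, value in describe_output_encoding().items():
        print(f"{key}: {value}")
```

Running it once directly and once with `> out.txt` shows whether redirection changes `stream_encoding`, and whether the environment variable reached the process at all.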

Jan 18 '25 16:01 kristofmulier

This workaround unfortunately didn't work:

❯ chcp 65001
Active code page: 65001
❯ set PYTHONIOENCODING=utf-8
❯ markitdown t:\Downloads\45598331M.pdf > t:\sony.md
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Dev\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "D:\Dev\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
  File "D:\Dev\Python\Python312\Lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\xe7' in position 23: character maps to <undefined>

I also tried other code pages such as 1252, 1250, etc. but changing the code page has no effect.
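Since changing the console code page had no effect, another option is to avoid `print` and shell redirection altogether and write the converted text to disk with an explicit encoding. This is a minimal sketch, assuming the text is already available as a Python `str`; `save_markdown` is a hypothetical helper and not part of markitdown.

```python
from pathlib import Path


def save_markdown(text: str, out_path: str) -> int:
    """Write text as UTF-8, independent of console or locale code page.

    Returns the number of bytes written.
    """
    data = text.encode("utf-8")  # UTF-8 can represent every Unicode character
    Path(out_path).write_bytes(data)
    return len(data)
```

Assuming the library API matches its README, the `text` argument would come from something like `MarkItDown().convert(path).text_content`, the same attribute the traceback shows being printed. That call is not exercised here.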

Jan 19 '25 17:01 ruslankiskinov

Same issues

Jan 20 '25 01:01 MrR0990