For every PDF file I tested, the tool crashes with a UnicodeEncodeError. In each file it fails on a different character.
The problem is that it doesn't even try to skip the character; it just crashes and leaves the output empty, which makes the tool useless.
I tested files in Cyrillic, French, and German, and some in English too. Only extremely simple files convert successfully.
Unfortunately, I can't share these files here.
Environment:
Windows / Python 3.12
Error:
UnicodeEncodeError: 'charmap' codec can't encode character '\xfc' in position 1809: character maps to <undefined>
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Dev\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "D:\Dev\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
  File "D:\Dev\Python\Python312\Lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\xfc' in position 1809: character maps to <undefined>
Or, for an Excel file:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Dev\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "D:\Dev\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
  File "D:\Dev\Python\Python312\Lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\u0144' in position 296: character maps to <undefined>
I can provide an example with the manual of my SONY headphones:
https://www.sony.com/electronics/support/res/manuals/4559/45598331M.pdf
Here is the error:
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Dev\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "D:\Dev\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
  File "D:\Dev\Python\Python312\Lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\xe7' in position 23: character maps to <undefined>
I don't know why it tries to use the CP1251 code page, as the file is a PDF with no Cyrillic content in it.
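A likely explanation, offered as a sketch and not a confirmed diagnosis: when stdout is not a real console (for example, when it is redirected to a file or pipe), Python on Windows encodes `print()` output with the system ANSI code page, which is CP1251 on a Russian-locale system, regardless of the language of the document. The helper names `describe_stdout` and `unencodable_chars` are made up for illustration and are not part of markitdown.

```python
import locale
import sys


def describe_stdout():
    """Report the settings that decide how print() encodes text."""
    return {
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "stdout_is_console": sys.stdout.isatty(),
        "locale_encoding": locale.getpreferredencoding(False),
    }


def unencodable_chars(text, encoding):
    """Return the characters in text that the given codec cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


if __name__ == "__main__":
    print(describe_stdout())
    # The three characters reported in the tracebacks: ü, ç, ń
    print(unencodable_chars("f\xfcr \xe7a \u0144", "cp1251"))
```

If `stdout_encoding` reports `cp1251` while the output is being redirected, that would match the `encodings\cp1251.py` frame in the tracebacks.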
Try this as a workaround:
>chcp 65001
Active code page: 65001
>set PYTHONIOENCODING=utf-8
>markitdown my_document.pdf > my_document.md
This workaround unfortunately didn't work:
❯ chcp 65001
Active code page: 65001
❯ set PYTHONIOENCODING=utf-8
❯ markitdown t:\Downloads\45598331M.pdf > t:\sony.md
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "d:\Dev\Python\Python312\Scripts\markitdown.exe\__main__.py", line 7, in <module>
  File "D:\Dev\Python\Python312\Lib\site-packages\markitdown\__main__.py", line 43, in main
    print(result.text_content)
  File "D:\Dev\Python\Python312\Lib\encodings\cp1251.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
UnicodeEncodeError: 'charmap' codec can't encode character '\xe7' in position 23: character maps to <undefined>
I also tried other code pages, such as 1252 and 1250, but changing the code page has no effect.
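One possible reason neither step helped, stated as an assumption: `chcp` changes only the console code page, while Python takes the encoding for redirected output from the ANSI code page. In addition, the `❯` prompt suggests PowerShell, where `set NAME=value` does not set an environment variable (the PowerShell form is `$env:PYTHONIOENCODING = "utf-8"`). Independently of the shell, a sketch that avoids relying on the stdout encoding could look like the following. `save_markdown` and `emit_text` are hypothetical helper names, not part of markitdown.

```python
import io
import sys
from pathlib import Path


def save_markdown(text, path):
    """Write converted text to path as UTF-8, independent of the code page."""
    Path(path).write_text(text, encoding="utf-8")


def emit_text(text, stream=None):
    """Write text to a stream, substituting '?' for characters its codec lacks."""
    stream = sys.stdout if stream is None else stream
    encoding = getattr(stream, "encoding", None) or "utf-8"
    # Round-trip through the stream's codec so the final write cannot fail.
    safe = text.encode(encoding, errors="replace").decode(encoding)
    stream.write(safe)


if __name__ == "__main__":
    # Simulate a cp1251 stdout, as in the tracebacks above.
    raw = io.BytesIO()
    fake_stdout = io.TextIOWrapper(raw, encoding="cp1251")
    emit_text("f\xfcr", fake_stdout)
    fake_stdout.flush()
    print(raw.getvalue())
```

With markitdown installed, the text would come from `MarkItDown().convert(path).text_content`, the same `result.text_content` that the traceback shows being printed. That call follows the project README and is not exercised in this sketch. `save_markdown` keeps every character, while `emit_text` trades unencodable characters for `?` so that the output is at least not empty.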