markitdown
[bug] Markitdown failed to convert pdf that contains image
Attached file: cn_dissertation_1st_page.pdf
When trying to convert the attached file with
result = md.convert(file_path)
return result.text_content
I got the following error:
Traceback (most recent call last):
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/markitdown/_markitdown.py", line 1239, in _convert
    res = converter.convert(local_path, **_kwargs)
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/markitdown/_markitdown.py", line 490, in convert
    text_content=pdfminer.high_level.extract_text(local_path),
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 64, in __init__
    resolve1(mediabox_param) for mediabox_param in self.attrs["MediaBox"]
KeyError: 'MediaBox'
I am running it on Python 3.10 on macOS 15.2 (24C101).
After consulting with o1 and tinkering with it, I realized the cause: I am using PyMuPDF to reconstruct the PDF page, and the reconstructed page is missing the MediaBox entry that pdfminer reads in the traceback above.
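For anyone hitting the same KeyError, here is a minimal sketch of a wrapper that turns the failure into a more descriptive error. The names `MissingMediaBoxError` and `convert_or_explain` are hypothetical and not part of markitdown or pdfminer. The sketch matches on the exception message, on the assumption that the failure reaches the caller either as a bare `KeyError('MediaBox')` or wrapped in another exception whose message still mentions MediaBox. I have not checked how every markitdown version reports it.

```python
from typing import Any, Callable


class MissingMediaBoxError(ValueError):
    """Hypothetical error type: a PDF page has no MediaBox entry."""


def convert_or_explain(convert: Callable[[str], Any], file_path: str) -> str:
    """Run `convert` and turn a MediaBox failure into a clearer error.

    `convert` is any callable returning an object with a `text_content`
    attribute, such as the `md.convert` method used in the report.
    """
    try:
        result = convert(file_path)
    except Exception as exc:
        # Heuristic (assumption): match on the message, because the failure
        # may surface as KeyError('MediaBox') or inside a wrapper exception.
        if "MediaBox" in str(exc):
            raise MissingMediaBoxError(
                f"{file_path}: a page has no /MediaBox entry; "
                "the PDF may have been rebuilt without page metadata"
            ) from exc
        raise
    return result.text_content
```

With the objects from the report, the call would look like `text = convert_or_explain(md.convert, file_path)`. Unrelated exceptions are re-raised unchanged, so the wrapper only affects the MediaBox case.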
Check out #139.
So is the expectation that, for this to work with PDFs that contain images, you write your own plugin? This limitation should be made clear.