[bug] Markitdown failed to convert pdf that contains image

Open Drjunchenfeng opened this issue 1 year ago • 3 comments

cn_dissertation_1st_page.pdf (attached). When I try to analyze the attached file with

    result = md.convert(file_path)
    return result.text_content

I get the following error:

Traceback (most recent call last):
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/markitdown/_markitdown.py", line 1239, in _convert
    res = converter.convert(local_path, **_kwargs)
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/markitdown/_markitdown.py", line 490, in convert
    text_content=pdfminer.high_level.extract_text(local_path),
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/high_level.py", line 169, in extract_text
    for page in PDFPage.get_pages(
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 171, in get_pages
    for (pageno, page) in enumerate(cls.create_pages(doc)):
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 127, in create_pages
    yield cls(document, objid, tree, next(page_labels))
  File "/Users/fengjunchen/Library/Caches/pypoetry/virtualenvs/ai-advisor-IuYXIAy1-py3.10/lib/python3.10/site-packages/pdfminer/pdfpage.py", line 64, in __init__
    resolve1(mediabox_param) for mediabox_param in self.attrs["MediaBox"]
KeyError: 'MediaBox'

I am running this with Python 3.10 on macOS 15.2 (24C101).

Drjunchenfeng, Dec 26 '24 07:12

After consulting o1 and tinkering with it, I realized the cause: I am using pymupdf to reconstruct the PDF page, and the reconstructed page is missing this metadata (the MediaBox entry that pdfminer looks up in the traceback above).
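
For anyone hitting the same traceback, here is a minimal sketch of a guard around the conversion call, so a page without a MediaBox gives a readable message instead of a crash. The helper name `convert_or_explain` is hypothetical and not part of markitdown. It also assumes the word "MediaBox" shows up in the text of whatever exception reaches the caller (a bare KeyError or a wrapped error that includes the traceback), which I have not verified across markitdown versions.

```python
from typing import Any, Callable, Optional, Tuple


def convert_or_explain(
    convert: Callable[[str], Any], file_path: str
) -> Tuple[Optional[str], Optional[str]]:
    """Return (text, None) on success or (None, problem) for a MediaBox failure.

    `convert` is any callable shaped like `md.convert`: it takes a path and
    returns an object with a `text_content` attribute.
    """
    try:
        result = convert(file_path)
    except Exception as exc:
        # Assumption: the failure mentions 'MediaBox' in its message, whether it
        # is the original KeyError or an error that wraps the traceback.
        if "MediaBox" in str(exc):
            problem = (
                f"{file_path}: a page has no MediaBox entry; "
                "the PDF was probably rebuilt without page size metadata"
            )
            return None, problem
        # Anything else is a different problem, so let it propagate unchanged.
        raise
    return result.text_content, None
```

Usage would look like `text, problem = convert_or_explain(md.convert, file_path)`, with `md` being the same object as in the snippet at the top of this issue. This only reports the problem; the actual fix is still to write a MediaBox onto the page when reconstructing the PDF.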

Drjunchenfeng, Dec 26 '24 08:12

Check out #139.

l-lumin, Dec 26 '24 08:12

So is the expectation that, for this to work with PDFs that contain images, you write your own plugin? This limitation should be made clear.

supernitin, May 02 '25 20:05