
update: change pdf text parser to pymupdf4llm

Open tungsten106 opened this issue 1 year ago • 13 comments

Use pymupdf4llm instead of pdfminer to parse PDF contents into Markdown, as suggested by #131.

Pros and Cons:

  • pdfminer extracts text only, so the generated files have no headings, titles, etc. pymupdf4llm, by contrast, can produce richer Markdown features, including different heading levels, code blocks, and images (which could be saved to a specific path, but that is not included in this commit).
  • However, pymupdf4llm may easily emit lines of digits that belong to plots, and create tables that do not exist in the source. This is a common problem for most PDF parsers, except those using OCR models (such as marker and MinerU).

tungsten106 avatar Dec 19 '24 08:12 tungsten106

maybe this can patch #142

alphaleadership avatar Dec 19 '24 10:12 alphaleadership

I think it's better to let the user choose the engine rather than replacing it

I agree. There are pros and cons to each. The main thing is to allow a common interface.

Can you propose an interface for this? One option is to just call register_page_converter externally, and the precedence logic would give precedence to whichever converter is registered later. (see here for example: https://github.com/microsoft/markitdown/blob/925c4499f72757abcf6cb521ee10e4844967af3d/src/markitdown/_markitdown.py#L1269C1-L1287C1)

Another option would be to see which dependencies are installed (though this is more opaque)
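
As a rough illustration of the two options above, the sketch below keeps a list of converters in which the most recently registered one is tried first, and shows how installed dependencies could be detected with importlib. The names ConverterRegistry, register, and available_engines are hypothetical and are not markitdown's API; the real precedence logic lives in the linked _markitdown.py.

```python
import importlib.util
from typing import Callable, Optional

# Hypothetical converter type: takes a path, returns Markdown,
# or None if the converter declines to handle the file.
Converter = Callable[[str], Optional[str]]


class ConverterRegistry:
    """Toy registry (not markitdown's): later registrations take precedence."""

    def __init__(self) -> None:
        self._converters: list[Converter] = []

    def register(self, converter: Converter) -> None:
        # Insert at the front so the newest converter is tried first.
        self._converters.insert(0, converter)

    def convert(self, path: str) -> str:
        for converter in self._converters:
            result = converter(path)
            if result is not None:
                return result
        raise ValueError(f"no converter accepted {path!r}")


def available_engines(candidates: list[str]) -> list[str]:
    """Second option: report which top-level packages are importable."""
    return [name for name in candidates
            if importlib.util.find_spec(name) is not None]


registry = ConverterRegistry()
registry.register(lambda p: f"[default] {p}" if p.endswith(".pdf") else None)
registry.register(lambda p: f"[override] {p}" if p.endswith(".pdf") else None)
```

With this layout, an external caller only has to register its own converter after the built-in one to override it, whereas dependency detection changes behaviour silently depending on what happens to be installed.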

afourney avatar Dec 19 '24 17:12 afourney

I have added a parameter, pdf_engine, to let the user choose the engine. For example,

from markitdown import MarkItDown

markitdown = MarkItDown()
source = "https://arxiv.org/pdf/2308.08155v2.pdf"
markitdown.convert(source, pdf_engine="pymupdf4llm")  # use pymupdf4llm
markitdown.convert(source, pdf_engine="pdfminer")  # use pdfminer
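
One way such a pdf_engine parameter could be dispatched internally is a name-to-function table. This is only a sketch under assumptions: convert_pdf, PDF_ENGINES, the stand-in backend functions, and the pdfminer default are all hypothetical, and the PR's actual implementation may differ.

```python
from typing import Callable


# Stand-ins for the real backends (hypothetical); a real implementation
# would call pdfminer or pymupdf4llm here instead of returning a label.
def _convert_with_pdfminer(path: str, **engine_kwargs) -> str:
    return f"pdfminer:{path}"


def _convert_with_pymupdf4llm(path: str, **engine_kwargs) -> str:
    return f"pymupdf4llm:{path}"


PDF_ENGINES: dict[str, Callable[..., str]] = {
    "pdfminer": _convert_with_pdfminer,
    "pymupdf4llm": _convert_with_pymupdf4llm,
}


def convert_pdf(path: str, pdf_engine: str = "pdfminer", **engine_kwargs) -> str:
    """Pick a backend by name; the default engine is an assumption."""
    try:
        engine = PDF_ENGINES[pdf_engine]
    except KeyError:
        raise ValueError(
            f"unknown pdf_engine {pdf_engine!r}; "
            f"expected one of {sorted(PDF_ENGINES)}"
        ) from None
    # Extra keyword arguments are forwarded untouched to the chosen backend.
    return engine(path, **engine_kwargs)
```

Failing loudly on an unknown engine name keeps a typo from silently falling back to the default parser.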

tungsten106 avatar Dec 24 '24 07:12 tungsten106

@tungsten106 Thank you for your contribution! It looks great so far. Just one more thing: when running the tests, files are generated. Could you add the following to the .gitignore file inside tests/ and modify the test to write its output inside the out/ folder? Also, do you think there's any way to make the tests run faster?

out/

l-lumin avatar Dec 26 '24 06:12 l-lumin

Please answer https://github.com/microsoft/markitdown/pull/139#issuecomment-2553113646 if you have time.

l-lumin avatar Dec 26 '24 06:12 l-lumin

@tungsten106 please read the following Contributor License Agreement(CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement @microsoft-github-policy-service agree

tungsten106 avatar Dec 26 '24 07:12 tungsten106

I have updated that. For test speed, have you tried to use pytest-xdist to run test_markitdown.py in parallel?

pip install pytest-xdist

# let pytest-xdist pick the number of workers
pytest -n auto tests/test_markitdown.py
# or use a specific number of workers, e.g. 8
pytest -n 8 tests/test_markitdown.py

tungsten106 avatar Dec 26 '24 08:12 tungsten106

Thank you. You can run tests in parallel without using pytest-xdist; simply run hatch test -p. I want to discuss ways to improve the speed of test_markitdown_pdf

(hatch-test.py3.13) root@e2c718eb6604:/workspaces/markitdown# hatch test -p
========================================================================================================== test session starts ==========================================================================================================
platform linux -- Python 3.13.1, pytest-8.3.4, pluggy-1.5.0
rootdir: /workspaces/markitdown
configfile: pyproject.toml
plugins: rerunfailures-14.0, mock-3.14.0, anyio-4.7.0, xdist-3.6.1
8 workers [5 items]     
ss...               

l-lumin avatar Dec 26 '24 08:12 l-lumin

pymupdf4llm.to_markdown may be slow because of the package's internal processing. We could use a smaller test PDF, since the original article has 43 pages. Alternatively, passing pages=[i for i in range(10)] to pymupdf4llm, or page_numbers=[i for i in range(10)] to pdfminer, could be one solution.
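
Because the two engines use different keyword names for the same idea, a test could build the page-limit argument per engine. The helper below is a hypothetical sketch (page_limit_kwargs is not part of the PR); it only assumes, as stated above, that pymupdf4llm takes pages and pdfminer takes page_numbers, both as zero-based page indices.

```python
def page_limit_kwargs(engine: str, max_pages: int = 10) -> dict[str, list[int]]:
    """Return the keyword argument that restricts `engine` to the first pages."""
    pages = list(range(max_pages))  # zero-based indices 0 .. max_pages - 1
    # Each engine names the same option differently.
    keyword = {"pymupdf4llm": "pages", "pdfminer": "page_numbers"}[engine]
    return {keyword: pages}


# The resulting dict can be splatted into the engine call, e.g. **kwargs.
kwargs = page_limit_kwargs("pymupdf4llm")
```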

tungsten106 avatar Dec 26 '24 09:12 tungsten106

Works great. I just tested it: pages=range(10) is enough, and the same goes for page_numbers. To run a single test, you don't need to comment out the other tests like this:

    # test_markitdown_remote()
    # test_markitdown_local()
    # test_markitdown_exiftool()
    # test_markitdown_deprecation()
    # test_markitdown_llm()

Instead, you can run the command below:

hatch test tests/test_markitdown.py::test_markitdown_pdf

l-lumin avatar Dec 26 '24 09:12 l-lumin

Maybe add a CLI option?

alphaleadership avatar Dec 26 '24 17:12 alphaleadership

I have added a CLI option, --engine, to choose the converter engine. You can test it with the following command:

python -m markitdown tests/test_files/2308.08155v2.pdf --engine "pymupdf4llm" -o document.md

Other engine_kwargs might not be suitable for the CLI, since they involve arbitrary data types.
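
For readers unfamiliar with how such a flag is typically wired up, here is a minimal argparse sketch. It is not the PR's code: the choices list, the pdfminer default, and the --output long form are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the command shown above."""
    parser = argparse.ArgumentParser(prog="markitdown")
    parser.add_argument("filename", help="path of the file to convert")
    parser.add_argument("-o", "--output",
                        help="write the Markdown to this file instead of stdout")
    # Restricting choices rejects unknown engine names at parse time.
    parser.add_argument("--engine",
                        choices=["pdfminer", "pymupdf4llm"],
                        default="pdfminer",
                        help="PDF parsing engine to use")
    return parser


args = build_parser().parse_args(
    ["tests/test_files/2308.08155v2.pdf", "--engine", "pymupdf4llm", "-o", "document.md"]
)
```

A plain string flag like this maps cleanly onto the command line, which is why free-form engine_kwargs (lists, nested options) are harder to expose the same way.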

@afourney I would really appreciate it if you could review this when you have time. Also, thanks to @l-lumin and @alphaleadership for their very valuable advice.

tungsten106 avatar Jan 07 '25 07:01 tungsten106

Yes, this is perfect. Let's ship 🚢

markthepixel avatar Jan 14 '25 23:01 markthepixel

It has been brought to my attention that pymupdf4llm is licensed under the AGPL, which makes it incompatible with our MIT license. So I'll have to close this PR, unfortunately.

If you would like to use pymupdf4llm for PDFs, I recommend implementing it as a 3rd party plugin. See: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin

afourney avatar Mar 28 '25 18:03 afourney