update: change pdf text parser to pymupdf4llm
Use pymupdf4llm instead of pdfminer to parse PDF contents into Markdown, as suggested by #131.
Pros and Cons:
- pdfminer extracts text only, so the generated files have no headings, titles, etc. pymupdf4llm, however, can produce proper Markdown features, including different levels of headings, code blocks, and images (which could be saved to a specific path, but that is not included in this commit).
- However, pymupdf4llm may easily emit lines of digits that belong to plots, and create non-existent tables. This is a common problem for most PDF parsers, except those using OCR models (such as marker and MinerU).
maybe this can patch #142
I think it's better to let the user choose the engine rather than replacing it
I agree. There are pros and cons to each. The main thing is to allow a common interface.
Can you propose an interface for this? One option is to just call register_page_converter externally, and the precedence logic would give precedence to whichever converter is registered later. (see here for example: https://github.com/microsoft/markitdown/blob/925c4499f72757abcf6cb521ee10e4844967af3d/src/markitdown/_markitdown.py#L1269C1-L1287C1)
Another option would be to see which dependencies are installed (though this is more opaque)
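To make the first option concrete, here is a minimal sketch of "later registration wins" precedence. It is a toy registry, not markitdown's actual implementation: the ConverterRegistry class, the Converter type, and the two stand-in converter functions are all hypothetical, and only the name register_page_converter is taken from the comment above.

```python
from typing import Callable, List, Optional

# Hypothetical converter type: takes a file path and returns Markdown,
# or None if the converter does not handle that file.
Converter = Callable[[str], Optional[str]]


class ConverterRegistry:
    """Toy registry (an assumption for illustration): newest converter is tried first."""

    def __init__(self) -> None:
        self._converters: List[Converter] = []

    def register_page_converter(self, converter: Converter) -> None:
        # Insert at the front so later registrations take precedence.
        self._converters.insert(0, converter)

    def convert(self, path: str) -> str:
        for converter in self._converters:
            result = converter(path)
            if result is not None:
                return result
        raise ValueError(f"No converter accepted {path!r}")


def pdfminer_like(path: str) -> Optional[str]:
    # Stand-in for a plain-text PDF converter (does not call pdfminer).
    return "plain text" if path.endswith(".pdf") else None


def pymupdf4llm_like(path: str) -> Optional[str]:
    # Stand-in for a Markdown-producing PDF converter (does not call pymupdf4llm).
    return "# markdown" if path.endswith(".pdf") else None


registry = ConverterRegistry()
registry.register_page_converter(pdfminer_like)      # built-in default
registry.register_page_converter(pymupdf4llm_like)   # registered later, so tried first
print(registry.convert("paper.pdf"))
```

With this design a user opts into another engine simply by registering it after the defaults, at the cost of the choice being implicit in registration order.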
I have added a pdf_engine parameter to let the user choose the engine. For example:
from markitdown import MarkItDown
markitdown = MarkItDown()
source = "https://arxiv.org/pdf/2308.08155v2.pdf"
markitdown.convert(source, pdf_engine="pymupdf4llm") # use pymupdf4llm
markitdown.convert(source, pdf_engine="pdfminer") # use pdfminer
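For readers who want to see how such a parameter might be wired up, here is a rough sketch of dispatching on an engine name. It is not the code from this PR: the convert_pdf function, the PDF_ENGINES table, and the "pdfminer" default are assumptions for illustration. The two library calls it wraps, pdfminer.high_level.extract_text and pymupdf4llm.to_markdown, are real, and each is imported lazily so only the selected engine needs to be installed.

```python
from typing import Callable, Dict


def _convert_with_pdfminer(path: str) -> str:
    # Lazy import: pdfminer.six is only required if this engine is selected.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def _convert_with_pymupdf4llm(path: str) -> str:
    # Lazy import: pymupdf4llm is only required if this engine is selected.
    import pymupdf4llm
    return pymupdf4llm.to_markdown(path)


# Hypothetical lookup table mapping engine names to converter functions.
PDF_ENGINES: Dict[str, Callable[[str], str]] = {
    "pdfminer": _convert_with_pdfminer,
    "pymupdf4llm": _convert_with_pymupdf4llm,
}


def convert_pdf(path: str, pdf_engine: str = "pdfminer") -> str:
    """Convert a local PDF with the chosen engine (default is an assumption)."""
    try:
        engine = PDF_ENGINES[pdf_engine]
    except KeyError:
        known = ", ".join(sorted(PDF_ENGINES))
        raise ValueError(
            f"Unknown pdf_engine {pdf_engine!r}; expected one of: {known}"
        ) from None
    return engine(path)
```

An explicit name keeps the choice visible at the call site, and an unknown engine fails fast before any file is read.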
@tungsten106
Thank you for your contribution! It looks great so far.
Just one more thing: when running the tests, files are generated. Could you add the following to the .gitignore file inside tests/ and modify the tests to write their output to the out/ folder?
out/
Also, do you think there's any way to make the tests run faster?
Please answer https://github.com/microsoft/markitdown/pull/139#issuecomment-2553113646 if you have time.
@tungsten106 please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.
@microsoft-github-policy-service agree [company="{your company}"]
Options:
- (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
- (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"
Contributor License Agreement
@microsoft-github-policy-service agree
I have updated that.
For test speed, have you tried to use pytest-xdist to run test_markitdown.py in parallel?
pip install pytest-xdist
# let pytest-xdist pick the number of workers
pytest -n auto tests/test_markitdown.py
# or use a specific number of workers, e.g. 8
pytest -n 8 tests/test_markitdown.py
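Thank you. You can run the tests in parallel without installing pytest-xdist yourself; simply run hatch test -p (the hatch test environment already includes xdist, as the plugin list in the output shows). I want to discuss ways to improve the speed of test_markitdown_pdf.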
(hatch-test.py3.13) root@e2c718eb6604:/workspaces/markitdown# hatch test -p
========================================================================================================== test session starts ==========================================================================================================
platform linux -- Python 3.13.1, pytest-8.3.4, pluggy-1.5.0
rootdir: /workspaces/markitdown
configfile: pyproject.toml
plugins: rerunfailures-14.0, mock-3.14.0, anyio-4.7.0, xdist-3.6.1
8 workers [5 items]
ss...
pymupdf4llm.to_markdown may be slow because of the package's internal processing.
We could use a smaller test PDF, since the original article has 43 pages. One solution would be to pass pages=[i for i in range(10)] to pymupdf4llm, or page_numbers=[i for i in range(10)] to pdfminer.
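As a rough sketch of that idea, the helper below builds the page-limiting keyword argument for each engine. The first_pages_kwargs function is hypothetical and not part of this PR; the keyword names pages (pymupdf4llm.to_markdown) and page_numbers (pdfminer.high_level.extract_text) are the ones mentioned above, and both take zero-based page indices.

```python
from typing import Any, Dict


def first_pages_kwargs(pdf_engine: str, n_pages: int = 10) -> Dict[str, Any]:
    """Return engine-specific kwargs restricting conversion to the first n_pages.

    Hypothetical helper for illustration only.
    """
    pages = list(range(n_pages))  # zero-based page indices: 0 .. n_pages - 1
    if pdf_engine == "pymupdf4llm":
        return {"pages": pages}
    if pdf_engine == "pdfminer":
        return {"page_numbers": pages}
    raise ValueError(f"Unknown pdf_engine {pdf_engine!r}")


# Possible usage in a test, assuming the libraries are installed:
#   pymupdf4llm.to_markdown(path, **first_pages_kwargs("pymupdf4llm"))
#   pdfminer.high_level.extract_text(path, **first_pages_kwargs("pdfminer"))
print(first_pages_kwargs("pymupdf4llm", n_pages=3))
```

Any assertions in the test would then need to target strings that appear within the first ten pages of the document.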
Works great. I just tested it, and pages=range(10) is enough; the same goes for page_numbers. To run a single test, you can use the hatch test command at the end of this comment instead of commenting out the other tests like this:
# test_markitdown_remote()
# test_markitdown_local()
# test_markitdown_exiftool()
# test_markitdown_deprecation()
# test_markitdown_llm()
hatch test tests/test_markitdown.py::test_markitdown_pdf
Maybe add a CLI option.
I have added a CLI option, --engine, to choose the converter engine. You can test it with the following command:
python -m markitdown tests/test_files/2308.08155v2.pdf --engine "pymupdf4llm" -o document.md
The other engine_kwargs may not be suitable for the CLI, since they involve arbitrary data types.
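For illustration, a flag like this could be declared with argparse roughly as follows. This is a sketch and not markitdown's actual CLI code: the build_parser function, the choices list, and the "pdfminer" default are assumptions; only the --engine and -o flags and the example file path come from the command above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real markitdown CLI may be structured differently.
    parser = argparse.ArgumentParser(prog="markitdown")
    parser.add_argument("filename", help="path of the file to convert")
    parser.add_argument(
        "--engine",
        choices=["pdfminer", "pymupdf4llm"],
        default="pdfminer",  # assumed default
        help="PDF conversion engine to use",
    )
    parser.add_argument("-o", "--output", help="write the Markdown to this file")
    return parser


args = build_parser().parse_args(
    ["tests/test_files/2308.08155v2.pdf", "--engine", "pymupdf4llm", "-o", "document.md"]
)
print(args.engine, args.output)
```

Restricting --engine to a fixed list of string choices also shows why free-form engine_kwargs are awkward on the command line: argparse would have to parse arbitrary values from strings.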
@afourney I would really appreciate it if you could review this when you have time. Also, thanks to @l-lumin and @alphaleadership for their valuable advice.
Yes, this is perfect. Let's ship 🚢
It has been brought to my attention that pymupdf4llm is licensed under the AGPL, which makes it incompatible with our MIT license. So, unfortunately, I'll have to close this PR.
If you would like to use pymupdf4llm for PDFs, I recommend implementing it as a 3rd party plugin. See: https://github.com/microsoft/markitdown/tree/main/packages/markitdown-sample-plugin