Docling Produces Unreadable Text Output for PDFs
Note: issue here is similar to #185
Bug
I am trying to convert several PDFs of academic papers, books, etc. For some PDFs, docling produces gibberish when converting them to Markdown. You can find two samples here.
Short of the conversion working successfully, is there a way to identify problematic PDFs? This would allow me to skip them, set them aside, or try OcrOptions.force_full_page_ocr if that helps.
Steps to reproduce
Using the example code from GitHub
from docling.document_converter import DocumentConverter
source = "one.pdf" # document per local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())
For one.pdf, the output looks like this:
GLYPH<28>GLYPH<27>GLYPH<26> GLYPH<25>GLYPH<24>GLYPH<28>GLYPH<23>GLYPH<22>GLYPH<21>GLYPH<20> GLYPH<25>GLYPH<19>GLYPH<20>
For two.pdf, the output looks like this:
2-8[ 5O@QQ[=LLGQ[J<Z[=@[MTO>D<Q@?[<R[QM@>F<H[NT<KRERZ[?FQ>LTKRQ[BLP[=TQEK@QQ[LO[Q<I@Q[ MOLJLRELK<I[ TQ@glyph<c=19,font=/AAAAAH+Fd3270>[ *LP[ FKBLOJ<RELKglyph<c=9,font=/AAAAAH+Fd3270>[ MH@<Q@[ @J<EH[ QM@>E<I;Q<H@Q%JERMO@QQglyph<c=21,font=/AAAAAH+Fd3270>JERglyph<c=19,font=/AAAAAH+Fd3270>@?T[ LO[ XOER@[ RL[ 7M@>F<H[ 7<H@Q[ )@M<PRJ@KRglyph<c=9,font=/AAAAAH+Fd3270>[ 8D@[ 2-8[ 5O@QQglyph<c=9,font=/AAAAAH+Fd3270>[ glyph<c=29,font=/AAAAAH+Fd3270>glyph<c=29,font=/AAAAAH+Fd3270>[ ,<ZX<O?[ 7RO@@Rglyph<c=8,font=/AAAAAH+Fd3270>[ (<J=PF?C@glyph<c=10,font=/AAAAAH+Fd3270>[ 2&[glyph<c=23,font=/AAAAAH+Fd3270>glyph<c=26,font=/AAAAAH+Fd3270>glyph<c=24,font=/AAAAAH+Fd3270>glyph<c=28,font=/AAAAAH+Fd3270>glyph<c=26,font=/AAAAAH+Fd3270>glyph<c=22,font=/AAAAAH+Fd3270>[
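One rough way to flag outputs like the two above is to look for the placeholder markers and for an unusually low share of letters. The sketch below is a heuristic, not a docling feature: the name `looks_garbled` and both thresholds are assumptions for illustration and would need tuning on real documents, since tables and formulas may trip the letter-ratio check.

```python
import re

# Placeholder markers as they appear in the one.pdf and two.pdf outputs above.
GLYPH_MARKER = re.compile(r"GLYPH<\d+>|glyph<c=\d+,font=[^>]*>")


def looks_garbled(markdown: str,
                  max_markers_per_word: float = 0.05,
                  min_text_ratio: float = 0.8) -> bool:
    """Heuristically flag exported text that is probably unreadable.

    Both thresholds are illustrative guesses, not values from docling.
    """
    words = markdown.split()
    if not words:
        return False  # nothing to judge

    # Signal 1: many unresolved glyph placeholders relative to the word count.
    markers = GLYPH_MARKER.findall(markdown)
    if len(markers) / len(words) > max_markers_per_word:
        return True

    # Signal 2: after dropping placeholders, too few letters and spaces.
    remaining = GLYPH_MARKER.sub("", markdown)
    if not remaining.strip():
        return True
    texty = sum(ch.isalpha() or ch.isspace() for ch in remaining)
    return texty / len(remaining) < min_text_ratio
```

In practice this could be called on `result.document.export_to_markdown()`, with flagged files routed to a retry or set aside for manual review.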
Docling version
Docling version: 2.20.0
Docling Core version: 2.17.2
Docling IBM Models version: 3.3.1
Docling Parse version: 3.3.0
Python: cpython-313 (3.13.1)
Platform: macOS-15.3.1-arm64-arm-64bit-Mach-O
Python version
Python 3.13.1
@josk0
TL;DR: try installing the following versions of the docling dependencies:
docling==2.16.0
docling-core==2.15.1
docling-ibm-models==3.3.0
docling-parse==3.1.2
Hello! We had exactly the same problem, though I can't say I understand what causes it on the library side.
It started when we noticed incorrect content in our docling documents, which broke the main flow of our app. After some tests we realised that the issue does not always reproduce, depending on the hardware. While testing docling-serve, however, I found that the service gave correct results. After several hours of testing and trying to work out what was wrong with the converter configuration, I downgraded the versions used in our project to those installed in docling-serve, and that helped. I hope this helps you resolve your problem, at least temporarily.
@Fan4ik20 Thanks for the suggestion. I tried it, but the results were the same. I also tried the latest version. In short, I reproduced the problem on my end with both of the following sets of versions:
Docling version: 2.21.0
Docling Core version: 2.18.1
Docling IBM Models version: 3.3.2
Docling Parse version: 3.3.1
and
Docling version: 2.16.0
Docling Core version: 2.15.1
Docling IBM Models version: 3.3.0
Docling Parse version: 3.1.2
@josk0 This might be a problem with docling-parse. I will investigate.
PS: for some reason, when I click on your link, I am not able to download the files. It would be a great help if you could upload them straight into the issue.
@josk0 No need to apologize, thanks so much for the issue and examples so we can fix these issues!
@josk0 First observations
- one.pdf:
I think this is resolved in this PR (https://github.com/DS4SD/docling-parse/pull/101). If you run,
poetry run python ./docling_parse/visualize.py -i /Users/taa/Downloads/one.pdf -p 1 -l error -c line --interactive --log-text
you get the following output,
(433.17, 019.59) (444.04, 019.59) (444.04, 027.06) (433.17, 027.06) /T1_0 331
(040.15, 020.83) (108.28, 020.83) (108.28, 026.77) (040.15, 026.77) /T1_1 Philos Phenomenol Res.
(108.28, 019.99) (168.17, 019.99) (168.17, 026.83) (108.28, 026.83) /T1_2 2022;105:331-361.
(309.82, 019.99) (411.40, 019.99) (411.40, 026.83) (309.82, 026.83) /T1_2 wileyonlinelibrary.com/journal/phpr
(040.54, 027.71) (110.42, 027.71) (110.42, 033.65) (040.54, 033.65) /T1_3 Philos Phenomenol Res.
(110.42, 026.88) (152.81, 026.88) (152.81, 033.71) (110.42, 033.71) /T1_4 2021;00:1-31.
(242.39, 026.88) (245.89, 026.88) (245.89, 033.71) (242.39, 033.71) /T1_4
(422.53, 026.88) (426.03, 026.88) (426.03, 033.71) (422.53, 033.71) /T1_4
(431.29, 022.98) (432.69, 022.98) (432.69, 038.02) (431.29, 038.02) /T1_4 |
(432.69, 026.88) (434.44, 026.88) (434.44, 033.71) (432.69, 033.71) /T1_4
(440.04, 026.60) (443.54, 026.60) (443.54, 033.81) (440.04, 033.81) /T1_5 1
(322.19, 026.88) (423.77, 026.88) (423.77, 033.71) (322.19, 033.71) /T1_4 wileyonlinelibrary.com/journal/phpr
(040.53, 654.79) (116.61, 654.79) (116.61, 661.62) (040.53, 661.62) /T1_4 DOI: 10.1111/phpr.12823
(040.54, 626.86) (155.15, 626.86) (155.15, 636.12) (040.54, 636.12) /T1_5 ORIGINAL ARTICLE
(040.54, 576.67) (263.03, 576.67) (263.03, 595.21) (040.54, 595.21) /T1_5 Transparency is Surveillance
(040.54, 539.80) (115.26, 539.80) (115.26, 552.15) (040.54, 552.15) /T1_5 C. Thi Nguyen
(040.54, 044.37) (200.07, 044.37) (200.07, 051.20) (040.54, 051.20) /T1_4 © 2021 Philosophy and Phenomenological Research, Inc
(040.54, 506.52) (100.25, 506.52) (100.25, 514.34) (040.54, 514.34) /T1_4 University of Utah
(040.54, 484.21) (096.45, 484.21) (096.45, 492.45) (040.54, 492.45) /T1_5 Correspondence
(040.54, 473.52) (153.85, 473.52) (153.85, 481.34) (040.54, 481.34) /T1_4 C. Thi Nguyen, University of Utah.
(040.54, 462.52) (138.08, 462.52) (138.08, 470.34) (040.54, 470.34) /T1_4 Email: [email protected]
(201.16, 499.00) (238.37, 499.00) (238.37, 509.30) (201.16, 509.30) /T1_5 Abstract
(201.16, 484.39) (228.09, 484.39) (228.09, 494.16) (201.16, 494.16) /T1_4 In her
(228.93, 485.59) (347.83, 485.59) (347.83, 494.07) (228.93, 494.07) /T1_3 BBC Reith Lectures on Trust
(347.83, 484.39) (441.69, 484.39) (441.69, 494.16) (347.83, 494.16) /T1_4 , Onora O'Neill offers
(201.16, 469.39) (441.69, 469.39) (441.69, 479.16) (201.16, 479.16) /T1_4 a short, but biting, criticism of transparency. People think
(201.16, 454.39) (441.71, 454.39) (441.71, 464.16) (201.16, 464.16) /T1_4 that trust and transparency go together but in reality, says
(201.16, 439.39) (441.71, 439.39) (441.71, 449.16) (201.16, 449.16) /T1_4 O'Neill, they are deeply opposed. Transparency forces
(201.16, 424.39) (439.22, 424.39) (439.22, 434.16) (201.16, 434.16) /T1_4 people to conceal their actual reasons for action and in-
(201.16, 409.39) (441.69, 409.39) (441.69, 419.16) (201.16, 419.16) /T1_4 vent different ones for public consumption. Transparency
(201.16, 394.39) (441.67, 394.39) (441.67, 404.16) (201.16, 404.16) /T1_4 forces deception. I work out the details of her argument and
(201.16, 379.39) (441.69, 379.39) (441.69, 389.16) (201.16, 389.16) /T1_4 worsen her conclusion. I focus on public transparency - that
(201.16, 364.39) (441.70, 364.39) (441.70, 374.16) (201.16, 374.16) /T1_4 is, transparency to the public over expert domains. I offer
(201.16, 349.39) (361.58, 349.39) (361.58, 359.16) (201.16, 359.16) /T1_4 two versions of the criticism. First, the
(362.16, 350.59) (439.17, 350.59) (439.17, 359.07) (362.16, 359.07) /T1_3 epistemic intrusion
(439.17, 349.39) (441.67, 349.39) (441.67, 359.16) (439.17, 359.16) /T1_4
(201.16, 334.39) (439.18, 334.39) (439.18, 344.16) (201.16, 344.16) /T1_4 argument: The drive to transparency forces experts to ex-
(201.16, 319.39) (441.70, 319.39) (441.70, 329.16) (201.16, 329.16) /T1_4 plain their reasoning to non- experts. But expert reasons are,
(201.16, 304.39) (441.68, 304.39) (441.68, 314.16) (201.16, 314.16) /T1_4 by their nature, often inaccessible to non- experts. So the
(201.16, 289.39) (441.69, 289.39) (441.69, 299.16) (201.16, 299.16) /T1_4 demand for transparency can pressure experts to act only
(201.16, 274.39) (441.69, 274.39) (441.69, 284.16) (201.16, 284.16) /T1_4 in those ways for which they can offer public justification.
(201.16, 259.39) (251.50, 259.39) (251.50, 269.16) (201.16, 269.16) /T1_4 Second, the
(252.68, 260.59) (320.27, 260.59) (320.27, 269.07) (252.68, 269.07) /T1_3 intimate reasons
(320.27, 259.39) (441.70, 259.39) (441.70, 269.16) (320.27, 269.16) /T1_4 argument: In many cases of
(201.16, 244.40) (441.69, 244.40) (441.69, 254.16) (201.16, 254.16) /T1_4 practical deliberation, the relevant reasons are intimate to
(201.16, 229.40) (441.69, 229.40) (441.69, 239.16) (201.16, 239.16) /T1_4 a community and not easily explicable to those who lack
(201.16, 214.40) (439.20, 214.40) (439.20, 224.16) (201.16, 224.16) /T1_4 a particular shared background. The demand for transpar-
(201.16, 199.40) (441.69, 199.40) (441.69, 209.16) (201.16, 209.16) /T1_4 ency, then, pressures community members to abandon the
(201.16, 184.40) (441.69, 184.40) (441.69, 194.16) (201.16, 194.16) /T1_4 special understanding and sensitivity that arises from their
(201.16, 169.40) (441.67, 169.40) (441.67, 179.16) (201.16, 179.16) /T1_4 particular experiences. Transparency, it turns out, is a form
(201.16, 154.40) (441.66, 154.40) (441.66, 164.16) (201.16, 164.16) /T1_4 of surveillance. By forcing reasoning into the explicit and
(201.16, 139.40) (441.69, 139.40) (441.69, 149.16) (201.16, 149.16) /T1_4 public sphere, transparency roots out corruption - but it
(201.16, 124.39) (439.17, 124.39) (439.17, 134.16) (201.16, 134.16) /T1_4 also inhibits the full application of expert skill, sensitiv-
(201.16, 109.39) (441.69, 109.39) (441.69, 119.16) (201.16, 119.16) /T1_4 ity, and subtle shared understandings. The difficulty here
(201.16, 094.39) (439.19, 094.39) (439.19, 104.16) (201.16, 104.16) /T1_4 arises from the basic fact that human knowledge vastly out-
(201.16, 079.39) (441.70, 079.39) (441.70, 089.16) (201.16, 089.16) /T1_4 strips any individual's capacities. We all depend on experts,
- For two.pdf, we do indeed have a nasty problem. I hope we can resolve it soon, but my suspicion is that it comes from the built-in OCR of the scanner. Will keep you posted.
Thanks. I haven't used poetry before. If you'd prefer me to confirm that I get this output on my end as well, let me know. Happy to figure it out.
On two.pdf:
- I may have more of where this came from (and potentially other problematic PDFs). Let me know if it would help if I sampled more.
- If there is an idea for a workaround to identify the problematic PDFs so that I can set them aside, let me know.
The more the better, but for now, let me see if there is an "easy" fix. I will be out next week, but feel free to bug my colleagues for updates on when this PR (https://github.com/DS4SD/docling-parse/pull/101) will be merged (at least it solves 1 problem).
Even with that PR, I still have problems with the one.pdf. I tried it again with the latest version (your PR was merged for the latest docling-parse release) and the error persists.
Docling Version
Docling version: 2.23.0
Docling Core version: 2.19.1
Docling IBM Models version: 3.3.2
Docling Parse version: 3.4.0
Python: cpython-313 (3.13.1)
As before, the abstract extracts well, but then comes a wall of GLYPH markers.
Fun fact: when I use the macOS Preview app to remove pages from one.pdf, the resulting "shortened" version converts OK on both the old and the new version of docling(-parse).
About more PDFs for testing: I have about seven PDFs of different origins (and maybe 20-30 more that I need to look into) that all cause some kind of problem. I am happy to share them directly, but because of copyright issues I would prefer not to upload them publicly.
I tried it and it seems to work well with the PyPdfiumDocumentBackend. Did you try it with that?
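from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
pipeline_options = PdfPipelineOptions()  # default options; adjust as needed
# Use pypdfium2 instead of the default PDF backend
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options, backend=PyPdfiumDocumentBackend)
    }
)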
Thanks. Yes with the PyPdfiumDocumentBackend some files, including the one.pdf from the sample, are converted correctly!
Others are still problematic. The output is different but similarly unreadable.
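Since different files respond to different backends, one option is to try several conversion strategies in order and keep the first output that does not look garbled. The sketch below is an assumption-laden illustration, not something confirmed in this thread: `convert_with_fallback`, `has_glyph_markers`, and `docling_strategies` are hypothetical names, the marker check is deliberately minimal, and nobody here has verified that forced full-page OCR actually fixes these particular files.

```python
from typing import Callable, Iterable, Optional, Tuple

# A "strategy" is a (name, callable) pair; the callable maps a PDF path to Markdown.
Strategy = Tuple[str, Callable[[str], str]]


def has_glyph_markers(text: str) -> bool:
    """Minimal check for the placeholder markers reported in this issue."""
    return "GLYPH<" in text or "glyph<c=" in text


def convert_with_fallback(
    source: str,
    strategies: Iterable[Strategy],
    is_garbled: Callable[[str], bool] = has_glyph_markers,
) -> Tuple[Optional[str], Optional[str]]:
    """Try each strategy in order; return (name, markdown) of the first clean result.

    Returns (None, None) if every strategy fails or yields garbled text,
    so the caller can set the file aside.
    """
    for name, convert in strategies:
        try:
            markdown = convert(source)
        except Exception:
            continue  # treat a crash like a failed attempt
        if not is_garbled(markdown):
            return name, markdown
    return None, None


def docling_strategies() -> list:
    """Build docling-based strategies (imports are local so docling stays optional)."""
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import EasyOcrOptions, PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    def run(converter):
        return lambda src: converter.convert(src).document.export_to_markdown()

    default = DocumentConverter()
    pdfium = DocumentConverter(format_options={
        InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)})
    # Ignore any embedded text layer and OCR every page in full.
    ocr_opts = PdfPipelineOptions(do_ocr=True)
    ocr_opts.ocr_options = EasyOcrOptions(force_full_page_ocr=True)
    full_ocr = DocumentConverter(format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=ocr_opts)})
    return [("default", run(default)), ("pypdfium", run(pdfium)), ("full_ocr", run(full_ocr))]
```

A call such as `convert_with_fallback("one.pdf", docling_strategies())` would then report which strategy produced readable text, or `(None, None)` for files that need manual attention. Ordering the strategies from cheapest to most expensive keeps full-page OCR as a last resort.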
OK. I am also working on a platform and planning to use docling as the backend for document parsing. I am wondering how such a variety of files and content can best be handled.
How long should converting and exporting to Markdown take, and do you have any strategy for speeding it up? On my system it takes around 2 minutes to convert a 9-page PDF, whereas other OCR packages take only a few milliseconds.
@vishaldasnewtide You can find some reference numbers here: https://arxiv.org/pdf/2501.17887. We also break down the different (optional) steps in the pipeline. For example, we suggest deactivating OCR when it is not needed, since it often costs roughly a 3x factor in runtime.
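To see how much OCR contributes on a given machine, the conversion can be timed with OCR switched on and off. This is a minimal sketch: `timed` and `make_converter` are hypothetical helper names, and the file name in the commented usage is a placeholder. The `do_ocr` flag on `PdfPipelineOptions` is the switch assumed here for deactivating OCR.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def make_converter(do_ocr: bool):
    """Build a docling PDF converter with OCR switched on or off."""
    # Local imports so the timing helper works without docling installed.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    options = PdfPipelineOptions(do_ocr=do_ocr)
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )


# Example comparison (the file name is a placeholder):
# for do_ocr in (True, False):
#     converter = make_converter(do_ocr)
#     _, seconds = timed(converter.convert, "nine_pages.pdf")
#     print(f"do_ocr={do_ocr}: {seconds:.1f}s")
```

The first conversion in a process typically includes model loading, so timing a second run of the same file gives a fairer comparison.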
@josk0 I confirm "Docling Parse version: 3.4.0" is the latest one with quite some fixes. If this is not yet solving your issues, we will try to address in the next round. Feel free to provide more problematic docs.
@nikhildigde Just making you aware of the docling-serve project, where we are aggregating the multiple approaches to exposing Docling as a service. It might be useful for your use case. More features, such as async processing, are coming soon.
@dolfim-ibm Yes, we are already using docling-serve as an HTTP layer for docling.