Pdf structured error

Open builuuquynh opened this issue 2 months ago • 1 comments

Bug

Parsing error on a page that has a special layout:

  • Incorrect sequence header parsing.

...

Steps to reproduce

Upload the PDF and parse it with the default options.

test_docling_1.pdf

Image

...

Docling version

docling: 2.63, docling serve: 1.9.0 ...

Python version

3.12

Output

Image

builuuquynh, Nov 27 '25 04:11

Docling's PDF parsing can struggle with special layouts and sequence header extraction because of limitations in its layout model and postprocessing heuristics. Complex tables, repeated sections, and headers/footers are especially challenging: tables are sometimes misparsed as images or text blocks, and headers/footers may not be reliably extracted from PDFs (unlike DOCX files, which have better support) [source].

There are a few approaches that can help:

  • For tables or complex regions, you can configure Docling to treat tables as images and inject image references into markdown output by customizing the markdown serializer. This avoids parsing errors and keeps the document structure cleaner [source].
  • If you need to improve header/footer or section header recognition, there are currently no robust configuration options for PDFs. These elements rely on the layout model's predictions and postprocessing heuristics, which may not handle all edge cases. Debug visualization options (like --debug-visualize-layout in the CLI) can help you inspect how clusters and headers are detected.
  • For persistent issues with sequence header parsing, post-processing the output (e.g., re-parenting misattached headers in the JSON or markdown) is sometimes necessary until upstream model improvements are released.

If you have a specific layout or header/section structure that consistently fails, sharing annotated examples with the Docling team can help guide future model updates. For now, configuration tweaks and output post-processing are the main workarounds for PDFs with special layouts.
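
As a rough illustration of the output post-processing workaround, here is a minimal sketch in plain Python that re-levels markdown headings from their dotted section numbers (for example, "1.1" becomes one level deeper than "1."). It assumes the misparsed headers still carry their numbering in the exported markdown, which may not be true for every document. The helper name `relevel_numbered_headings` and its `base_level` parameter are hypothetical and not part of Docling.

```python
import re

# ATX heading whose text starts with dotted numbering, e.g. "## 2.1 Scope".
# Group 1: original '#' marks, group 2: full heading text, group 3: the number.
_NUMBERED_HEADING = re.compile(r"^(#{1,6})\s+((\d+(?:\.\d+)*)\.?\s+.*)$")


def relevel_numbered_headings(markdown: str, base_level: int = 2) -> str:
    """Hypothetical helper: derive heading depth from dotted section numbers.

    "1." maps to base_level, "1.1" to base_level + 1, and so on (capped at 6).
    Headings without a leading number and lines inside code fences are kept.
    """
    out = []
    in_fence = False
    for line in markdown.splitlines():
        # Toggle on every fence line so headings inside code blocks are skipped.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else _NUMBERED_HEADING.match(line)
        if match:
            text, number = match.group(2), match.group(3)
            depth = number.count(".") + 1
            level = min(base_level + depth - 1, 6)
            line = f"{'#' * level} {text}"
        out.append(line)
    return "\n".join(out)
```

The function takes any markdown string, so it can be run on the markdown exported for a converted document before the text is stored or indexed. It is a heuristic: a heading such as "3.5 GHz band" would be treated as a second-level section, so the result is worth spot-checking on real documents.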


dosubot[bot], Nov 27 '25 05:11

This has been a blocker for our docling usage. Is there an ETA on this issue?

savanbthakkar, Dec 12 '25 20:12