PDF structure error
Bug
Parsing errors occur on pages that have a special layout:
- Incorrect sequence header parsing.
...
Steps to reproduce
Upload a PDF and parse it with the default options.
...
Docling version
docling: 2.63 docling serve: 1.9.0 ...
Python version
3.12
Output
Docling's PDF parsing can struggle with special layouts and sequence header extraction due to limitations in its layout model and postprocessing heuristics. Complex tables, repeated sections, and headers/footers are especially challenging—sometimes tables are misparsed as images or text blocks, and headers/footers may not be reliably extracted from PDFs (unlike DOCX files, which have better support) [source].
There are a few approaches that can help:
- For tables or complex regions, you can configure Docling to treat tables as images and inject image references into markdown output by customizing the markdown serializer. This avoids parsing errors and keeps the document structure cleaner [source].
- If you need to improve header/footer or section header recognition, there currently aren't robust configuration options for PDFs: these elements rely on the layout model's predictions and postprocessing heuristics, which may not handle all edge cases. Debug visualization options (like --debug-visualize-layout in the CLI) can help you inspect how clusters and headers are detected.
- For persistent issues with sequence header parsing, post-processing the output (e.g., re-parenting misattached headers in the JSON or markdown) is sometimes necessary until upstream model improvements are released.
If you have a specific layout or header/section structure that consistently fails, sharing annotated examples with the Docling team can help guide future model updates. For now, configuration tweaks and output post-processing are the main workarounds for PDFs with special layouts.
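As a rough illustration of the output post-processing workaround, here is a minimal sketch in plain Python that works on exported Markdown text. It assumes that "sequence headers" means numbered section headers (such as "1.2 Scope") and that they arrive at the wrong heading level; it then derives each level from the depth of the number. The function name relevel_numbered_headings and the base_level parameter are hypothetical and not part of Docling.

```python
import re

# Matches an ATX Markdown heading whose title begins with a dotted number,
# e.g. "## 2.1.3 Results" or "## 2. Methods".
_NUMBERED_HEADING = re.compile(r"^(#{1,6})\s+((\d+(?:\.\d+)*)\.?\s+.*)$")


def relevel_numbered_headings(markdown: str, base_level: int = 1) -> str:
    """Set each numbered heading's level from the depth of its number.

    Hypothetical helper (not a Docling API). Headings without a numeric
    prefix and lines inside fenced code blocks are left unchanged.
    """
    out = []
    in_fence = False
    for line in markdown.splitlines():
        # Track fenced code blocks so "# 1 ..." comments in code are skipped.
        if line.lstrip().startswith("```"):
            in_fence = not in_fence
        match = None if in_fence else _NUMBERED_HEADING.match(line)
        if match:
            # "2" has depth 1, "2.1" has depth 2, "2.1.3" has depth 3.
            depth = match.group(3).count(".") + 1
            level = min(base_level + depth - 1, 6)  # Markdown stops at h6
            line = "#" * level + " " + match.group(2)
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "## 1 Intro\n## 1.1 Scope\n## 1.1.1 Detail\n## Appendix"
    print(relevel_numbered_headings(sample))
```

This only helps when the headers were detected but attached at the wrong level; headers that the layout model missed entirely cannot be recovered this way. A title that merely starts with a number (for example "2024 Report") would also be treated as a top-level section, so the pattern may need tightening for a given document set.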
This has been a blocker for our docling usage. Is there an ETA on this issue?