feat: Integrate docling-hierarchical-pdf back into docling

Open krrome opened this issue 2 months ago • 2 comments

This is still a draft with limited functionality (and failing tests), opened to gauge whether my approach to the integration is in line with what the docling team expects. I will keep extending the PR to full functionality, but I would like to receive feedback on the integration as early as possible.

Changes:

  • The reading order model was extended to handle header hierarchies.
  • docling/models/header_hierarchy was added as the home for header-level inference.
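
As an illustration of what header-level inference can look like for numbered headings, here is a minimal sketch. It is not the code in this PR: the helper name `infer_level_from_numbering` and the regular expression are assumptions made for this example, and it only covers dotted decimal numbering such as "2.3.1".

```python
import re
from typing import Optional

# Hypothetical helper for illustration; not part of docling's API.
# Matches a leading dotted number ("1", "2.3", "2.3.1"), an optional
# trailing dot, and then at least one space before the heading text.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def infer_level_from_numbering(heading: str) -> Optional[int]:
    """Return the nesting level implied by a heading's number, or None."""
    match = _NUMBERING.match(heading)
    if match is None:
        return None  # unnumbered heading: level must come from another signal
    # "2.3.1" has two dots, so it sits at level 3.
    return match.group(1).count(".") + 1
```

A heading such as "2.3.1 Details" maps to level 3, while an unnumbered heading such as "Appendix" returns `None` and would need another signal (font size, a metadata TOC) to be placed in the hierarchy.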

Issue resolved by this Pull Request: Resolves #2591, #652, #287, #1023, #2121 and maybe more.

Checklist:

  • [ ] Documentation has been updated, if necessary.
  • [ ] Examples have been added, if necessary.
  • [x] Tests have been added, if necessary.

krrome avatar Nov 24 '25 19:11 krrome

DCO Check Passed

Thanks @krrome, all your commits are properly signed off. 🎉

github-actions[bot] avatar Nov 24 '25 19:11 github-actions[bot]

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • [ ] #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
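
To check a title locally against the pattern quoted in the rule above, a short sketch along these lines may help. The function name `is_conventional` is an assumption for this example, and the sketch assumes the rule behaves like a regular-expression search on the title.

```python
import re

# Pattern copied from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    return CONVENTIONAL_TITLE.search(title) is not None
```

The title of this pull request, "feat: Integrate docling-hierarchical-pdf back into docling", passes this check.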

mergify[bot] avatar Nov 24 '25 19:11 mergify[bot]

Related Documentation

Checked 5 published document(s) in 1 knowledge base(s). No updates required.

dosubot[bot] avatar Dec 05 '25 10:12 dosubot[bot]

Hi all,

Thank you for reviewing. From my point of view, this is no longer a draft. Unfortunately, the changes required are quite substantial.

I still have a few questions which I would like to ask before you are off reviewing in detail:

  1. Is this PR too big for your taste to be merged? Reason: I have invested quite a lot of time in this integration, so if you think it is too much, I'll stop the effort here and focus on something else :)
  2. Some automated tests are failing on GitHub because the test job runs out of disk space. How can I fix that?
  3. I haven't gotten around to adding metadata-TOC support for pdfium in my code yet.
  4. Using metadata-based TOCs as input: when headings are in all caps in the text but not in the TOC, they are not found. Do you want me to fix that?
  5. When inferring the header hierarchy from numbered headers, errors might occur in cases involving bigger sections, like here: https://github.com/docling-project/docling/pull/2676/files#r2592219081
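
For question 4, one possible approach is to normalize both sides before comparing, so that an all-caps heading in the text still matches a mixed-case TOC entry. The following is a minimal sketch under that assumption. The names `normalize_title` and `match_toc_entries` are hypothetical and do not come from this PR or from docling.

```python
import re
from typing import Dict, Iterable, Optional


def normalize_title(title: str) -> str:
    """Collapse whitespace and case-fold so 'INTRODUCTION' equals 'Introduction'."""
    return re.sub(r"\s+", " ", title).strip().casefold()


def match_toc_entries(
    toc_titles: Iterable[str], text_headings: Iterable[str]
) -> Dict[str, Optional[str]]:
    """Map each TOC title to the first text heading with the same normalized form.

    TOC titles without a matching heading map to None.
    """
    index: Dict[str, str] = {}
    for heading in text_headings:
        # setdefault keeps the first heading seen for each normalized form.
        index.setdefault(normalize_title(heading), heading)
    return {title: index.get(normalize_title(title)) for title in toc_titles}
```

With this normalization, a TOC entry "Related Work" would match the text heading "RELATED  WORK". Headings that differ in more than case and whitespace (for example, added numbering) would still need extra handling.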

Looking forward to your feedback.

Thanks, Roman

krrome avatar Dec 05 '25 11:12 krrome