feat: Integrate docling-hierarchical-pdf back into docling
This is still a draft with limited functionality (and failing tests). Its purpose is to gauge whether my approach to the integration is in line with what the docling team has in mind. I will keep extending the PR to full functionality, but I would like to receive feedback on the integration as early as possible.
Changes:
- The reading order model was extended to handle header hierarchies.
- `docling/models/header_hierarchy` was added as a home for header level inference.
Issue resolved by this Pull Request: Resolves #2591, #652, #287, #1023, #2121 and maybe more.
Checklist:
- [ ] Documentation has been updated, if necessary.
- [ ] Examples have been added, if necessary.
- [x] Tests have been added, if necessary.
✅ DCO Check Passed
Thanks @krrome, all your commits are properly signed off. 🎉
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
This rule is failing.
When test data is updated, we require two reviewers
- [ ] `#approved-reviews-by >= 2`
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X] `title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:`
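To see what the title rule accepts, the pattern above can be tried locally. This is a minimal sketch that assumes the `~=` operator behaves like a regular-expression search; the function name `is_conventional` is made up for illustration and is not part of any merge-protection tooling.

```python
import re

# The pattern is copied verbatim from the merge-protection rule above.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional(title: str) -> bool:
    """Return True if the title starts with a conventional-commit prefix."""
    # Assumption: `~=` is treated as a regex search; the leading `^`
    # anchors the pattern to the start of the title either way.
    return CONVENTIONAL_TITLE.search(title) is not None
```

With this sketch, the title of this PR passes, as does a scoped or breaking-change prefix such as `feat(models)!:`, while a title without one of the listed types does not.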
Related Documentation
Checked 5 published document(s) in 1 knowledge base(s). No updates required.
Hi all,
Thank you for reviewing. From my point of view this is no longer a draft. Unfortunately, the changes required are quite substantial.
I still have a few questions which I would like to ask before you start reviewing in detail:
- Is this PR too big to be merged? I ask because I have invested quite a lot of time in this integration, so if you think it is too much, I'll stop here and focus on something else :)
- Some automated tests are failing on GitHub because the test job runs out of disk space. How can I fix that?
- I haven't gotten around to adding metadata-TOC support for pdfium in my code yet.
- Using metadata-based TOCs as input: when headings are all caps in the text but not in the TOC, they are not found. Do you want me to fix that?
- When inferring the header hierarchy from numbered headers, errors might occur with bigger sections, like here: https://github.com/docling-project/docling/pull/2676/files#r2592219081
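To make the last two points concrete, here is a minimal sketch of the two ideas involved. It is not the code in this PR: the helper names `level_from_numbering`, `normalize` and `find_toc_level` are hypothetical, and the TOC is assumed to be a plain mapping from title to level.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: a dotted number such as "2", "2.1" or "3.2.1.",
# followed by whitespace and the heading text.
_NUMBERING = re.compile(r"^\s*(\d+(?:\.\d+)*)\.?\s+\S")


def level_from_numbering(heading: str) -> Optional[int]:
    """Infer the level from the numbering depth; None if unnumbered."""
    match = _NUMBERING.match(heading)
    if match is None:
        return None
    # "2" has depth 1, "2.1" has depth 2, "3.2.1" has depth 3.
    return match.group(1).count(".") + 1


def normalize(text: str) -> str:
    """Collapse whitespace and ignore case for comparison."""
    return " ".join(text.split()).casefold()


def find_toc_level(heading: str, toc: Dict[str, int]) -> Optional[int]:
    """Look up a heading in a title -> level mapping, ignoring case."""
    lookup = {normalize(title): level for title, level in toc.items()}
    return lookup.get(normalize(heading))
```

In this sketch, `find_toc_level("INTRODUCTION", {"Introduction": 1})` finds the entry despite the all-caps text. The numbering heuristic is deliberately naive: a heading that merely starts with a number, such as "2024 Annual Report", is read as a level-1 section, which is one way such inference can go wrong.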
Looking forward to your feedback.
Thanks, Roman
Codecov Report
❌ Patch coverage is 90.24823% with 55 lines in your changes missing coverage. Please review.