feat: Add PPTX notes slides
Presenter notes are a valuable part of a PowerPoint presentation and are worth extracting. Docling uses the python-pptx library to parse PowerPoint .pptx files, and python-pptx supports reading presenter notes, which are stored as notes slides.
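For reviewers unfamiliar with notes slides, here is a minimal sketch of how presenter notes can be read with python-pptx. It is not the code in this PR: the function names `collect_presenter_notes` and `load_presenter_notes` are hypothetical, and only the python-pptx attributes `has_notes_slide`, `notes_slide`, and `notes_text_frame` are taken from the library.

```python
def collect_presenter_notes(presentation):
    """Return {slide_number: notes_text} for slides with non-empty notes.

    `presentation` is expected to behave like a python-pptx Presentation:
    it exposes `.slides`, and each slide exposes `.has_notes_slide` and
    `.notes_slide`. (Hypothetical helper, for illustration only.)
    """
    notes = {}
    for number, slide in enumerate(presentation.slides, start=1):
        # Check has_notes_slide first: accessing notes_slide on a slide
        # without one would create an empty notes slide as a side effect.
        if not slide.has_notes_slide:
            continue
        text_frame = slide.notes_slide.notes_text_frame
        # notes_text_frame is None when the notes slide has no notes placeholder.
        if text_frame is None:
            continue
        text = text_frame.text.strip()
        if text:
            notes[number] = text
    return notes


def load_presenter_notes(path):
    """Open a .pptx file and collect its presenter notes."""
    # Imported lazily so the helper above can be exercised without python-pptx.
    from pptx import Presentation

    return collect_presenter_notes(Presentation(path))
```

Calling `load_presenter_notes("talk.pptx")` (the file name is a placeholder) would return a dictionary keyed by 1-based slide number, skipping slides whose notes are missing or blank.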
Issue resolved by this Pull Request: Resolves #473
Checklist:
- [ ] Documentation has been updated, if necessary.
- [ ] Examples have been added, if necessary.
- [X] Tests have been added, if necessary.
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🔴 Require two reviewer for test updates
This rule is failing.
When test data is updated, we require two reviewers
- [ ] #approved-reviews-by >= 2
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
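To check a title locally before pushing, the title rule above can be approximated in Python. This is a sketch that assumes `~=` performs an unanchored regular-expression search; the helper name `is_conventional_title` is hypothetical and not part of any tooling used here.

```python
import re

# Pattern copied verbatim from the merge-protection rule.
CONVENTIONAL_TITLE = re.compile(
    r"^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:"
)


def is_conventional_title(title: str) -> bool:
    """Return True if the PR title starts with a conventional-commit prefix."""
    # Assumption: the rule's `~=` behaves like re.search; the leading `^`
    # in the pattern still pins the match to the start of the title.
    return CONVENTIONAL_TITLE.search(title) is not None
```

The title of this PR, `feat: Add PPTX notes slides`, passes this check, as would a scoped or breaking-change form such as `fix(pptx)!: ...`.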
@maciejwie I like your PR a lot, I just think we need a way to distinguish between the regular text and the (invisible) text/notes.
My proposal is to first merge this (https://github.com/DS4SD/docling-core/pull/80) and then update this PR to explicitly tag it as InvisibleTextItem with the correct label.
Hi @PeterStaar-IBM, sounds good. I'm subscribed to that PR and see there's still some discussion about it, and when it gets merged I will update this one.
@maciejwie I want to move fast on this. We are currently merging a few big performance improvements (10x faster PDF parsing, improved layout postprocessing, and GPU acceleration). Once that is done, we will update this PR together with another one: basically, we would like to add the author notes as part of the furniture (yes, this is the correct term: https://en.wikipedia.org/wiki/Page_layout, I was also surprised).
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
🟢 Require two reviewer for test updates
Wonderful, this rule succeeded.
When test data is updated, we require two reviewers
- [X] #approved-reviews-by >= 2