feat: Add PPTX notes slides

Open maciejwie opened this issue 1 year ago • 4 comments

Presenter notes are a valuable part of a PowerPoint presentation and are worth extracting. Docling uses the python-pptx library to parse PowerPoint pptx files; the library supports reading presenter notes, which are stored as notes slides.
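For illustration, here is a minimal sketch of reading presenter notes with python-pptx. It is not the code in this PR: the helper names `iter_slide_notes` and `load_notes` are hypothetical, and the sketch only assumes the python-pptx attributes `has_notes_slide`, `notes_slide`, and `notes_text_frame`.

```python
from typing import Iterator, Tuple


def iter_slide_notes(presentation) -> Iterator[Tuple[int, str]]:
    """Yield (1-based slide number, notes text) for slides with non-empty notes.

    `presentation` is expected to behave like a python-pptx Presentation.
    The function name is hypothetical and used only for this sketch.
    """
    for number, slide in enumerate(presentation.slides, start=1):
        # Check has_notes_slide first: accessing slide.notes_slide on a slide
        # without notes would create an empty notes slide as a side effect.
        if not slide.has_notes_slide:
            continue
        text_frame = slide.notes_slide.notes_text_frame
        # notes_text_frame is None when the notes slide has no notes placeholder.
        if text_frame is None:
            continue
        text = text_frame.text.strip()
        if text:
            yield number, text


def load_notes(path: str):
    """Open a .pptx file and return its notes as a list (hypothetical helper)."""
    # Imported lazily so the iterator above can be used without python-pptx.
    from pptx import Presentation

    return list(iter_slide_notes(Presentation(path)))
```

Calling `load_notes("talk.pptx")` on a real file (the filename is a placeholder) would return a list of `(slide_number, notes_text)` pairs, skipping slides whose notes are missing or blank.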

Issue resolved by this Pull Request: Resolves #473

Checklist:

  • [ ] Documentation has been updated, if necessary.
  • [ ] Examples have been added, if necessary.
  • [X] Tests have been added, if necessary.

maciejwie avatar Nov 29 '24 21:11 maciejwie

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🔴 Require two reviewer for test updates

This rule is failing.

When test data is updated, we require two reviewers

  • [ ] #approved-reviews-by >= 2

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

mergify[bot] avatar Nov 29 '24 21:11 mergify[bot]

@maciejwie I like your PR a lot, I just think we need an ability to distinguish between the regular text and the (invisible) text/notes.

My proposal is to first merge this (https://github.com/DS4SD/docling-core/pull/80) and then update this PR to explicitly tag it as InvisibleTextItem with the correct label.

PeterStaar-IBM avatar Dec 01 '24 05:12 PeterStaar-IBM

Hi @PeterStaar-IBM, sounds good. I'm subscribed to that PR and see there's still some discussion about it. Once it gets merged, I will update this one.

maciejwie avatar Dec 09 '24 21:12 maciejwie

@maciejwie I want to move fast on this. We are currently merging a few big performance improvements (10x faster PDF parsing, improved layout postprocessing, and GPU acceleration). Once that is done, we will update this PR together with another one: basically, we would like to add the author notes as part of the furniture (yes, this is the correct term: https://en.wikipedia.org/wiki/Page_layout, I was also surprised).

PeterStaar-IBM avatar Dec 10 '24 07:12 PeterStaar-IBM

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • [X] #approved-reviews-by >= 2

mergify[bot] avatar Mar 02 '25 19:03 mergify[bot]