docling icon indicating copy to clipboard operation
docling copied to clipboard

fix(msword_backend): Identify text in the same line after an image #1425

Open mkrssg opened this issue 11 months ago • 9 comments

Fix: When there is a image detected in a Word Doc, also texts in the same line after the image should be detected

Resolves #1425

mkrssg avatar May 19 '25 09:05 mkrssg

Merge Protections

Your pull request matches the following merge protections and will not be merged until they are valid.

🟢 Enforce conventional commit

Wonderful, this rule succeeded.

Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/

  • [X] title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:

🟢 Require two reviewer for test updates

Wonderful, this rule succeeded.

When test data is updated, we require two reviewers

  • [X] #approved-reviews-by >= 2

mergify[bot] avatar May 19 '25 09:05 mergify[bot]

@mkrssg Thank you so much for the fix! Can you please:

  1. resolve the conflicts
  2. make sure you do the DCO

PeterStaar-IBM avatar May 19 '25 12:05 PeterStaar-IBM

@mkrssg Many thanks for creating a fix. Could you please provide a sample word doc and a before/after? Ideally, a test case with a sample doc would cover this change. The one posted in #1425 appears to have floating images next to the text, I don't fully understand how it relates to this?

cau-git avatar May 19 '25 13:05 cau-git

@cau-git So the issue in #1425 relates to the anchor of an image Screenshot 2025-05-22 at 09 37 43

Docling will output this:

**Transcript**

February 20, 2025, 8:32PM

<!-- image -->

<!-- image -->

<!-- image -->

after the fix the output is:


**Transcript**

February 20, 2025, 8:32PM

<!-- image -->

<!-- image -->

**BOS 24E.008 Huddle [Lounge] (4) - Cisco Large** 0:08
Correct, he is not.

<!-- image -->

**BOS 24E.008 Huddle [Lounge] (4) - Cisco Large** 0:16
Yeah, exactly.

(there are three images in the example)

Not sure how you want to handle the anchors.

mkrssg avatar May 22 '25 07:05 mkrssg

@mkrssg Many thanks for creating a fix. Could you please provide a sample word doc and a before/after? Ideally, a test case with a sample doc would cover this change. The one posted in #1425 appears to have floating images next to the text, I don't fully understand how it relates to this?

@cau-git Just added test case and a sample doc!

mkrssg avatar May 22 '25 17:05 mkrssg

@mkrssg Many thanks for the test case. To ensure the CI tests will pass, there's one more step to do. Since this fix changes the test ground truth for docx outputs, it must be regenerated in the following way:

# regenerate test data once
DOCLING_GEN_TEST_DATA=1 poetry run pytest ./tests/test_backend_msword.py -s -vv

# now commit the changed files in the tests/data/ dir.

cau-git avatar May 23 '25 08:05 cau-git

@mkrssg Many thanks for the test case. To ensure the CI tests will pass, there's one more step to do. Since this fix changes the test ground truth for docx outputs, it must be regenerated in the following way:

# regenerate test data once
DOCLING_GEN_TEST_DATA=1 poetry run pytest ./tests/test_backend_msword.py -s -vv

# now commit the changed files in the tests/data/ dir.

@cau-git Done that. How do you handle the conflicts in this case?

mkrssg avatar May 28 '25 15:05 mkrssg

@mkrssg It appears that a recent merge to main has created updated test files overlapping with yours. To rebase this, please update from main and resolve conflicts by accepting only your newly produced test files for the 3 affected ones.

cau-git avatar May 28 '25 17:05 cau-git

@mkrssg It appears that a recent merge to main has created updated test files overlapping with yours. To rebase this, please update from main and resolve conflicts by accepting only your newly produced test files for the 3 affected ones.

Done. Thank you @cau-git !

mkrssg avatar May 29 '25 13:05 mkrssg

@mkrssg your PR got into some issue with a test which got fixed in main. If you update your PR with the latest version we should be able to have the CI succeeding.

dolfim-ibm avatar Jun 05 '25 11:06 dolfim-ibm

@mkrssg your PR got into some issue with a test which got fixed in main. If you update your PR with the latest version we should be able to have the CI succeeding.

Done, hope it works now!

mkrssg avatar Jun 05 '25 20:06 mkrssg

@mkrssg It took a bit more than expected to get tests working again, there was a global issue affecting all PRs. We have since merged this PR. Could you rebase / update once more, please? Thanks.

cau-git avatar Jun 06 '25 08:06 cau-git

@mkrssg It took a bit more than expected to get tests working again, there was a global issue affecting all PRs. We have since merged this PR. Could you rebase / update once more, please? Thanks.

Updated again, @cau-git .

mkrssg avatar Jun 06 '25 18:06 mkrssg

@mkrssg I transferred your branch here to be able to make edits from my side, since I wasn't sure what is causing the repeated test issues. Now the test files are up to date, but several empty paragraphs are inserted where they didn't show up before. I think adding one more condition to discard completely empty paragraphs would make this work complete.

cau-git avatar Jun 10 '25 07:06 cau-git

@mkrssg I transferred your branch here to be able to make edits from my side, since I wasn't sure what is causing the repeated test issues. Now the test files are up to date, but several empty paragraphs are inserted where they didn't show up before. I think adding one more condition to discard completely empty paragraphs would make this work complete.

Working on it @cau-git . Can you explain how to recreate the issues?

mkrssg avatar Jun 16 '25 09:06 mkrssg

@cau-git @mkrssg this should resolve the issue with extraneous empty paragraphs for test_emf_docx.docx and word_sample.docx test files.

diff --git a/docling/backend/msword_backend.py b/docling/backend/msword_backend.py
index ee51e98..ec071ef 100644
--- a/docling/backend/msword_backend.py
+++ b/docling/backend/msword_backend.py
@@ -257,7 +257,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
                 # Check for Text after the Image
                 if (
                     tag_name in ["p"]
-                    or element.find(".//w:p", namespaces=namespaces) is not None
+                    and element.find(".//w:t", namespaces=namespaces) is not None
                 ):
                     self._handle_text_elements(element, docx_obj, doc)
             # Check for the sdt containers, like table of contents

ksemianov avatar Jun 18 '25 10:06 ksemianov

DCO Check Passed

Thanks @mkrssg, all your commits are properly signed off. 🎉

github-actions[bot] avatar Jun 19 '25 19:06 github-actions[bot]

@cau-git @mkrssg this should resolve the issue with extraneous empty paragraphs for test_emf_docx.docx and word_sample.docx test files.

diff --git a/docling/backend/msword_backend.py b/docling/backend/msword_backend.py
index ee51e98..ec071ef 100644
--- a/docling/backend/msword_backend.py
+++ b/docling/backend/msword_backend.py
@@ -257,7 +257,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
                 # Check for Text after the Image
                 if (
                     tag_name in ["p"]
-                    or element.find(".//w:p", namespaces=namespaces) is not None
+                    and element.find(".//w:t", namespaces=namespaces) is not None
                 ):
                     self._handle_text_elements(element, docx_obj, doc)
             # Check for the sdt containers, like table of contents

Great, thank you! Integrated your suggestion.

mkrssg avatar Jun 19 '25 19:06 mkrssg

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

:loudspeaker: Thoughts on this report? Let us know!

codecov[bot] avatar Jun 20 '25 06:06 codecov[bot]