fix(msword_backend): Identify text in the same line after an image #1425
Fix: When there is a image detected in a Word Doc, also texts in the same line after the image should be detected
Resolves #1425
Merge Protections
Your pull request matches the following merge protections and will not be merged until they are valid.
🟢 Enforce conventional commit
Wonderful, this rule succeeded.
Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
- [X]
title ~= ^(fix|feat|docs|style|refactor|perf|test|build|ci|chore|revert)(?:\(.+\))?(!)?:
🟢 Require two reviewer for test updates
Wonderful, this rule succeeded.
When test data is updated, we require two reviewers
- [X]
#approved-reviews-by >= 2
@mkrssg Thank you so much for the fix! Can you please:
- resolve the conflicts
- make sure you do the DCO
@mkrssg Many thanks for creating a fix. Could you please provide a sample word doc and a before/after? Ideally, a test case with a sample doc would cover this change. The one posted in #1425 appears to have floating images next to the text, I don't fully understand how it relates to this?
@cau-git So the issue in #1425 relates to the anchor of an image
Docling will output this:
**Transcript**
February 20, 2025, 8:32PM
<!-- image -->
<!-- image -->
<!-- image -->
after the fix the output is:
**Transcript**
February 20, 2025, 8:32PM
<!-- image -->
<!-- image -->
**BOS 24E.008 Huddle [Lounge] (4) - Cisco Large** 0:08
Correct, he is not.
<!-- image -->
**BOS 24E.008 Huddle [Lounge] (4) - Cisco Large** 0:16
Yeah, exactly.
(there are three images in the example)
Not sure how you want to handle the anchors.
@mkrssg Many thanks for creating a fix. Could you please provide a sample word doc and a before/after? Ideally, a test case with a sample doc would cover this change. The one posted in #1425 appears to have floating images next to the text, I don't fully understand how it relates to this?
@cau-git Just added test case and a sample doc!
@mkrssg Many thanks for the test case. To ensure the CI tests will pass, there's one more step to do. Since this fix changes the test ground truth for docx outputs, it must be regenerated in the following way:
# regenerate test data once
DOCLING_GEN_TEST_DATA=1 poetry run pytest ./tests/test_backend_msword.py -s -vv
# now commit the changed files in the tests/data/ dir.
@mkrssg Many thanks for the test case. To ensure the CI tests will pass, there's one more step to do. Since this fix changes the test ground truth for docx outputs, it must be regenerated in the following way:
# regenerate test data once DOCLING_GEN_TEST_DATA=1 poetry run pytest ./tests/test_backend_msword.py -s -vv # now commit the changed files in the tests/data/ dir.
@cau-git Done that. How do you handle the conflicts in this case?
@mkrssg It appears that a recent merge to main has created updated test files overlapping with yours. To rebase this, please update from main and resolve conflicts by accepting only your newly produced test files for the 3 affected ones.
@mkrssg It appears that a recent merge to
mainhas created updated test files overlapping with yours. To rebase this, please update frommainand resolve conflicts by accepting only your newly produced test files for the 3 affected ones.
Done. Thank you @cau-git !
@mkrssg your PR got into some issue with a test which got fixed in main. If you update your PR with the latest version we should be able to have the CI succeeding.
@mkrssg your PR got into some issue with a test which got fixed in
main. If you update your PR with the latest version we should be able to have the CI succeeding.
Done, hope it works now!
@mkrssg It took a bit more than expected to get tests working again, there was a global issue affecting all PRs. We have since merged this PR. Could you rebase / update once more, please? Thanks.
@mkrssg It took a bit more than expected to get tests working again, there was a global issue affecting all PRs. We have since merged this PR. Could you rebase / update once more, please? Thanks.
Updated again, @cau-git .
@mkrssg I transferred your branch here to be able to make edits from my side, since I wasn't sure what is causing the repeated test issues. Now the test files are up to date, but several empty paragraphs are inserted where they didn't show up before. I think adding one more condition to discard completely empty paragraphs would make this work complete.
@mkrssg I transferred your branch here to be able to make edits from my side, since I wasn't sure what is causing the repeated test issues. Now the test files are up to date, but several empty paragraphs are inserted where they didn't show up before. I think adding one more condition to discard completely empty paragraphs would make this work complete.
Working on it @cau-git . Can you explain how to recreate the issues?
@cau-git @mkrssg this should resolve the issue with extraneous empty paragraphs for test_emf_docx.docx and word_sample.docx test files.
diff --git a/docling/backend/msword_backend.py b/docling/backend/msword_backend.py
index ee51e98..ec071ef 100644
--- a/docling/backend/msword_backend.py
+++ b/docling/backend/msword_backend.py
@@ -257,7 +257,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend):
# Check for Text after the Image
if (
tag_name in ["p"]
- or element.find(".//w:p", namespaces=namespaces) is not None
+ and element.find(".//w:t", namespaces=namespaces) is not None
):
self._handle_text_elements(element, docx_obj, doc)
# Check for the sdt containers, like table of contents
✅ DCO Check Passed
Thanks @mkrssg, all your commits are properly signed off. 🎉
@cau-git @mkrssg this should resolve the issue with extraneous empty paragraphs for
test_emf_docx.docxandword_sample.docxtest files.diff --git a/docling/backend/msword_backend.py b/docling/backend/msword_backend.py index ee51e98..ec071ef 100644 --- a/docling/backend/msword_backend.py +++ b/docling/backend/msword_backend.py @@ -257,7 +257,7 @@ class MsWordDocumentBackend(DeclarativeDocumentBackend): # Check for Text after the Image if ( tag_name in ["p"] - or element.find(".//w:p", namespaces=namespaces) is not None + and element.find(".//w:t", namespaces=namespaces) is not None ): self._handle_text_elements(element, docx_obj, doc) # Check for the sdt containers, like table of contents
Great, thank you! Integrated your suggestion.
Codecov Report
All modified and coverable lines are covered by tests :white_check_mark:
:loudspeaker: Thoughts on this report? Let us know!