Fails to extract .emf image from the .docx document

Open · sand1k opened this issue 2 years ago · 1 comment

Tested on the 'Test Summary.docx' file from the following dataset: https://catalog.data.gov/dataset/test-report-for-detonation-velocity-measurements-f22cf

Only 4 of the 5 images are extracted. The image that is not extracted has an .emf extension.

Mar 15 '23 15:03 sand1k

It looks like python-docx2txt only extracts common raster formats such as .png and .jpg. To handle unsupported formats like .emf, one workaround is to pull the images out with python-docx and use an external tool (like ImageMagick) to convert .emf to .png. This assumes the `magick` command is on your PATH and that your ImageMagick build can read EMF, which is not guaranteed on every platform. Here's an example:

    import subprocess
    from docx import Document

    doc = Document("yourfile.docx")
    for rel in doc.part.rels.values():
        # Skip hyperlinks and other external targets, which have no target_part.
        if rel.is_external or "image" not in rel.reltype:
            continue
        img_name = rel.target_ref.split("/")[-1]
        if not img_name.lower().endswith((".png", ".jpg", ".emf")):
            continue
        # Save the raw image bytes next to the script.
        with open(img_name, "wb") as img_file:
            img_file.write(rel.target_part.blob)
        # Convert .emf files to .png with ImageMagick.
        if img_name.lower().endswith(".emf"):
            png_name = img_name[:-len(".emf")] + ".png"
            subprocess.run(["magick", img_name, png_name], check=True)

This way, .emf files can be detected and converted automatically.
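If installing python-docx is not an option, the raw files can also be copied out with the standard library alone, since a .docx file is a ZIP archive. The sketch below assumes the images of the main document live under `word/media/`, which is the usual layout but not something this thread confirms for the test file. The function name `extract_media` is made up for illustration, and converting the resulting .emf to .png would still need a separate tool.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ out of a .docx, whatever its extension."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(docx_path) as archive:
        for member in archive.namelist():
            # Skip entries outside the media folder, and directory entries.
            if not member.startswith("word/media/") or member.endswith("/"):
                continue
            # Keep only the base name so nothing is written outside out_dir.
            target = out_dir / PurePosixPath(member).name
            target.write_bytes(archive.read(member))
            written.append(target)
    return written
```

Calling `extract_media("Test Summary.docx", "extracted_images")` and comparing the returned file names with what docx2txt wrote is a quick way to see which extensions were skipped.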

Oct 17 '24 13:10 cyy-2024