Fails to extract .emf image from the .docx document
Tested on the 'Test Summary.docx' file from the following dataset: https://catalog.data.gov/dataset/test-report-for-detonation-velocity-measurements-f22cf
Only 4 out of 5 images are extracted. The image that is not extracted has a .emf extension.
It looks like python-docx2txt only supports extracting .png and .jpg files. To handle unsupported formats like .emf, you can add a check in the code and use an external tool (like ImageMagick) to convert .emf to .png. Here's an example modification, which reads the image parts with python-docx:

    import subprocess
    from docx import Document

    doc = Document("yourfile.docx")
    for rel in doc.part.rels.values():
        # Skip external links; only embedded parts carry image bytes.
        if rel.is_external or "image" not in rel.target_ref:
            continue
        img_name = rel.target_ref.split("/")[-1]
        if not img_name.lower().endswith((".png", ".jpg", ".emf")):
            continue
        # Write the embedded image to disk unchanged.
        with open(img_name, "wb") as img_file:
            img_file.write(rel.target_part.blob)
        # Convert EMF to PNG with ImageMagick ("magick" must be on PATH).
        if img_name.lower().endswith(".emf"):
            png_name = img_name[:-4] + ".png"
            subprocess.run(["magick", img_name, png_name], check=True)

This way, .emf files can be detected and converted automatically. Note that this assumes the magick command is installed and that your ImageMagick build can read EMF, which may not be the case on every platform.
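If you only need the raw image files and want to avoid filtering by extension, another option is to read the .docx directly as a zip archive, since embedded images are stored under word/media/ inside it. The following is a minimal sketch using only the standard library. The function name extract_media and its parameters are my own, not part of docx2txt or python-docx, and it does no conversion, so the .emf file comes out as .emf.

```python
import zipfile
from pathlib import Path


def extract_media(docx_path, out_dir):
    """Copy every file under word/media/ out of a .docx archive.

    Hypothetical helper for illustration; returns the sorted list of
    file names that were written to out_dir.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    with zipfile.ZipFile(docx_path) as archive:
        for name in archive.namelist():
            # Embedded images live under word/media/ whatever their format.
            if not name.startswith("word/media/") or name.endswith("/"):
                continue
            target = out / Path(name).name
            target.write_bytes(archive.read(name))
            saved.append(target.name)
    return sorted(saved)
```

Calling extract_media("yourfile.docx", "images") should write all embedded media files, including any .emf, into the images folder. You can then run the ImageMagick conversion step on whichever of those files need it.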