Added Image Extraction and Storage
Hi,
I noticed that https://github.com/microsoft/markitdown/issues/1139 asks for image extraction from .docx files, and I ran into the same need myself. This PR adds support to the DocxConverter in markitdown for extracting embedded base64-encoded images from .docx documents and saving them as individual image files. The image paths in the generated Markdown are automatically updated to reference the saved files.
Changes
- Introduced _extract_and_save_images() method to:
  - Parse base64 images from the generated HTML.
  - Save images into an assets/{doc_name}/ folder using a SHA-256 hash as the filename.
  - Replace <img src="data:image/..."> with relative file paths like assets/doc_name/image_xxxx.png.
  - Auto-generate alt text if it's missing.
- Integrated image extraction into the .docx to Markdown conversion pipeline.
- Used existing conversion_name or sanitized stream filename to create a consistent image output directory.
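For readers who want to see the steps above in one place, here is a minimal standalone sketch. It is not the code from this PR: the function name extract_and_save_images, the root parameter, the regex-based parsing, and the choice to keep only the first 8 hex characters of the SHA-256 digest (guessed from names like image_a3f1c2d4.png) are all assumptions made for illustration.

```python
import base64
import hashlib
import re
from pathlib import Path

# Matches <img ...> tags whose src is a base64 data URI (assumed double-quoted).
_IMG_TAG = re.compile(
    r'<img\b[^>]*?\bsrc="data:image/(?P<ext>[a-zA-Z0-9.+-]+);base64,(?P<data>[^"]+)"[^>]*>',
    re.IGNORECASE,
)


def extract_and_save_images(html: str, doc_name: str, root: Path = Path(".")) -> str:
    """Hypothetical helper: save embedded images under assets/<doc_name>/ and rewrite src."""
    rel_dir = Path("assets") / doc_name
    (root / rel_dir).mkdir(parents=True, exist_ok=True)

    def _replace(match: re.Match) -> str:
        tag = match.group(0)
        try:
            raw = base64.b64decode(match.group("data"), validate=True)
        except ValueError:
            return tag  # leave images that cannot be decoded untouched

        ext = match.group("ext").lower()
        ext = {"jpeg": "jpg", "svg+xml": "svg"}.get(ext, ext)

        # Assumption: filename is "image_" plus the first 8 hex chars of the SHA-256.
        stem = "image_" + hashlib.sha256(raw).hexdigest()[:8]
        rel_path = (rel_dir / f"{stem}.{ext}").as_posix()
        (root / rel_path).write_bytes(raw)

        # Swap the data URI for the relative path, keeping other attributes.
        start = match.start("ext") - len("data:image/") - match.start()
        end = match.end("data") - match.start()
        new_tag = tag[:start] + rel_path + tag[end:]

        # Add alt text only when the attribute is absent.
        if not re.search(r"\balt\s*=", new_tag, re.IGNORECASE):
            new_tag = new_tag[:4] + f' alt="{stem}"' + new_tag[4:]
        return new_tag

    return _IMG_TAG.sub(_replace, html)
```

Hashing the decoded bytes means an image that appears several times in the document is written once and referenced by the same path each time. A real implementation would likely use an HTML parser rather than a regex, since the pattern here assumes double-quoted attributes and no ">" inside attribute values.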
Example Output Structure
assets/
└── my_doc/
    ├── image_a3f1c2d4.png
    └── image_b8e9f3a1.jpg
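The my_doc folder name in the tree above comes from the conversion name or a sanitized stream filename. As a rough sketch of what such sanitizing could look like (the function name sanitize_doc_name, the allowed character set, and the "document" fallback are hypothetical and not taken from this PR):

```python
import re
from pathlib import Path


def sanitize_doc_name(filename: str, fallback: str = "document") -> str:
    """Hypothetical helper: turn a stream filename into a safe folder name."""
    # Drop any directory components and the extension.
    stem = Path(filename).stem
    # Collapse every run of characters outside a conservative set into "_".
    safe = re.sub(r"[^A-Za-z0-9._-]+", "_", stem).strip("._")
    # Fall back to a fixed name if nothing usable remains.
    return safe or fallback


print(sanitize_doc_name("my doc.docx"))  # my_doc
```

Keeping only the stem also prevents a filename such as ../../etc/passwd from pointing the output directory outside assets/.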
@microsoft-github-policy-service agree company="individual"
need it
+1
need it
need it
need it
markitdown input.docx -o output.md
How should I use this so that the images are saved to a dedicated image folder?