markitdown Added Image Extraction and Storage

Hi,

I noticed that https://github.com/microsoft/markitdown/issues/1139 asks for image extraction from .docx files, and I ran into the same need myself. This PR adds support to the DocxConverter in markitdown for extracting the base64-encoded images embedded in a converted .docx document and saving them as individual image files. The image paths in the generated Markdown are updated automatically to reference the saved files.

Changes

Introduced _extract_and_save_images() method to:

Parse base64 images from the generated HTML.
Save images into an assets/{doc_name}/ folder, with filenames derived from a SHA-256 hash (for example, image_a3f1c2d4.png).
Replace <img src="data:image/..."> with relative file paths like assets/doc_name/image_xxxx.png.
Auto-generate alt text if it's missing.

Integrated image extraction into the .docx to Markdown conversion pipeline.
Used the existing conversion_name, or a sanitized stream filename when it is not available, to create a consistent image output directory.
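
To illustrate the steps listed above, here is a minimal standalone sketch and not the code from this PR. The function names extract_and_save_images and sanitize_name, the base_dir parameter, the regex-based parsing, and the choice of the first 8 hex characters of the SHA-256 digest are all assumptions made for illustration; the actual _extract_and_save_images() method may differ.

```python
import base64
import hashlib
import re
from pathlib import Path

# Matches <img> tags whose src is a base64 data URI (assumed HTML shape).
_DATA_URI_IMG = re.compile(
    r'<img\b[^>]*?\bsrc="(?P<src>data:image/(?P<ext>[A-Za-z0-9.+-]+);base64,'
    r'(?P<data>[^"]+))"[^>]*>'
)


def sanitize_name(name: str) -> str:
    """Hypothetical helper: turn a file name into a safe directory name."""
    stem = Path(name).stem
    return re.sub(r"[^A-Za-z0-9_-]+", "_", stem).strip("_") or "document"


def extract_and_save_images(html: str, doc_name: str, base_dir: Path) -> str:
    """Save data-URI images under base_dir/assets/<doc_name>/ and rewrite src."""
    safe_name = sanitize_name(doc_name)
    out_dir = base_dir / "assets" / safe_name

    def _replace(match: re.Match) -> str:
        ext = match.group("ext").lower()
        ext = {"jpeg": "jpg", "svg+xml": "svg"}.get(ext, ext)
        raw = base64.b64decode(match.group("data"))

        # Assumption: name the file after a short prefix of the content hash,
        # so identical images map to the same file.
        digest = hashlib.sha256(raw).hexdigest()[:8]
        filename = f"image_{digest}.{ext}"
        out_dir.mkdir(parents=True, exist_ok=True)
        (out_dir / filename).write_bytes(raw)

        # Swap only the src value, leaving other attributes untouched.
        tag = match.group(0)
        start = match.start("src") - match.start()
        end = match.end("src") - match.start()
        new_tag = tag[:start] + f"assets/{safe_name}/{filename}" + tag[end:]

        # Add alt text only when the tag has no alt attribute at all.
        if re.search(r'\balt="[^"]*"', new_tag) is None:
            new_tag = new_tag.replace("<img", f'<img alt="image_{digest}"', 1)
        return new_tag

    return _DATA_URI_IMG.sub(_replace, html)
```

Calling extract_and_save_images(html, "my_doc.docx", Path(".")) on HTML that contains data-URI images would write the files under ./assets/my_doc/ and return HTML whose img tags point at those relative paths. Hashing the decoded bytes means a repeated image is written only once, which is one plausible reason to use a hash for the filename.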

Example Output Structure

assets/
└── my_doc/
    ├── image_a3f1c2d4.png
    └── image_b8e9f3a1.jpg

Apr 30 '25 02:04 Noah-Zhuhaotian

@microsoft-github-policy-service agree company="individual"

Apr 30 '25 02:04 Noah-Zhuhaotian

need it

May 14 '25 06:05 naliazheli

+1

May 19 '25 23:05 wangerzi

need it

Jun 03 '25 07:06 jidaojiuyou

need it

Nov 04 '25 07:11 Jeandoom

need it

Dec 05 '25 02:12 WangJianQ-0118

markitdown input.docx -o output.md

How do I use this so that the images are saved to a dedicated image folder?

Dec 08 '25 05:12 WangJianQ-0118