
Added Image Extraction and Storage

Open Noah-Zhuhaotian opened this issue 9 months ago • 5 comments

Hi,

I noticed that https://github.com/microsoft/markitdown/issues/1139 asks for image extraction from .docx files, and I ran into the same need myself. This PR adds support to the DocxConverter in markitdown for extracting embedded base64-encoded images from .docx documents and saving them as individual image files. Image paths in the generated Markdown are automatically updated to reference the saved files.

Changes

  1. Introduced the _extract_and_save_images() method to:
     • Parse base64 images from the generated HTML.
     • Save images into an assets/{doc_name}/ folder using a SHA-256 hash as the filename.
     • Replace <img src="data:image/..."> with relative file paths like assets/doc_name/image_xxxx.png.
     • Auto-generate alt text if it's missing.
  2. Integrated image extraction into the .docx to Markdown conversion pipeline.
  3. Used the existing conversion_name or a sanitized stream filename to create a consistent image output directory.
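
For readers who want a feel for the steps listed above, here is a minimal standalone sketch. It is not the code from this PR: the function name `extract_and_save_images`, its parameters, the regex-based parsing, the 8-character truncation of the SHA-256 digest, and the `image N` alt text are all assumptions made for illustration.

```python
import base64
import hashlib
import re
from pathlib import Path

# Hypothetical patterns: a whole <img> tag, a base64 data URI src, an alt attribute.
IMG_TAG = re.compile(r"<img\b[^>]*>")
DATA_SRC = re.compile(r'src="data:image/([A-Za-z0-9.+-]+);base64,([^"]+)"')
ALT_ATTR = re.compile(r'\balt="([^"]*)"')


def extract_and_save_images(html: str, doc_name: str, root: Path) -> str:
    """Save inline base64 images under root/assets/doc_name and rewrite their src."""
    out_dir = root / "assets" / doc_name
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        tag = match.group(0)
        src = DATA_SRC.search(tag)
        if src is None:
            return tag  # not a data URI image, leave the tag untouched

        fmt, payload = src.groups()
        raw = base64.b64decode(payload)

        # Assumption: a truncated SHA-256 digest of the image bytes names the file.
        digest = hashlib.sha256(raw).hexdigest()[:8]
        ext = fmt.lower().split("+")[0]  # "svg+xml" becomes "svg"
        ext = "jpg" if ext == "jpeg" else ext

        out_dir.mkdir(parents=True, exist_ok=True)
        name = f"image_{digest}.{ext}"
        (out_dir / name).write_bytes(raw)

        counter += 1
        # Swap the data URI for a relative path to the saved file.
        tag = tag[: src.start()] + f'src="assets/{doc_name}/{name}"' + tag[src.end():]

        # Add alt text when it is missing or empty.
        alt = ALT_ATTR.search(tag)
        if alt is None:
            tag = tag[:4] + f' alt="image {counter}"' + tag[4:]
        elif not alt.group(1).strip():
            tag = tag[: alt.start()] + f'alt="image {counter}"' + tag[alt.end():]
        return tag

    return IMG_TAG.sub(replace, html)
```

Naming files by a hash of their content means an image embedded several times in one document is written only once, and re-running the conversion yields the same filenames.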

Example Output Structure

assets/
└── my_doc/
    ├── image_a3f1c2d4.png
    └── image_b8e9f3a1.jpg

Noah-Zhuhaotian avatar Apr 30 '25 02:04 Noah-Zhuhaotian

@microsoft-github-policy-service agree company="individual"

Noah-Zhuhaotian avatar Apr 30 '25 02:04 Noah-Zhuhaotian

need it

naliazheli avatar May 14 '25 06:05 naliazheli

+1

wangerzi avatar May 19 '25 23:05 wangerzi

need it

jidaojiuyou avatar Jun 03 '25 07:06 jidaojiuyou

need it

Jeandoom avatar Nov 04 '25 07:11 Jeandoom

need it

WangJianQ-0118 avatar Dec 05 '25 02:12 WangJianQ-0118

markitdown input.docx -o output.md

How do I use this so that the images are saved to a dedicated image folder?

WangJianQ-0118 avatar Dec 08 '25 05:12 WangJianQ-0118