Option to export markdown with references to images instead of embedding them as base64
It would be nice to have an option to export markdown files with images as references to files instead of embedding them in the document as base64. This might be similar to how the "figure export" example extracts and stores files.
I also noticed that after using result.document.export_to_markdown(), the images in the Markdown are completely removed.
I hope it can extract the images and save them in a folder called image.
This is a feature I am really interested in too.
I had a similar experience to @Tendo33: after PDF-to-Markdown conversion, the images are lost and replaced with the HTML comment <!-- image -->. Where are you seeing base64 images? I could easily add another step afterwards to remove them from the md and put them in separate files...
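For anyone who does get base64 images in the exported Markdown, the extra step mentioned above could look roughly like the sketch below. It uses only the Python standard library and does not touch docling. The function name `externalize_images`, the `image_NNN.<ext>` naming scheme, and the `image` link prefix are all hypothetical choices for illustration.

```python
import base64
import re
from pathlib import Path

# Matches Markdown images whose target is a base64 data URI, e.g.
# ![alt](data:image/png;base64,AAAA...)
DATA_URI_IMAGE = re.compile(
    r"!\[(?P<alt>[^\]]*)\]"
    r"\(data:image/(?P<fmt>[A-Za-z0-9.+-]+);base64,(?P<data>[A-Za-z0-9+/=\s]+)\)"
)


def externalize_images(markdown: str, image_dir: Path, link_prefix: str = "image") -> str:
    """Write each embedded base64 image to image_dir and return Markdown that links to the files."""
    image_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        # Use the MIME subtype as the file extension ("png", "jpeg", "svg+xml" -> "svg").
        ext = match.group("fmt").split("+")[0]
        name = f"image_{counter:03d}.{ext}"
        (image_dir / name).write_bytes(base64.b64decode(match.group("data")))
        # Keep the original alt text, point the link at the written file.
        return f"![{match.group('alt')}]({link_prefix}/{name})"

    return DATA_URI_IMAGE.sub(replace, markdown)
```

Typical use would be `Path("out.md").write_text(externalize_images(md, Path("image")))`, where `md` is the string returned by the export call. Placeholders such as <!-- image --> are left untouched, since they carry no image data to recover.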
@Tendo33 @cenit @uninstall-your-browser this feature is already supported: you can configure the pipeline to extract pictures and change the arguments passed to the export_to_markdown() method. Please refer to this post.
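As a rough sketch of what that configuration might look like: the import paths, the `images_scale` option, and the `image_mode` argument below are taken from my reading of the export_figures example and should be treated as assumptions that may differ between docling versions. The wrapper function `pdf_to_markdown_with_pictures` is a hypothetical name.

```python
from pathlib import Path


def pdf_to_markdown_with_pictures(pdf_path: Path, images_scale: float = 2.0) -> str:
    """Convert a PDF and return Markdown with the extracted pictures embedded."""
    # Imported lazily so this module can be loaded without docling installed.
    # Assumed import paths, based on the export_figures example.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling_core.types.doc import ImageRefMode

    # Ask the PDF pipeline to keep an image for every detected picture.
    pipeline_options = PdfPipelineOptions()
    pipeline_options.images_scale = images_scale
    pipeline_options.generate_picture_images = True

    converter = DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
    )
    result = converter.convert(pdf_path)

    # Without an image_mode argument the export appears to emit only a placeholder
    # comment; EMBEDDED inlines the pictures as base64 data URIs instead.
    return result.document.export_to_markdown(image_mode=ImageRefMode.EMBEDDED)
```

Note that this yields embedded base64 images, not file references, so it addresses the "images are removed" reports but not the original request.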
I was interested in them being references to files rather than embedded base64 blobs.
@uninstall-your-browser we will put file references for images on our TODO list and work on it soon.
Would this also be possible for all other formats that can contain images (e.g. PPTX, DOCX etc.)?
I tried the export_figures example and it worked well for PDF. I wanted to adapt it to PPTX, but according to pipeline_options.py there is only PdfPipelineOptions, which can set generate_picture_images.
My current workaround would be as follows:
- Convert the pptx to markdown with standard options
- Rename the provided pptx file to .zip and extract the corresponding images from /ppt/media
- Manually reference the images in the text
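The extraction step of this workaround can be scripted with the standard library alone. A minimal sketch follows; the function name `extract_pptx_media` and the flat output layout are hypothetical. Since a .pptx file is already a ZIP archive, it can be opened directly without renaming.

```python
import zipfile
from pathlib import Path, PurePosixPath


def extract_pptx_media(pptx_path: Path, out_dir: Path) -> list[Path]:
    """Copy every file directly under ppt/media/ in the PPTX archive into out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    with zipfile.ZipFile(pptx_path) as archive:
        for info in archive.infolist():
            member = PurePosixPath(info.filename)
            # Skip directories and anything outside ppt/media (slides, themes, ...).
            if info.is_dir() or member.parent != PurePosixPath("ppt/media"):
                continue
            target = out_dir / member.name  # flatten: keep only the file name
            target.write_bytes(archive.read(info))
            written.append(target)
    return sorted(written)
```

For example, `extract_pptx_media(Path("deck.pptx"), Path("image"))` would fill an `image` folder with the deck's media files. Mapping each file back to the right position in the Markdown still has to be done by hand, as in the last step of the list. For DOCX the same approach should work with `word/media` in place of `ppt/media`.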
Maybe this could be implemented into docling at least for pptx and docx.
@pwab indeed, we have to add this functionality to all other backends (apart from PDF), I'll look into it