Support Docx via PDF conversion

Open · dolfim-ibm opened this issue 1 year ago · 1 comment

Docx documents can easily be converted (or printed) to PDF.

The advantage of this process is that printing generates proper layout and visualization components such as pages, bounding boxes, etc. The disadvantage is that it is a bit slower than reading the native Docx format, and that all the semantic content (e.g. section headers) must be re-inferred.

Reading Docx via PDF conversion will be one of the possible ways of using Docx documents as input. See #105 for the native fast parsing.

dolfim-ibm, Sep 26 '24 09:09

Docx documents can easily be converted (or printed) to PDF.

To my knowledge, there are essentially two ways to convert docx to PDF. One is a tool like docx2pdf, which uses an installed MS Word application to do the conversion; this will not work on, e.g., Linux servers. The other is headless LibreOffice, but that is also quite a heavy dependency. Just out of curiosity, is there a better solution than these two?

valstu, Jan 24 '25 13:01
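
For reference, the headless LibreOffice route mentioned above could look roughly like the sketch below. It is only an illustration, not how docling does or will do it: the helper names `build_convert_command` and `docx_to_pdf` are made up for this example, and it assumes a LibreOffice binary (`soffice` or `libreoffice`) is installed and on the PATH. The resulting PDF could then be fed to the normal PDF pipeline.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(docx_path, out_dir, soffice="soffice"):
    """Build the LibreOffice command line for a headless docx -> PDF conversion.

    Hypothetical helper for this sketch; the flags themselves are standard
    LibreOffice command-line options.
    """
    return [
        soffice,
        "--headless",
        "--convert-to", "pdf",
        "--outdir", str(out_dir),
        str(docx_path),
    ]


def docx_to_pdf(docx_path, out_dir, timeout=120):
    """Convert one .docx file to PDF and return the path of the PDF.

    Assumes LibreOffice is installed; raises RuntimeError otherwise.
    """
    soffice = shutil.which("soffice") or shutil.which("libreoffice")
    if soffice is None:
        raise RuntimeError("LibreOffice binary not found on PATH")

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # check=True raises CalledProcessError on a non-zero exit code.
    subprocess.run(
        build_convert_command(docx_path, out_dir, soffice),
        check=True,
        timeout=timeout,
    )

    # LibreOffice names the output after the input file's stem.
    pdf_path = out_dir / (Path(docx_path).stem + ".pdf")
    if not pdf_path.exists():
        raise RuntimeError(f"Conversion produced no output for {docx_path}")
    return pdf_path
```

Usage would be something like `docx_to_pdf("report.docx", "converted")`. This does not remove the heavy dependency, it only wraps it; on a server it is usually easier to put LibreOffice in a container image than to install it on the host.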