Improve backend resolution logic
Requested feature
Document conversion currently contains logic for "guessing" (resolving) the backend to use for a given input (ref).
This logic has some limitations. For example, when working with streams it relies on the first 8 KB to detect the backend, which may or may not be enough for a correct detection (e.g. the deciding information could appear only at the end of a 10 KB stream).
Consider ways to remove these limitations.
One possible high-level approach to examine could be to:
- remove the current layer of "guessing" a backend a priori and then committing to that guess, and
- instead, keep for each format, e.g. XML, a list of backends to try one after another, until one successfully parses (can have a default list, parametrizable by the user).
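A minimal sketch of the "try each backend in turn" idea, assuming each backend is a plain callable that raises on input it cannot parse. The names `jats_backend`, `generic_xml_backend`, `DEFAULT_BACKENDS` and `parse_with_fallback` are hypothetical and not part of Docling's API:

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional

# Hypothetical backend signature: take raw bytes, return a result or raise.
Backend = Callable[[bytes], str]


class NoBackendSucceeded(Exception):
    """Raised when every candidate backend failed on the input."""


def jats_backend(data: bytes) -> str:
    # Illustrative stand-in: accepts only XML whose root element is <article>.
    root = ET.fromstring(data)
    if root.tag != "article":
        raise ValueError("root element is not <article>")
    return "jats"


def generic_xml_backend(data: bytes) -> str:
    # Illustrative stand-in: accepts any well-formed XML.
    return f"xml:{ET.fromstring(data).tag}"


# Default order per format; a caller can pass its own mapping instead.
DEFAULT_BACKENDS: Dict[str, List[Backend]] = {
    "xml": [jats_backend, generic_xml_backend],
}


def parse_with_fallback(
    fmt: str,
    data: bytes,
    backends: Optional[Dict[str, List[Backend]]] = None,
) -> str:
    """Try each backend registered for `fmt` until one parses `data`."""
    candidates = (backends or DEFAULT_BACKENDS).get(fmt, [])
    errors = []
    for backend in candidates:
        try:
            return backend(data)
        except Exception as exc:  # any failure means "try the next one"
            errors.append(f"{backend.__name__}: {exc}")
    raise NoBackendSucceeded(f"no backend parsed {fmt!r} input: {errors}")
```

The trade-off is that a failing backend costs a full parse attempt, so the order of the list matters for performance.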
It turns out the filetype library also loads only 8K bytes (ref), so this happens with file inputs as well.
As discovered in #542, some MS Office XML archives have the meta file [Content_Types].xml at the end, which is not captured by the 8K bytes signature.
One way of improving the logic could be:
- Detect if the file is a zip archive (here filetype should work)
- List all the files in there and check if [Content_Types].xml is present
- If it is, read it and infer the proper file type from it. Since zip archives allow random access, this could be more efficient than reading the whole file.
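The three steps above could look roughly like this. It is a sketch only: it uses the standard library's `zipfile.is_zipfile` instead of filetype for the first step, and `detect_ooxml_format` and `MAIN_PART_TYPES` are hypothetical names, not existing Docling code:

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import BinaryIO, Optional

# Namespace used by [Content_Types].xml in Office Open XML packages.
CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Content type of the main document part -> format label.
MAIN_PART_TYPES = {
    "application/vnd.openxmlformats-officedocument"
    ".wordprocessingml.document.main+xml": "docx",
    "application/vnd.openxmlformats-officedocument"
    ".presentationml.presentation.main+xml": "pptx",
    "application/vnd.openxmlformats-officedocument"
    ".spreadsheetml.sheet.main+xml": "xlsx",
}


def detect_ooxml_format(stream: BinaryIO) -> Optional[str]:
    """Return 'docx', 'pptx' or 'xlsx' for a seekable stream, else None."""
    # Step 1: is it a zip archive at all?
    if not zipfile.is_zipfile(stream):
        return None
    stream.seek(0)
    with zipfile.ZipFile(stream) as zf:
        # Step 2: the member list comes from the central directory, so the
        # position of [Content_Types].xml inside the archive does not matter.
        if "[Content_Types].xml" not in zf.namelist():
            return None
        # Step 3: read only that member and inspect the declared part types.
        root = ET.fromstring(zf.read("[Content_Types].xml"))
    for override in root.iter(f"{CT_NS}Override"):
        fmt = MAIN_PART_TYPES.get(override.get("ContentType", ""))
        if fmt is not None:
            return fmt
    return None
```

Only the main document types are mapped here; template and macro-enabled variants use other content types and would need their own entries.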
Another sample of a word document not detected as such is seen in issue https://github.com/DS4SD/docling/issues/476.
I'm seeing this issue for pptx files where [Content_Types].xml is present at the top, for example, this slide deck. I've run zipinfo on it to show that [Content_Types].xml does indeed sit at the top as expected, and I've truncated the rest of the zipinfo output to keep this post short. Below that is the [Content_Types].xml for the file, and I've also included the [Content_Types].xml of a similar pptx file that does process properly. Happy to help test any fixes.
Archive: name_scrubbed.pptx
Zip file size: 4291099 bytes, number of entries: 115
-rw---- 1.0 fat 8628 b- defS 80-Jan-01 00:00 [Content_Types].xml
...
115 files, 4964777 bytes uncompressed, 4274329 bytes compressed: 13.9%
And the subsequent [Content_Types].xml
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"> <Default Extension="emf" ContentType="image/x-emf"/> <Default Extension="jpeg" ContentType="image/jpeg"/> <Default Extension="jpg" ContentType="image/jpeg"/> <Default Extension="png" ContentType="image/png"/> <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/> <Default Extension="xml" ContentType="application/xml"/> <Override PartName="/ppt/presentation.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml"/> <Override PartName="/ppt/slideMasters/slideMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideMaster+xml"/> <Override PartName="/ppt/slides/slide1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/> <Override PartName="/ppt/slides/slide2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/> <Override PartName="/ppt/slides/slide3.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/> <Override PartName="/ppt/notesMasters/notesMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.notesMaster+xml"/> <Override PartName="/ppt/handoutMasters/handoutMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.handoutMaster+xml"/> <Override PartName="/ppt/presProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presProps+xml"/> <Override PartName="/ppt/viewProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.viewProps+xml"/> <Override PartName="/ppt/theme/theme1.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/tableStyles.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.tableStyles+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout1.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout3.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout4.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout5.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout6.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout7.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout8.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout9.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout10.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout11.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout12.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout13.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout14.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout15.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout16.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout17.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout18.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout19.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout20.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout21.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout22.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout23.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout24.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout25.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout26.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout27.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout28.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout29.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout30.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout31.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout32.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout33.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout34.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout35.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout36.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout37.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout38.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout39.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout40.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/theme/theme2.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/theme/theme3.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/changesInfos/changesInfo1.xml" ContentType="application/vnd.ms-powerpoint.changesinfo+xml"/> <Override PartName="/ppt/revisionInfo.xml" ContentType="application/vnd.ms-powerpoint.revisioninfo+xml"/> <Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/> <Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/> </Types>
Finally, in case it is helpful, here is the [Content_Types].xml of a similar PowerPoint file that does process properly:
<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"> <Default Extension="emf" ContentType="image/x-emf"/> <Default Extension="fntdata" ContentType="application/x-fontdata"/> <Default Extension="jpeg" ContentType="image/jpeg"/> <Default Extension="jpg" ContentType="image/jpeg"/> <Default Extension="png" ContentType="image/png"/> <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/> <Default Extension="xml" ContentType="application/xml"/> <Override PartName="/ppt/presentation.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml"/> <Override PartName="/ppt/slideMasters/slideMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideMaster+xml"/> <Override PartName="/ppt/slideMasters/slideMaster2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideMaster+xml"/> <Override PartName="/ppt/slides/slide1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/> <Override PartName="/ppt/slides/slide2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/> <Override PartName="/ppt/notesMasters/notesMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.notesMaster+xml"/> <Override PartName="/ppt/handoutMasters/handoutMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.handoutMaster+xml"/> <Override PartName="/ppt/presProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presProps+xml"/> <Override PartName="/ppt/viewProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.viewProps+xml"/> <Override PartName="/ppt/theme/theme1.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/tableStyles.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.tableStyles+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout3.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout4.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout5.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout6.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout7.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout8.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout9.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout10.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout11.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout12.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout13.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout14.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout15.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout16.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout17.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout18.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout19.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout20.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout21.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout22.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout23.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout24.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout25.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout26.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout27.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout28.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout29.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout30.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout31.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout32.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout33.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout34.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout35.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout36.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout37.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout38.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout39.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout40.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout41.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout42.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/theme/theme2.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/theme/theme3.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/theme/theme4.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/notesSlides/notesSlide1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.notesSlide+xml"/> <Override PartName="/ppt/authors.xml" ContentType="application/vnd.ms-powerpoint.authors+xml"/> <Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/> <Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/> <Override PartName="/docProps/custom.xml" ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/> </Types>
@dolfim-ibm @cau-git
DocumentStream() also has a name parameter. A more performance-efficient algorithm would be to:
- Open all documents that are supposed to be ZIP files as ZipFile(). If only one file format is allowlisted, use that back-end. Use the filename extension.
- In case more than one ZIP-based file format was allowlisted, check the central directory file header for an indication of the content type using, e.g., the occurrence of [Content_Types].xml in the return value of .namelist().
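One possible reading of this proposal, as a sketch. The function name `resolve_zip_backend` is hypothetical. Telling docx, pptx and xlsx apart by the conventional `word/`, `ppt/` and `xl/` member prefixes is an assumption added here; the comment above only mentions checking for [Content_Types].xml:

```python
import zipfile
from pathlib import PurePath
from typing import BinaryIO, Iterable, Optional

# Assumption: conventional top-level folder of each format's main part.
PART_PREFIX = {"docx": "word/", "pptx": "ppt/", "xlsx": "xl/"}


def resolve_zip_backend(
    name: str, stream: BinaryIO, allowed: Iterable[str]
) -> Optional[str]:
    """Pick a ZIP-based format for a named stream, or None if none fits."""
    # Use the filename extension to decide whether this is supposed to be
    # a ZIP-based document at all.
    ext = PurePath(name).suffix.lower().lstrip(".")
    if ext not in PART_PREFIX:
        return None
    allowed_set = set(allowed)
    zip_allowed = [fmt for fmt in PART_PREFIX if fmt in allowed_set]
    if not zip_allowed:
        return None
    try:
        # Opening as ZipFile reads only the central directory.
        with zipfile.ZipFile(stream) as zf:
            names = zf.namelist()
    except zipfile.BadZipFile:
        return None
    # A single allowlisted ZIP-based format needs no further inspection.
    if len(zip_allowed) == 1:
        return zip_allowed[0]
    # Several candidates: look for hints in the member names.
    if "[Content_Types].xml" not in names:
        return None
    for fmt in zip_allowed:
        if any(member.startswith(PART_PREFIX[fmt]) for member in names):
            return fmt
    return None
```

With a single allowlisted format the member names are never examined, which matches the intent of skipping detection work in the common case.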
The larger issue is: why detect the content type at all? Besides going against the "better to ask forgiveness than permission" principle, it costs time on every call just to serve the corner case where someone presents a document and says "guess what this is ...". When Docling is used in a web application back-end, the content type (Internet media type) of the data is supposed to be specified or detected already by that back-end, e.g. to check against ZIP bombs. There is no need for Docling to take on this responsibility.
There is no doubt the logic has to be fixed and improved, maybe also simplified altogether.
The initial use case, which was pretty relevant for us, is iterating through a folder of crawled files. That is where we learned that the file extension is very unreliable: you get .pdf files which are just 429 HTML pages, you get .php files which are actually PDFs, etc.
You could have a fallback type detection (configurable, default off).
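A sketch of what such an opt-in detection could look like, assuming the caller passes the file name and the first bytes of the content. `sniff_format`, `resolve_format` and the `detect_content` flag are hypothetical names, and the signature checks cover only the crawled-folder cases mentioned above:

```python
from pathlib import PurePath
from typing import Optional


def sniff_format(head: bytes) -> Optional[str]:
    """Guess a format from leading bytes; None if nothing matches."""
    stripped = head.lstrip()
    if stripped.startswith(b"%PDF-"):
        return "pdf"
    if stripped.lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    if head.startswith(b"PK\x03\x04"):  # local file header of a zip archive
        return "zip"
    return None


def resolve_format(
    name: str, head: bytes, detect_content: bool = False
) -> Optional[str]:
    """Trust the extension by default; sniff content only when asked to."""
    declared = PurePath(name).suffix.lower().lstrip(".") or None
    if not detect_content:
        return declared
    # With detection on, content wins and the extension is the fallback.
    return sniff_format(head) or declared
```

With the default `detect_content=False` no bytes are inspected, so callers that already know the media type pay nothing for detection.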
Any updates on this issue?
Facing this issue too with .pptx files
How about google/magika? It detects file content types with deep learning.