
Improve backend resolution logic

Open vagenas opened this issue 1 year ago • 7 comments

Requested feature

Document conversion currently contains logic for "guessing" / resolving the backend to use for a given input (ref).

This logic has some limitations. For example, when working with streams, it relies on the first 8 KB to detect the backend to use, which may or may not be enough for a correct detection (e.g. the deciding information might only appear at the end of a 10 KB stream).

Consider ways to remove these limitations.

One possible high-level approach to examine could be to:

  • remove the current layer of "guessing" a backend a priori and then committing to that guess, and
  • instead, keep for each format, e.g. XML, a list of backends to try one after another until one parses successfully (there could be a default list, configurable by the user).
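
To make the second bullet concrete, here is a minimal sketch of a per-format fallback chain. The names `Backend`, `DEFAULT_BACKENDS` and `parse_with_fallback` are hypothetical and not part of docling; the two "backends" are stand-ins built from the standard library.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical backend signature: takes raw bytes and returns a parsed result,
# raising an exception when the input is not something it can handle.
Backend = Callable[[bytes], object]

# Hypothetical default chain per format; a caller could pass their own list instead.
DEFAULT_BACKENDS: Dict[str, List[Tuple[str, Backend]]] = {
    "xml": [
        ("strict_xml", ET.fromstring),
        ("plain_text", lambda raw: raw.decode("utf-8")),
    ],
}


def parse_with_fallback(
    raw: bytes, backends: Sequence[Tuple[str, Backend]]
) -> Tuple[str, object]:
    """Try each (name, backend) pair in order and return the first success."""
    failures = []
    for name, backend in backends:
        try:
            return name, backend(raw)
        except Exception as exc:  # a real backend would raise a narrower error type
            failures.append(f"{name}: {exc}")
    raise ValueError("no backend could parse the input: " + "; ".join(failures))
```

Calling `parse_with_fallback(raw, DEFAULT_BACKENDS["xml"])` returns the name of the backend that succeeded together with its result, so no a priori guess is needed; the cost is that a wrong backend may do real work before failing.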

vagenas avatar Jan 24 '25 13:01 vagenas

It turns out the filetype library also loads only the first 8K bytes (ref), so this happens with file inputs too.

dolfim-ibm avatar Jan 29 '25 14:01 dolfim-ibm

As discovered in #542, some MS Office XML archives have the meta file [Content_Types].xml at the end, which is not captured by the 8K bytes signature.
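
A small self-contained illustration of the effect, using only the standard library. The archive is synthetic and the entry names are made up for the example; it only shows why a check on the first 8192 bytes cannot see an entry that is written late in the archive.

```python
import io
import zipfile

# Build an in-memory archive whose [Content_Types].xml is written last,
# after more than 8 KB of other (uncompressed) content.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("ppt/slides/slide1.xml", b"x" * 10_000)
    archive.writestr("[Content_Types].xml", b"<Types/>")
data = buffer.getvalue()

# What an 8 KB signature check gets to look at.
head = data[:8192]
name_in_head = b"[Content_Types].xml" in head

# The central directory at the end of the archive still lists every entry.
with zipfile.ZipFile(io.BytesIO(data)) as archive:
    name_in_listing = "[Content_Types].xml" in archive.namelist()

print(name_in_head, name_in_listing)  # False True
```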

One way of improving the logic could be:

  1. Detect if the file is a zip archive (here filetype should work)
  2. List all the files in there and check if [Content_Types].xml is present
  3. If so, read it and infer the proper file type from it. Since zip archives allow random access, this could be more efficient than reading the whole file.
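
The three steps could be sketched roughly as follows. This is an illustration rather than docling code: `detect_ooxml_format` and `MAIN_PART_TYPES` are hypothetical names, and step 1 uses the standard library's `zipfile.is_zipfile` in place of filetype to keep the example dependency-free.

```python
import xml.etree.ElementTree as ET
import zipfile
from typing import BinaryIO, Optional

CT_NS = "{http://schemas.openxmlformats.org/package/2006/content-types}"

# Content types of the main document part (prefix match) mapped to a short label.
MAIN_PART_TYPES = {
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document.main": "docx",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation.main": "pptx",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet.main": "xlsx",
}


def detect_ooxml_format(stream: BinaryIO) -> Optional[str]:
    """Return 'docx', 'pptx' or 'xlsx', or None if this is not a recognised package."""
    # Step 1: is it a zip archive at all?
    if not zipfile.is_zipfile(stream):
        return None
    stream.seek(0)
    with zipfile.ZipFile(stream) as archive:
        # Step 2: the listing comes from the central directory, wherever the entry sits.
        if "[Content_Types].xml" not in archive.namelist():
            return None
        # Step 3: read only this one entry and inspect its Override elements.
        root = ET.fromstring(archive.read("[Content_Types].xml"))
    for override in root.iter(f"{CT_NS}Override"):
        content_type = override.get("ContentType", "")
        for prefix, label in MAIN_PART_TYPES.items():
            if content_type.startswith(prefix):
                return label
    return None
```

The stream has to be seekable for this to work, which holds for files and `BytesIO` objects but not for arbitrary network streams.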

dolfim-ibm avatar Jan 29 '25 14:01 dolfim-ibm

Another sample of a word document not detected as such is seen in issue https://github.com/DS4SD/docling/issues/476.

cau-git avatar Jan 31 '25 09:01 cau-git

I'm seeing this issue for pptx files where [Content_Types].xml is present at the top. One example is this slide deck: I've run zipinfo on it to show that [Content_Types].xml does indeed sit at the top as expected, and truncated the rest of the zipinfo output to keep this post short. Below that is the [Content_Types].xml for the file, followed by the [Content_Types].xml of a similar pptx file that does process properly. Happy to help test any fixes.

Archive: name_scrubbed.pptx
Zip file size: 4291099 bytes, number of entries: 115
-rw---- 1.0 fat 8628 b- defS 80-Jan-01 00:00 [Content_Types].xml
...
115 files, 4964777 bytes uncompressed, 4274329 bytes compressed: 13.9%

And the corresponding [Content_Types].xml:

<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"> <Default Extension="emf" ContentType="image/x-emf"/> <Default Extension="jpeg" ContentType="image/jpeg"/> <Default Extension="jpg" ContentType="image/jpeg"/> <Default Extension="png" ContentType="image/png"/> <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/> <Default Extension="xml" ContentType="application/xml"/> <Override PartName="/ppt/presentation.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml"/> <Override PartName="/ppt/slideMasters/slideMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideMaster+xml"/> <Override PartName="/ppt/slides/slide1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/> <Override PartName="/ppt/slides/slide2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/> <Override PartName="/ppt/slides/slide3.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/> <Override PartName="/ppt/notesMasters/notesMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.notesMaster+xml"/> <Override PartName="/ppt/handoutMasters/handoutMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.handoutMaster+xml"/> <Override PartName="/ppt/presProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presProps+xml"/> <Override PartName="/ppt/viewProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.viewProps+xml"/> <Override PartName="/ppt/theme/theme1.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/tableStyles.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.tableStyles+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout1.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout3.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout4.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout5.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout6.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout7.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout8.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout9.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout10.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout11.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout12.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout13.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout14.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout15.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout16.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout17.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout18.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout19.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout20.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout21.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout22.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout23.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout24.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout25.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout26.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout27.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout28.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout29.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout30.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout31.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout32.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout33.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout34.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout35.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout36.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout37.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout38.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout39.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout40.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/theme/theme2.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/theme/theme3.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/changesInfos/changesInfo1.xml" ContentType="application/vnd.ms-powerpoint.changesinfo+xml"/> <Override PartName="/ppt/revisionInfo.xml" ContentType="application/vnd.ms-powerpoint.revisioninfo+xml"/> <Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/> <Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/> </Types>

Finally, in case it is helpful, here is the [Content_Types].xml of a similar PowerPoint file that does process properly:

<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types"> <Default Extension="emf" ContentType="image/x-emf"/> <Default Extension="fntdata" ContentType="application/x-fontdata"/> <Default Extension="jpeg" ContentType="image/jpeg"/> <Default Extension="jpg" ContentType="image/jpeg"/> <Default Extension="png" ContentType="image/png"/> <Default Extension="rels" ContentType="application/vnd.openxmlformats-package.relationships+xml"/> <Default Extension="xml" ContentType="application/xml"/> <Override PartName="/ppt/presentation.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presentation.main+xml"/> <Override PartName="/ppt/slideMasters/slideMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideMaster+xml"/> <Override PartName="/ppt/slideMasters/slideMaster2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideMaster+xml"/> <Override PartName="/ppt/slides/slide1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/> <Override PartName="/ppt/slides/slide2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slide+xml"/> <Override PartName="/ppt/notesMasters/notesMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.notesMaster+xml"/> <Override PartName="/ppt/handoutMasters/handoutMaster1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.handoutMaster+xml"/> <Override PartName="/ppt/presProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.presProps+xml"/> <Override PartName="/ppt/viewProps.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.viewProps+xml"/> <Override PartName="/ppt/theme/theme1.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/tableStyles.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.tableStyles+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout2.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout3.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout4.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout5.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout6.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout7.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout8.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout9.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout10.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout11.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout12.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout13.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout14.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout15.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout16.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout17.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout18.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout19.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout20.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout21.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout22.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout23.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout24.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout25.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout26.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout27.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout28.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout29.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout30.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout31.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout32.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout33.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout34.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout35.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout36.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout37.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout38.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout39.xml" 
ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout40.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout41.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/slideLayouts/slideLayout42.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.slideLayout+xml"/> <Override PartName="/ppt/theme/theme2.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/theme/theme3.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/theme/theme4.xml" ContentType="application/vnd.openxmlformats-officedocument.theme+xml"/> <Override PartName="/ppt/notesSlides/notesSlide1.xml" ContentType="application/vnd.openxmlformats-officedocument.presentationml.notesSlide+xml"/> <Override PartName="/ppt/authors.xml" ContentType="application/vnd.ms-powerpoint.authors+xml"/> <Override PartName="/docProps/core.xml" ContentType="application/vnd.openxmlformats-package.core-properties+xml"/> <Override PartName="/docProps/app.xml" ContentType="application/vnd.openxmlformats-officedocument.extended-properties+xml"/> <Override PartName="/docProps/custom.xml" ContentType="application/vnd.openxmlformats-officedocument.custom-properties+xml"/> </Types>

dward4 avatar Feb 03 '25 15:02 dward4

@dolfim-ibm @cau-git

> As discovered in #542, some MS Office XML archives have the meta file [Content_Types].xml at the end, which is not captured by the 8K bytes signature.
>
> One way of improving the logic could be:
>
>   1. Detect if the file is a zip archive (here filetype should work)
>   2. List all the files in there and check if [Content_Types].xml is present
>   3. In case, read it and infer the proper file type from it. Since zip archives allow random access, this could be more efficient than reading the whole file.

DocumentStream() also has a name parameter. A more efficient algorithm would be to:

  1. Open all documents that are supposed to be ZIP files as ZipFile(). If only one file format is allowlisted, use that back-end. Use the filename extension.
  2. In case more than one ZIP-based file format is allowlisted, check the central directory file header for an indication of the content type using, e.g., the occurrence of [Content_Types].xml in the return value of .namelist().
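
A rough sketch of these two steps, under some assumptions: `resolve_zip_backend` and `ZIP_FORMATS` are hypothetical names, the filename extension is read as the signal that a document is "supposed to be" ZIP-based, and the conventional top-level part directories (word/, ppt/, xl/) are used as a heuristic to tell the formats apart, since the mere presence of [Content_Types].xml does not distinguish them.

```python
import zipfile
from pathlib import PurePath
from typing import BinaryIO, Iterable, Optional

# Hypothetical table: extension -> (format label, conventional part directory).
ZIP_FORMATS = {
    ".docx": ("docx", "word/"),
    ".pptx": ("pptx", "ppt/"),
    ".xlsx": ("xlsx", "xl/"),
}


def resolve_zip_backend(
    name: str, stream: BinaryIO, allowed: Iterable[str]
) -> Optional[str]:
    """Pick a ZIP-based format label for a named stream, or None if undecided."""
    if PurePath(name).suffix.lower() not in ZIP_FORMATS:
        return None  # not supposed to be a ZIP-based document
    allowed_set = set(allowed)
    candidates = [entry for entry in ZIP_FORMATS.values() if entry[0] in allowed_set]
    if not candidates:
        return None
    if len(candidates) == 1:
        # Step 1: a single allowlisted format needs no inspection at all;
        # the backend itself will fail if the content is something else.
        return candidates[0][0]
    # Step 2: namelist() is answered from the central directory,
    # so no entry has to be decompressed.
    with zipfile.ZipFile(stream) as archive:
        names = archive.namelist()
    if "[Content_Types].xml" not in names:
        return None
    for label, part_dir in candidates:
        if any(entry.startswith(part_dir) for entry in names):
            return label
    return None
```

With a single allowlisted format the stream is never opened here, which is the cheap path described in step 1.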

The larger issue is: why detect the content type at all? Besides going against the "better to ask forgiveness than permission" principle, detection costs time on every call just to cover the corner case where someone presents a document and says "guess what this is ...". When Docling is used in a web application back-end, the content type (Internet media type) of the data is supposed to be specified or detected already by that back-end, e.g. to check against ZIP bombs. There is no need for Docling to take on this responsibility.

sanmai-NL avatar Feb 13 '25 08:02 sanmai-NL

There is no doubt the logic has to be fixed and improved, maybe also simplified altogether.

The initial use case, which was quite relevant for us, was iterating through a folder of crawled files. That is where we learned that the file extension is very unreliable: you get .pdf files that are just 429 HTML pages, .php files that are actually PDFs, etc.

dolfim-ibm avatar Feb 13 '25 11:02 dolfim-ibm

You could have a fallback type detection (configurable, default off).
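
Something along these lines, as a sketch only: `resolve_format`, `sniff` and the `detect_fallback` flag are hypothetical names, and the signature list is illustrative rather than exhaustive.

```python
from pathlib import PurePath
from typing import Optional

# Magic-byte prefixes for a couple of formats; illustrative, not exhaustive.
SIGNATURES = [(b"%PDF-", "pdf"), (b"PK\x03\x04", "zip")]


def sniff(head: bytes) -> Optional[str]:
    """Guess a format from the first bytes of the content, or return None."""
    for magic, label in SIGNATURES:
        if head.startswith(magic):
            return label
    if head.lstrip()[:15].lower().startswith((b"<!doctype html", b"<html")):
        return "html"
    return None


def resolve_format(
    name: str,
    head: bytes,
    declared: Optional[str] = None,
    detect_fallback: bool = False,  # detection is opt-in
) -> Optional[str]:
    """Prefer the caller-declared format; sniff content only when opted in."""
    if declared is not None:
        return declared
    if detect_fallback:
        sniffed = sniff(head)
        if sniffed is not None:
            return sniffed
    # Default path: trust the filename extension.
    return PurePath(name).suffix.lower().lstrip(".") or None
```

With the default settings a crawled `report.pdf` that is really an HTML error page resolves to `"pdf"` and fails later in the backend; with `detect_fallback=True` it resolves to `"html"`.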

sanmai-NL avatar Feb 13 '25 13:02 sanmai-NL

Any updates on this issue?

yssAI avatar Apr 02 '25 10:04 yssAI

Facing this issue too with .pptx files

anuar12 avatar Apr 24 '25 15:04 anuar12

How about google/magika? It detects file content types with deep learning.

whisper-bye avatar May 09 '25 03:05 whisper-bye