docling
Support HTML via PDF conversion
HTML documents can easily be converted (or printed) to PDF.
The advantage of this approach is that printing generates proper layout and visualization components, such as pages and bounding boxes. The browser that prints the document also takes care of interpreting the webpage's stylesheets. The disadvantage is that the process is a bit slower than parsing the native HTML format, and that all semantic content (e.g. section headers) must be re-inferred.
Reading HTML via PDF print will be one of the possible ways of using HTML documents as input. See https://github.com/DS4SD/docling/issues/107 for the native fast parsing.
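
As a rough illustration of the print step, here is a minimal sketch that shells out to a headless Chromium-style browser. It is not part of docling: the helper names `build_print_command` and `html_to_pdf`, the default executable name `chromium`, and the timeout are all assumptions made for this example. The only browser features relied on are the `--headless` and `--print-to-pdf=<path>` command-line switches.

```python
import shutil
import subprocess
from pathlib import Path


def build_print_command(browser: str, source: str, pdf_path: Path) -> list[str]:
    """Build the argument list for a headless Chromium-style PDF print.

    `browser` is the executable to run, `source` is a URL or file:// URI.
    """
    return [
        browser,
        "--headless",                  # run without opening a window
        f"--print-to-pdf={pdf_path}",  # let the browser lay out and paginate the page
        source,
    ]


def html_to_pdf(source: str, pdf_path: Path, browser: str = "chromium") -> Path:
    """Print `source` to `pdf_path` (hypothetical helper, not a docling API)."""
    executable = shutil.which(browser)
    if executable is None:
        raise FileNotFoundError(f"browser executable not found: {browser}")
    # The 120 s timeout is an arbitrary choice for this sketch.
    subprocess.run(
        build_print_command(executable, source, pdf_path),
        check=True,
        timeout=120,
    )
    if not pdf_path.exists():
        raise RuntimeError(f"browser exited without writing {pdf_path}")
    return pdf_path
```

The resulting PDF could then be passed through the existing PDF pipeline, which is where the semantic structure (section headers, etc.) would be re-inferred.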