
Support arXiv HTML papers

Open dai-shuo opened this issue 1 year ago • 0 comments

arXiv provides a static HTML version of most papers, generated with LaTeXML. The HTML is well structured, with rich ltx_xxxx CSS class names. Parsing these pages should be very fast and yield precise structural information. It would be great to support arXiv HTML parsing, either as a much faster branch or as a strong hint for the pipeline.
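A rough sketch of what such a fast path could look like, using only the Python standard library. This is not docling code: `LtxCollector`, `extract`, and the inline `SAMPLE` markup are hypothetical names made up for illustration, and the class names used (`ltx_title`, `ltx_p`, and so on) are assumed to match what LaTeXML emits on real arXiv pages and should be checked against an actual paper.

```python
from html.parser import HTMLParser

# HTML void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class LtxCollector(HTMLParser):
    """Hypothetical helper: collect the text of elements carrying a wanted ltx_* class."""

    def __init__(self, wanted):
        super().__init__(convert_charrefs=True)
        self.wanted = set(wanted)
        self.results = []    # (class_name, text) pairs in document order
        self._active = None  # [class_name, depth, chunks] while capturing

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._active is not None:
            # Already capturing: only track nesting depth.
            self._active[1] += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        hit = next((c for c in classes if c in self.wanted), None)
        if hit:
            self._active = [hit, 1, []]

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._active is None:
            return
        self._active[1] -= 1
        if self._active[1] == 0:
            name, _, chunks = self._active
            # Join the pieces and collapse runs of whitespace.
            self.results.append((name, " ".join("".join(chunks).split())))
            self._active = None

    def handle_data(self, data):
        if self._active is not None:
            self._active[2].append(data)


def extract(html, wanted=("ltx_title", "ltx_p")):
    """Return (class_name, text) pairs for elements with a wanted class."""
    parser = LtxCollector(wanted)
    parser.feed(html)
    parser.close()
    return parser.results


# Made-up markup imitating the assumed LaTeXML structure, not a real arXiv page.
SAMPLE = """
<article class="ltx_document">
  <h1 class="ltx_title ltx_title_document">A Sample Paper</h1>
  <div class="ltx_abstract"><p class="ltx_p">We study <em>things</em>.</p></div>
  <section class="ltx_section">
    <h2 class="ltx_title ltx_title_section">1 Introduction</h2>
    <div class="ltx_para"><p class="ltx_p">First paragraph.</p></div>
  </section>
</article>
"""

if __name__ == "__main__":
    for name, text in extract(SAMPLE):
        print(f"{name}: {text}")
```

Because the class names already encode the role of each element, a single streaming pass like this needs no layout analysis or OCR, which is where the speed-up would come from. A real integration would also have to handle math, figures, tables, and bibliography items, which this sketch ignores.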

dai-shuo • Nov 03 '24 01:11