docling
Support arXiv HTML papers
arXiv provides a static HTML version of most papers, generated with LaTeXML. The HTML is well structured, with rich ltx_xxxx CSS class names on its elements. Parsing these pages should be very fast and give very precise structural info. It would be cool to support arXiv HTML parsing, either as a much faster branch or as a strong hint for the pipeline.
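
To illustrate the idea, here is a minimal sketch using only the Python standard library. It is not docling's API: the `LtxExtractor` class, the `extract_blocks` helper, and the `LABELS` mapping are hypothetical names made up for this example, and the class names it looks for (`ltx_title_document`, `ltx_title_section`, `ltx_p`) are assumed from typical LaTeXML output and should be checked against real arXiv pages.

```python
from html.parser import HTMLParser

# Hypothetical mapping from LaTeXML class names to block labels.
# These class names are assumptions based on typical LaTeXML output.
LABELS = {
    "ltx_title_document": "title",
    "ltx_title_section": "section_header",
    "ltx_p": "paragraph",
}


class LtxExtractor(HTMLParser):
    """Collects (label, text) pairs for elements carrying a known ltx_* class."""

    def __init__(self):
        super().__init__()
        self.blocks = []     # collected (label, text) pairs, in document order
        self._label = None   # label of the element currently being captured
        self._tag = None     # tag name of that element
        self._depth = 0      # nesting depth of same-named tags inside it
        self._buf = []       # text fragments seen inside it

    def handle_starttag(self, tag, attrs):
        if self._label is not None:
            # Already capturing: only track nesting of the same tag name.
            if tag == self._tag:
                self._depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        for cls in classes:
            if cls in LABELS:
                self._label, self._tag = LABELS[cls], tag
                self._depth, self._buf = 1, []
                break

    def handle_endtag(self, tag):
        if self._label is not None and tag == self._tag:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._buf).split())
                self.blocks.append((self._label, text))
                self._label = None

    def handle_data(self, data):
        if self._label is not None:
            self._buf.append(data)


def extract_blocks(html):
    """Return the labelled text blocks found in an HTML string."""
    parser = LtxExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks


# Hand-written sample imitating LaTeXML markup, not a real arXiv page.
SAMPLE = """
<article class="ltx_document">
<h1 class="ltx_title ltx_title_document">A Sample Paper</h1>
<section class="ltx_section">
<h2 class="ltx_title ltx_title_section"><span class="ltx_tag ltx_tag_section">1 </span>Introduction</h2>
<div class="ltx_para"><p class="ltx_p">First paragraph.</p></div>
</section>
</article>
"""

for label, text in extract_blocks(SAMPLE):
    print(f"{label}: {text}")
```

Because the labels come straight from the class names, no layout model or OCR is needed on this path. A real integration would presumably also need to cover equations, figures, tables, and the bibliography, which this sketch ignores.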