docling_parse_v2 split/connect words
Bug
There are cases where docling_parse_v2 splits words into their individual characters or joins adjacent words together.
Example 1:
Original text: products that were recently iroduced
Markdown: products that were re c e n t l y i roduced
Example 2:
Original text: Tables 2–5 show the results of partitioning the graphs in our test suite on
Markdown: Tables 2-5 sho w theresultsfpartitioningegraphsinourtest suite on
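To spot affected passages in a larger export, a rough heuristic can flag the two symptoms shown in the examples: runs of single-letter tokens (split words) and unusually long tokens (joined words). This is only a sketch and is not part of docling. The function names and the thresholds `min_run=3` and `max_len=20` are assumptions chosen for illustration, and legitimate long words or initials can trigger false positives.

```python
# Heuristic check for split/joined words in exported text.
# Helper names and thresholds are hypothetical, not a docling API.

def find_split_runs(text, min_run=3):
    """Return runs of at least `min_run` consecutive single-letter tokens."""
    runs, current = [], []
    for token in text.split():
        if len(token) == 1 and token.isalpha():
            current.append(token)
            continue
        # A longer token ends the current run; keep it if long enough.
        if len(current) >= min_run:
            runs.append(" ".join(current))
        current = []
    if len(current) >= min_run:
        runs.append(" ".join(current))
    return runs


def find_long_tokens(text, max_len=20):
    """Return whitespace-separated tokens longer than `max_len` characters."""
    return [token for token in text.split() if len(token) > max_len]


# The two markdown outputs quoted in this report.
example_1 = "products that were re c e n t l y i roduced"
example_2 = "Tables 2-5 sho w theresultsfpartitioningegraphsinourtest suite on"

split_runs_1 = find_split_runs(example_1)
long_tokens_2 = find_long_tokens(example_2)
```

Running both checks over the markdown for each page gives a quick way to list the pages worth attaching to the report.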
Steps to reproduce
...
Docling version
Docling version: 2.21.0
Docling Core version: 2.18.0
Docling IBM Models version: 3.3.0
Docling Parse version: 3.3.0
Python: cpython-311 (3.11.4)
Platform: macOS-14.6.1-arm64-arm-64bit
Python version
Python 3.11.4
@InbarShapira Can you please provide an example?
This is almost certainly https://github.com/DS4SD/docling-parse/issues/99
If you downgrade to docling-parse 3.1.2 (you will also need to downgrade docling to 2.21 at least I think) does the issue still occur?
You can also, of course, try this PR: https://github.com/DS4SD/docling-parse/pull/105
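When comparing behavior before and after a downgrade, it helps to confirm which versions are actually installed in the active environment. A minimal sketch using only the standard library follows. The helper name `installed_versions` is hypothetical, and the distribution names `docling` and `docling-parse` are assumed to match what pip installed.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("docling", "docling-parse")):
    """Map each distribution name to its installed version, or None if absent."""
    result = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in this environment.
            result[name] = None
    return result
```

Calling `installed_versions()` before and after changing the docling-parse pin shows whether the environment really picked up the intended version.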