docling_parse_v2 split/connect words

Open InbarShapira opened this issue 11 months ago • 1 comments

Bug

In some cases, docling_parse_v2 splits a word into its individual characters or joins adjacent words together.

Example 1:

Original text: products that were recently iroduced

Markdown: products that were re c e n t l y i roduced

Example 2:

Original text: Tables 2–5 show the results of partitioning the graphs in our test suite on

Markdown: Tables 2-5 sho w theresultsfpartitioningegraphsinourtest suite on
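A rough way to flag output like this automatically, as a minimal sketch: the two helper names below are hypothetical and not part of docling, and the check assumes a reference copy of the original text is available. The first helper compares two strings with all whitespace removed; the second measures how many tokens are single letters, which jumps when a word is split into characters.

```python
def same_ignoring_whitespace(original: str, converted: str) -> bool:
    """True if the two strings differ only in whitespace (hypothetical helper)."""
    return "".join(original.split()) == "".join(converted.split())


def single_letter_ratio(text: str) -> float:
    """Fraction of whitespace-separated tokens that are a single letter."""
    tokens = text.split()
    if not tokens:
        return 0.0
    singles = sum(1 for tok in tokens if len(tok) == 1 and tok.isalpha())
    return singles / len(tokens)


# Strings copied from the first example in this report.
original = "products that were recently iroduced"
converted = "products that were re c e n t l y i roduced"

if __name__ == "__main__":
    print(same_ignoring_whitespace(original, converted))
    print(single_letter_ratio(original), single_letter_ratio(converted))
```

For the first example the two strings match once whitespace is ignored, so the damage there is purely in the spacing. The second example also differs in non-whitespace characters, so only the ratio check is informative for it.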

Steps to reproduce

...

Docling version

Docling version: 2.21.0
Docling Core version: 2.18.0
Docling IBM Models version: 3.3.0
Docling Parse version: 3.3.0
Python: cpython-311 (3.11.4)
Platform: macOS-14.6.1-arm64-arm-64bit

Python version

Python 3.11.4

InbarShapira avatar Feb 12 '25 20:02 InbarShapira

@InbarShapira Can you please provide an example?

PeterStaar-IBM avatar Feb 13 '25 06:02 PeterStaar-IBM

This is almost certainly https://github.com/DS4SD/docling-parse/issues/99

dhdaines avatar Mar 06 '25 17:03 dhdaines

If you downgrade to docling-parse 3.1.2, does the issue still occur? (I think you will also need to downgrade docling, to 2.21 at least.)
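To confirm which versions are actually installed before and after the downgrade, something like the following may help. It is only a sketch: it uses the standard-library `importlib.metadata`, assumes the PyPI distribution names `docling`, `docling-core`, `docling-ibm-models` and `docling-parse`, and the function name `installed_versions` is made up for illustration.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names assumed to match the packages listed in the report.
PACKAGES = ("docling", "docling-core", "docling-ibm-models", "docling-parse")


def installed_versions(packages=PACKAGES):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running it in the same virtual environment that performs the conversion shows whether the pinned docling-parse version was really picked up.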

dhdaines avatar Mar 06 '25 17:03 dhdaines

You can also, of course, try this PR: https://github.com/DS4SD/docling-parse/pull/105

dhdaines avatar Mar 06 '25 17:03 dhdaines