Rounded OCR bounding box and strict intersection check cause OCR TextCells to be dropped

Open Bill-XU opened this issue 8 months ago • 2 comments

Bug

TextCells created by the OCR model do not show up in exported files.

Steps to reproduce

Note: Issue #1643 should be fixed first.

  1. Use a PDF that contains a picture and a line of text positioned very close to each other (see the attached PDFs)
  2. Enable the OCR model
  3. Export

Docling version

docling 2.33.0, docling-core 2.31.1, docling-ibm-models 3.4.3, docling-parse 4.0.1

Python version

python 3.11.9

can_reproduce_the_issue.pdf

cannot_reproduce_the_issue.pdf

Bill-XU avatar May 23 '25 03:05 Bill-XU

I located the problematic source code: https://github.com/docling-project/docling/blob/45265bf8b1a6d6ad5367bb3f17fb3fa9d4366a05/docling/models/base_ocr_model.py#L47-L50

All coordinates are rounded when the bounding boxes for the OCR model are created, whereas the TextCells created by PageProcessingModel are not rounded.
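A minimal sketch of how mixing rounded and unrounded coordinates can turn two non-overlapping boxes into strictly intersecting ones. The `Box` class and all coordinates here are hypothetical illustrations, not Docling's `BoundingBox` and not values taken from the attached PDFs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical axis-aligned box (left, top, right, bottom), top-left origin."""
    l: float
    t: float
    r: float
    b: float

    def rounded(self) -> "Box":
        # Round every coordinate to the nearest integer.
        return Box(round(self.l), round(self.t), round(self.r), round(self.b))

    def intersects(self, other: "Box") -> bool:
        # Strict test: boxes that only touch along an edge do not intersect.
        return (
            self.l < other.r and other.l < self.r
            and self.t < other.b and other.t < self.b
        )


# Hypothetical programmatic text cell, kept as floats.
text_cell = Box(l=89.3, t=301.0, r=92.9, b=311.3)
# Hypothetical image region just below the text line.
image_rect = Box(l=54.4, t=311.4, r=553.2, b=588.1)

before = text_cell.intersects(image_rect)           # float vs float
after = text_cell.intersects(image_rect.rounded())  # float vs rounded

print(before, after)
```

With these made-up numbers, the image region's top edge moves from 311.4 to 311 after rounding, which is enough to cross the text cell's bottom edge at 311.3.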

Bill-XU avatar May 23 '25 03:05 Bill-XU

If rounding all coordinates is necessary, this block of source code has to be changed: https://github.com/docling-project/docling/blob/45265bf8b1a6d6ad5367bb3f17fb3fa9d4366a05/docling/models/base_ocr_model.py#L117-L125

The coordinates of programmatic cells' bounding boxes are floating-point numbers.

In my case, I got the following result, in which the first cell overlaps the remaining cells.

type               bounding box
ocr cell           54, 311, 553, 588
programmatic cell  89.304, 301.4579829, 92.894, 311.7969829
programmatic cell  544.56, 571.7719829, 547.496, 581.2439829
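To make the size of each overlap concrete, here is a small sketch that computes, for the coordinates reported above, how much of each programmatic cell is covered by the OCR cell. It assumes the four numbers are ordered (left, top, right, bottom); that ordering and the helper names are assumptions, not something the issue states.

```python
def area(box):
    """Area of an axis-aligned box given as (x0, y0, x1, y1)."""
    return (box[2] - box[0]) * (box[3] - box[1])


def intersection_area(a, b):
    """Area of the overlap between two boxes, 0.0 if they do not overlap."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


# Values copied from the report; ordering is assumed to be (l, t, r, b).
ocr_cell = (54, 311, 553, 588)
programmatic_cells = [
    (89.304, 301.4579829, 92.894, 311.7969829),
    (544.56, 571.7719829, 547.496, 581.2439829),
]

# Fraction of each programmatic cell that lies inside the OCR cell.
ratios = [intersection_area(ocr_cell, c) / area(c) for c in programmatic_cells]

for cell, ratio in zip(programmatic_cells, ratios):
    print(cell, f"covered fraction = {ratio:.4f}")
```

Under that assumed ordering, the first programmatic cell crosses the OCR cell's top edge by less than one unit (roughly 8% of its area), while the second lies entirely inside the OCR cell.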

A threshold on the overlap ratio would be sufficient. However, since rtree does not support a threshold, an additional calculation is needed after the intersection query, which may slightly impact performance.
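A rough sketch of what such a post-calculation could look like. This is not Docling's code: `filter_ocr_cells`, the choice of denominator (the smaller of the two boxes), and the 0.5 default threshold are all assumptions made for illustration, and a plain list scan stands in for the rtree intersection query so the example stays self-contained.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom)


def area(box: Box) -> float:
    return (box[2] - box[0]) * (box[3] - box[1])


def intersection_area(a: Box, b: Box) -> float:
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)


def overlap_ratio(a: Box, b: Box) -> float:
    """Intersection area divided by the area of the smaller box (assumed metric)."""
    smaller = min(area(a), area(b))
    return intersection_area(a, b) / smaller if smaller > 0 else 0.0


def filter_ocr_cells(
    ocr_cells: List[Box], programmatic_cells: List[Box], threshold: float = 0.5
) -> List[Box]:
    """Keep OCR cells whose overlap with every programmatic cell is below threshold."""
    kept = []
    for ocr in ocr_cells:
        # Stand-in for the spatial index query: strict intersection candidates.
        candidates = [p for p in programmatic_cells if intersection_area(ocr, p) > 0]
        # Post-calculation: only a substantial overlap disqualifies the OCR cell.
        if all(overlap_ratio(ocr, p) < threshold for p in candidates):
            kept.append(ocr)
    return kept


# Hypothetical data: one OCR cell grazes the text line, the other duplicates it.
programmatic = [(10.0, 10.0, 110.0, 20.0)]
ocr = [(10.0, 19.5, 110.0, 30.0), (12.0, 10.0, 108.0, 20.0)]

kept = filter_ocr_cells(ocr, programmatic)
print(kept)
```

The extra cost is one ratio computation per candidate returned by the index, so it only grows with the number of boxes that already intersect.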

Bill-XU avatar May 23 '25 03:05 Bill-XU

I ran can_reproduce_the_issue.pdf using Docling. Could you help identify which specific text cell is missing?

Docling version

docling 2.33.0, docling-core 2.32.0, docling-ibm-models 3.4.3, docling-parse 4.0.1

[debug images: ocr_page_cell, postprocessed_layout_page]

yohan-cw avatar Jun 13 '25 08:06 yohan-cw

I see, you are using full-page OCR. #1644 only occurs when performing OCR on embedded images.

Bill-XU avatar Jun 19 '25 06:06 Bill-XU