Rounded OCR bounding boxes and strict intersection check cause OCR TextCells to be dropped
Bug
TextCells created by the OCR model do not show up in exported files.
Steps to reproduce
Note: Issue #1643 should be fixed first.
- Use a PDF in which a picture and a line of text sit very close to each other (see the attached PDFs)
- Use an OCR model
- Export
Docling version
docling 2.33.0 docling-core 2.31.1 docling-ibm-models 3.4.3 docling-parse 4.0.1
Python version
python 3.11.9
I located the problematic source code: https://github.com/docling-project/docling/blob/45265bf8b1a6d6ad5367bb3f17fb3fa9d4366a05/docling/models/base_ocr_model.py#L47-L50
All coordinates are rounded when bounding boxes are created for the OCR model, whereas the TextCells created by PageProcessingModel are not rounded.
If rounding all coordinates is necessary, then this block of source code has to be changed instead: https://github.com/docling-project/docling/blob/45265bf8b1a6d6ad5367bb3f17fb3fa9d4366a05/docling/models/base_ocr_model.py#L117-L125
The coordinates of programmatic cells' bounding boxes are floats.
In my case, I got the following result, in which the first cell (the OCR cell) overlaps the other cells.
| type | bounding box |
|---|---|
| ocr cell | 54 , 311 , 553 , 588 |
| programmatic cell | 89.304 , 301.4579829 , 92.894 , 311.7969829 |
| programmatic cell | 544.56 , 571.7719829 , 547.496 , 581.2439829 |
A threshold on the overlap ratio would be sufficient. However, since rtree does not support a threshold, the ratio has to be computed in a separate step after the intersection query, which may slightly impact performance.
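To make the idea concrete, here is a minimal sketch of such a post-calculation in plain Python, using the boxes from the table above. It is not Docling's implementation: the helper names, the `(left, top, right, bottom)` coordinate order, the choice of the programmatic cell's area as the denominator, and the `0.5` default threshold are all assumptions made for illustration. In practice the candidate cells would come from the rtree intersection query, and only those candidates would need the extra ratio check.

```python
from typing import Iterable, Tuple

# Assumed coordinate order: (left, top, right, bottom). Hypothetical helper type.
Box = Tuple[float, float, float, float]


def intersection_area(a: Box, b: Box) -> float:
    """Area shared by two axis-aligned boxes (0.0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width <= 0 or height <= 0:
        return 0.0
    return width * height


def strictly_intersects(a: Box, b: Box) -> bool:
    """Stand-in for a strict check: any positive shared area counts."""
    return intersection_area(a, b) > 0.0


def overlap_ratio(ocr_box: Box, cell_box: Box) -> float:
    """Fraction of the programmatic cell's area covered by the OCR box.

    Using the cell's area as the denominator is an assumption; other
    choices (OCR box area, union area) give different results.
    """
    cell_area = (cell_box[2] - cell_box[0]) * (cell_box[3] - cell_box[1])
    if cell_area <= 0:
        return 0.0
    return intersection_area(ocr_box, cell_box) / cell_area


def overlaps_existing_cells(
    ocr_box: Box, candidates: Iterable[Box], threshold: float = 0.5
) -> bool:
    """True if any candidate cell is covered by at least `threshold`."""
    return any(overlap_ratio(ocr_box, c) >= threshold for c in candidates)


# Boxes copied from the table in this report.
ocr_cell: Box = (54, 311, 553, 588)
cell_1: Box = (89.304, 301.4579829, 92.894, 311.7969829)
cell_2: Box = (544.56, 571.7719829, 547.496, 581.2439829)

ratio_1 = overlap_ratio(ocr_cell, cell_1)
ratio_2 = overlap_ratio(ocr_cell, cell_2)
```

With these numbers, both programmatic cells pass the strict check, but the first one only grazes the OCR box by less than one unit in height, while the second lies entirely inside it. A threshold therefore filters out the first kind of match; whether the second should count depends on which area the ratio is measured against.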
I ran can_reproduce_the_issue.pdf using Docling. Could you help identify which specific text cell is missing?
Docling version
docling 2.33.0 docling-core 2.32.0 docling-ibm-models 3.4.3 docling-parse 4.0.1
debug image ocr_page_cell debug image postprocessed_layout_page
I see, you are using full-page OCR. #1644 only occurs when performing OCR on embedded images.