Wrongly assigned TextCell indices in RapidOcrModel cause cells to be dropped in LayoutPostProcessor

Open Bill-XU opened this issue 8 months ago • 2 comments

Bug

TextCells created by RapidOcrModel are missing from the conversion result.

Steps to reproduce

The reproduction is a bit involved.

  1. Use a PDF file (see the attached PDFs).
  2. LayoutModel fails to detect the pictures in the PDF file as a cluster.
  3. RapidOcrModel creates TextCells for these pictures.
  4. The created TextCells do not appear in the exported files.

Docling version

docling 2.33.0, docling-core 2.31.1, docling-ibm-models 3.4.3, docling-parse 4.0.1

Python version

python 3.11.9

can_reproduce_the_issue.pdf

cannot_reproduce_the_issue.pdf

Bill-XU avatar May 23 '25 02:05 Bill-XU

Located the problematic source code: https://github.com/docling-project/docling/blob/45265bf8b1a6d6ad5367bb3f17fb3fa9d4366a05/docling/models/rapid_ocr_model.py#L110-L134

ix should be given an offset of one past the maximum index of the existing cells, as below.

_offset = max([_c.index for _c in page.cells]) + 1
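
A minimal, self-contained sketch of this offset idea. The `Cell` dataclass and the `assign_ocr_indices` helper are hypothetical stand-ins, not docling's API. The `default=-1` is an added guard (an assumption beyond the one-liner above) so that a page without existing cells starts at index 0 instead of raising an error.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a text cell; only the fields needed here."""
    index: int
    text: str


def assign_ocr_indices(existing, ocr_texts):
    """Create OCR cells whose indices start after the largest existing index."""
    # max(..., default=-1) keeps this safe when the page has no cells yet.
    offset = max((c.index for c in existing), default=-1) + 1
    return [Cell(index=offset + ix, text=t) for ix, t in enumerate(ocr_texts)]


existing = [Cell(0, "title"), Cell(1, "body")]
ocr_cells = assign_ocr_indices(existing, ["caption", "label"])
print([c.index for c in ocr_cells])  # [2, 3]
```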

Bill-XU avatar May 23 '25 03:05 Bill-XU

For your reference, the source code above conflicts with this method in LayoutPostProcessor: https://github.com/docling-project/docling/blob/45265bf8b1a6d6ad5367bb3f17fb3fa9d4366a05/docling/utils/layout_postprocessor.py#L607-L614
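
To illustrate the conflict, here is a sketch that assumes the linked method keeps only the first cell seen for each index. That behavior is an assumption inferred from this thread, and `deduplicate_by_index` is a hypothetical stand-in, not the actual docling code.

```python
def deduplicate_by_index(cells):
    """Keep the first cell for each index; later cells with a seen index are dropped."""
    seen = set()
    unique = []
    for index, text in cells:
        if index not in seen:
            seen.add(index)
            unique.append((index, text))
    return unique


# Cells are modeled as (index, text) pairs for brevity.
programmatic = [(0, "Heading"), (1, "Paragraph")]
# OCR cells whose indices restart at 0 collide with the programmatic cells.
ocr_colliding = [(0, "text in picture A"), (1, "text in picture B")]
# OCR cells whose indices continue after the existing ones do not collide.
ocr_offset = [(2, "text in picture A"), (3, "text in picture B")]

dropped = deduplicate_by_index(programmatic + ocr_colliding)
kept = deduplicate_by_index(programmatic + ocr_offset)
print(len(dropped), len(kept))  # 2 4
```

Under that assumption, the two OCR cells with reused indices vanish silently, which matches the symptom reported in this issue.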

Bill-XU avatar May 23 '25 03:05 Bill-XU

Hello, I’ve encountered the same issue and wanted to share how I resolved it.

Since this problem also occurs with other OCR models (EasyOCR has the same issue), I think modifying the BaseOcrModel component could be a reasonable approach.

In my case, I addressed it by re-indexing the OCR text cells using the filtered ocr_cells, assigning their indices starting right after the last index of programmatic_cells.

https://github.com/docling-project/docling/blob/7a275c763731d9c96b7cf32f2e27b8dc8bebacd7/docling/models/base_ocr_model.py#L133-L145

Here's the change I made inside post_process_cells:

filtered_ocr_cells = self._filter_ocr_cells(ocr_cells, programmatic_cells)

# re-indexing part
next_programmatic_cell_index = len(programmatic_cells) + 1
for i, textcell in enumerate(filtered_ocr_cells):
    textcell.index = next_programmatic_cell_index + i
        
programmatic_cells.extend(filtered_ocr_cells)
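
One point worth checking with this approach: a start value derived from `len(programmatic_cells)` is only guaranteed to be unique if the existing indices are contiguous from 0. The sketch below uses hypothetical helpers and made-up index values (not taken from docling) to compare it with a start value derived from the largest existing index.

```python
def reindex_from_len(existing_indices, n_new):
    """Start value from the list length, as in the snippet above."""
    start = len(existing_indices) + 1
    return [start + i for i in range(n_new)]


def reindex_from_max(existing_indices, n_new):
    """Start value from the largest existing index (hypothetical alternative)."""
    start = max(existing_indices, default=-1) + 1
    return [start + i for i in range(n_new)]


def collisions(existing_indices, new_indices):
    """Return the indices that appear in both lists, sorted."""
    return sorted(set(existing_indices) & set(new_indices))


contiguous = [0, 1, 2]  # indices 0..n-1, no gaps
sparse = [0, 5, 9]      # made-up case where some indices are missing

print(collisions(contiguous, reindex_from_len(contiguous, 3)))  # []
print(collisions(sparse, reindex_from_len(sparse, 3)))          # [5]
print(collisions(sparse, reindex_from_max(sparse, 3)))          # []
```

Whether non-contiguous indices can actually occur at this point in the pipeline is not established in this thread, so treat the sparse case as a what-if.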

yohan-cw avatar Jun 12 '25 01:06 yohan-cw

docling/docling/models/base_ocr_model.py

Lines 133 to 145 in 7a275c7

 def post_process_cells(self, ocr_cells, programmatic_cells): 
     r""" 
     Post-process the ocr and programmatic cells and return the final list of of cells 
     """ 
     if self.options.force_full_page_ocr: 
         # If a full page OCR is forced, use only the OCR cells 
         cells = ocr_cells 
         return cells 

     ## Remove OCR cells which overlap with programmatic cells. 
     filtered_ocr_cells = self._filter_ocr_cells(ocr_cells, programmatic_cells) 
     programmatic_cells.extend(filtered_ocr_cells) 
     return programmatic_cells 

Maybe you've encountered another issue in BaseOcrModel. You can refer to issue #1644, which I posted along with this one.

Bill-XU avatar Jun 12 '25 01:06 Bill-XU

It looks like this was resolved in #1745 — should we go ahead and mark it as resolved?

yohan-cw avatar Jul 23 '25 11:07 yohan-cw

Many thanks for your great efforts!👍 Sorry, I can't judge that right now; there are too many changes in this PR. I didn't expect so many changes to the base OCR model. 😅

Still, I had a glance at the PR. It looks like you created a post-processing step for OCR models instead of the one in the layout processor, and it uses _filter_ocr_cells to remove overlapping cells, which I think is reasonable. (But issue #1644 may then cause unexpected results; I assume you will deal with it later.)

Maybe we should mark this one as resolved and move on. Thank you again!

Best regards, Bill

Bill-XU avatar Jul 24 '25 05:07 Bill-XU