Wrongly assigned TextCell indices in RapidOcrModel cause cells to be dropped in LayoutPostProcessor
Bug
TextCells created by RapidOcrModel disappear from the result.
Steps to reproduce
This is a bit complicated.
- Use a PDF file (see the attached PDFs).
- LayoutModel does not detect the pictures in the PDF file as clusters.
- RapidOcrModel creates TextCells for these pictures.
- The created TextCells do not appear in the exported files.
Docling version
docling 2.33.0
docling-core 2.31.1
docling-ibm-models 3.4.3
docling-parse 4.0.1
Python version
python 3.11.9
I located the problematic source code: https://github.com/docling-project/docling/blob/45265bf8b1a6d6ad5367bb3f17fb3fa9d4366a05/docling/models/rapid_ocr_model.py#L110-L134
ix should be given an offset equal to the maximum index of the existing cells plus one, like below.
_offset = max((_c.index for _c in page.cells), default=-1) + 1  # default covers a page with no existing cells
For reference, the source code above conflicts with this method in LayoutPostProcessor: https://github.com/docling-project/docling/blob/45265bf8b1a6d6ad5367bb3f17fb3fa9d4366a05/docling/utils/layout_postprocessor.py#L607-L614
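To illustrate why colliding indices matter, here is a minimal sketch. It assumes that the linked LayoutPostProcessor method keeps only the first cell seen for each index; `Cell` and `deduplicate_by_index` are hypothetical stand-ins written for this example, not docling's actual classes or functions.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    """Hypothetical stand-in for a text cell; only the fields needed here."""
    index: int
    text: str


def deduplicate_by_index(cells):
    """Keep the first cell seen for each index (assumed post-processor behavior)."""
    seen = set()
    unique = []
    for cell in cells:
        if cell.index not in seen:
            seen.add(cell.index)
            unique.append(cell)
    return unique


# Cells parsed from the PDF already occupy indices 0 and 1.
programmatic = [Cell(0, "Title"), Cell(1, "Body text")]

# OCR cells numbered from 0 again collide with the programmatic ones.
ocr_colliding = [Cell(0, "Text in picture A"), Cell(1, "Text in picture B")]
kept_colliding = deduplicate_by_index(programmatic + ocr_colliding)

# Offsetting the OCR indices past the current maximum avoids the collision.
offset = max((c.index for c in programmatic), default=-1) + 1
ocr_offset = [Cell(offset + i, c.text) for i, c in enumerate(ocr_colliding)]
kept_offset = deduplicate_by_index(programmatic + ocr_offset)

print(len(kept_colliding))  # 2: both OCR cells were dropped
print(len(kept_offset))     # 4: all cells survive
```

Under that assumption, the OCR cells are silently discarded whenever their indices restart at 0, which matches the symptom described in this report.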
Hello, I’ve encountered the same issue and wanted to share how I resolved it.
Since this problem also occurs with other OCR models (EasyOCR has the same issue), I think modifying BaseOcrModel would be a reasonable approach.
In my case, I addressed it by re-indexing the filtered OCR text cells (filtered_ocr_cells) so that their indices start right after the last index of programmatic_cells.
https://github.com/docling-project/docling/blob/7a275c763731d9c96b7cf32f2e27b8dc8bebacd7/docling/models/base_ocr_model.py#L133-L145
Here's the change I made in post_process_cells:
filtered_ocr_cells = self._filter_ocr_cells(ocr_cells, programmatic_cells)
# re-indexing part
next_programmatic_cell_index = len(programmatic_cells) + 1
for i, textcell in enumerate(filtered_ocr_cells):
    textcell.index = next_programmatic_cell_index + i
programmatic_cells.extend(filtered_ocr_cells)
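A variant of this re-indexing can be sketched as a standalone helper. This is only an illustration: `Cell` and `reindex_ocr_cells` are hypothetical names, and the helper starts from the largest existing index rather than from `len(programmatic_cells)`, on the assumption that programmatic indices may not be contiguous.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    """Hypothetical stand-in for a text cell; only the fields needed here."""
    index: int
    text: str


def reindex_ocr_cells(programmatic_cells: List[Cell], ocr_cells: List[Cell]) -> List[Cell]:
    """Renumber ocr_cells in place so their indices follow the largest
    programmatic index, then return the combined list."""
    # default=-1 makes the first OCR index 0 when there are no programmatic cells.
    next_index = max((c.index for c in programmatic_cells), default=-1) + 1
    for i, cell in enumerate(ocr_cells):
        cell.index = next_index + i
    return programmatic_cells + ocr_cells


merged = reindex_ocr_cells(
    [Cell(0, "a"), Cell(5, "b")],  # non-contiguous programmatic indices
    [Cell(0, "x"), Cell(1, "y")],  # OCR indices that would otherwise collide
)
print([c.index for c in merged])  # [0, 5, 6, 7]
```

Starting from the maximum index keeps the result collision-free even when the existing indices have gaps, at the cost of one extra pass over the programmatic cells.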
For reference, the original post_process_cells (docling/docling/models/base_ocr_model.py, lines 133 to 145 in 7a275c7):
    def post_process_cells(self, ocr_cells, programmatic_cells):
        r"""
        Post-process the ocr and programmatic cells and return the final list of cells
        """
        if self.options.force_full_page_ocr:
            # If a full page OCR is forced, use only the OCR cells
            cells = ocr_cells
            return cells

        ## Remove OCR cells which overlap with programmatic cells.
        filtered_ocr_cells = self._filter_ocr_cells(ocr_cells, programmatic_cells)
        programmatic_cells.extend(filtered_ocr_cells)
        return programmatic_cells
You may have run into a different issue in BaseOcrModel. See issue #1644, which I posted along with this one.
It looks like this was resolved in #1745 — should we go ahead and mark it as resolved?
Many thanks for your great efforts!👍 Sorry, but I can't judge it right now; there were too many changes in this PR. I didn't expect so many changes to the base OCR model. 😅
Still, I had a glance at the PR. It looks like you created a post-processing step for OCR models in place of the one in the layout processor, and it uses _filter_ocr_cells to remove overlapping cells, which I think is reasonable. (Issue #1644 may still cause unexpected results, though; I assume you will deal with it later.)
Maybe we should mark this one as resolved and move on. Thank you again!
Best regards, Bill