bug/supplement_element_with_table_extraction is calling OCR on cropped image

Open kaaloo opened this issue 1 year ago • 1 comments

Describe the bug

The results of extracting table information from the attached acciona.pdf file are underwhelming, whereas the results of OCR via tesseract and pdfminer on the whole page are quite good. After some debugging, it appears that the code linked below runs an additional OCR step with tesseract on a cropped image of the page, on line 281.

This is unfortunate because:

  • the function already has the list of all LayoutElement objects on the page
  • the additional OCR pass with tesseract adds processing time
  • the results of tesseract on the cropped image are not as good as those from the whole page.

https://github.com/Unstructured-IO/unstructured/blob/daaf1775b40ff5408e78b35f2fb8dce38694e0a6/unstructured/partition/pdf_image/ocr.py#L256-L289
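To illustrate the alternative, here is a rough sketch of reusing page-level text regions instead of re-running OCR on the crop. It is not the library's code: `TextRegion` and `regions_inside` are hypothetical names made up for this example, and the 0.5 overlap threshold is an arbitrary assumption. The idea is to keep the regions that mostly fall inside the table's bounding box and shift them into the crop's coordinate system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextRegion:
    """Hypothetical stand-in for a text region found on the whole page."""
    x1: float
    y1: float
    x2: float
    y2: float
    text: str


def regions_inside(
    table_bbox: Tuple[float, float, float, float],
    page_regions: List[TextRegion],
    min_overlap: float = 0.5,  # assumed threshold, not taken from the library
) -> List[TextRegion]:
    """Select page-level regions lying mostly inside the table box.

    The returned regions are shifted so that the table's top-left corner
    becomes the origin, matching the coordinates of a cropped table image.
    """
    tx1, ty1, tx2, ty2 = table_bbox
    selected = []
    for r in page_regions:
        area = max(r.x2 - r.x1, 0) * max(r.y2 - r.y1, 0)
        if area == 0:
            continue  # skip degenerate boxes
        # Width and height of the intersection with the table box.
        iw = min(r.x2, tx2) - max(r.x1, tx1)
        ih = min(r.y2, ty2) - max(r.y1, ty1)
        if iw <= 0 or ih <= 0:
            continue  # no overlap at all
        if (iw * ih) / area >= min_overlap:
            selected.append(
                TextRegion(r.x1 - tx1, r.y1 - ty1, r.x2 - tx1, r.y2 - ty1, r.text)
            )
    return selected
```

With something along these lines, the text already extracted from the whole page could be handed to the table-structure step directly, avoiding the second tesseract call.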

To Reproduce

Place the acciona.pdf file in the example-docs folder of your checkout.

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="./example-docs/acciona.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)
tables = [el for el in elements if el.category == "Table"]
print(tables[0].metadata.text_as_html)

Or see the following Colab notebook:

https://colab.research.google.com/drive/1KLtF8A_sA3cZChSvqV1cqbR4pcxXTg45?usp=sharing

The issue also exists in the Unstructured API. See the following Colab notebook:

https://colab.research.google.com/drive/1fqJtUo9OIvsbEeZTgoIrJ0NrvjcHLni4?usp=sharing

Expected behavior

Results consistent with the results of running tesseract or pdfminer on the whole page.

Screenshots

N/A

Environment Info

OS version: Linux-6.5.0-21-generic-x86_64-with-glibc2.38
Python version: 3.9.17
unstructured version: None
unstructured-inference version: 0.7.23
pytesseract version: 0.3.10
Torch version: 2.2.0
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.44 magic file from /etc/magic:/usr/share/misc/magic
LibreOffice version: LibreOffice 7.6.4.1 60(Build:1)

Additional context

acciona.pdf

kaaloo • Feb 23 '24 18:02

This issue is related to issue #1875.

christinestraub • Feb 23 '24 19:02