bug/supplement_element_with_table_extraction is calling OCR on cropped image
Describe the bug
The table information extracted from the attached acciona.pdf file is underwhelming, whereas whole-page text extraction with tesseract (OCR) and pdfminer gives quite good results. After some debugging, it appears that the following code (at line 281) runs an additional tesseract OCR step on a cropped image of the page.
This is unfortunate because:
- the function already has the list of all LayoutElement on the page
- the additional OCR operation with tesseract can incur additional processing time
- the results of tesseract on the cropped image are not as good as those using the whole page.
https://github.com/Unstructured-IO/unstructured/blob/daaf1775b40ff5408e78b35f2fb8dce38694e0a6/unstructured/partition/pdf_image/ocr.py#L256-L289
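To illustrate the first point, here is a minimal sketch of how text already extracted at page level could be reused for a table region instead of running OCR again on the crop. This is not the library's code: `OcrToken`, `overlap_ratio`, `tokens_for_table`, and the 0.5 overlap threshold are hypothetical names and values chosen for illustration, and the sketch assumes each page-level text region carries a bounding box in the same pixel coordinates as the table's bounding box.

```python
from dataclasses import dataclass
from typing import List, Tuple

# (x1, y1, x2, y2) in page pixel coordinates
BBox = Tuple[float, float, float, float]


@dataclass
class OcrToken:
    """Hypothetical stand-in for a page-level text region (OCR or pdfminer)."""
    text: str
    bbox: BBox


def overlap_ratio(inner: BBox, outer: BBox) -> float:
    """Return the fraction of `inner`'s area that lies inside `outer`."""
    ix1, iy1 = max(inner[0], outer[0]), max(inner[1], outer[1])
    ix2, iy2 = min(inner[2], outer[2]), min(inner[3], outer[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = (inner[2] - inner[0]) * (inner[3] - inner[1])
    return inter / area if area > 0 else 0.0


def tokens_for_table(
    page_tokens: List[OcrToken],
    table_bbox: BBox,
    threshold: float = 0.5,  # assumed cut-off, not taken from the library
) -> List[OcrToken]:
    """Keep tokens mostly inside the table and shift them to crop coordinates."""
    x0, y0 = table_bbox[0], table_bbox[1]
    selected = []
    for tok in page_tokens:
        if overlap_ratio(tok.bbox, table_bbox) >= threshold:
            x1, y1, x2, y2 = tok.bbox
            shifted = (x1 - x0, y1 - y0, x2 - x0, y2 - y0)
            selected.append(OcrToken(tok.text, shifted))
    return selected


# Made-up page content: two tokens inside the table region, one outside it.
page_tokens = [
    OcrToken("Revenue", (110, 210, 180, 230)),
    OcrToken("2023", (300, 210, 340, 230)),
    OcrToken("Footer", (50, 900, 120, 920)),
]
table_tokens = tokens_for_table(page_tokens, (100, 200, 400, 500))
```

The selected tokens are expressed relative to the cropped table image, which is the form a table-structure step would presumably need, so no second tesseract pass is required in this sketch.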
To Reproduce
Place the acciona.pdf file in the example-docs folder of your checkout.
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="./example-docs/acciona.pdf",
    strategy="hi_res",
    infer_table_structure=True,
)
tables = [el for el in elements if el.category == "Table"]
print(tables[0].metadata.text_as_html)
Or see the following Colab notebook:
https://colab.research.google.com/drive/1KLtF8A_sA3cZChSvqV1cqbR4pcxXTg45?usp=sharing
The issue also exists in the Unstructured API. See the following Colab notebook:
https://colab.research.google.com/drive/1fqJtUo9OIvsbEeZTgoIrJ0NrvjcHLni4?usp=sharing
Expected behavior
Table extraction results consistent with the results of running tesseract or pdfminer on the whole page.
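One rough way to quantify "consistent" is to check how many of the words in the extracted table HTML also appear in the whole-page text. The sketch below is only an assumed metric for illustration: `html_to_tokens` and `token_agreement` are hypothetical helpers, and the HTML and page text are made-up sample strings rather than output from acciona.pdf.

```python
import re
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        self.chunks.append(data)


def html_to_tokens(html: str) -> List[str]:
    """Lowercased word tokens from the text content of `html`."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.findall(r"\w+", " ".join(parser.chunks).lower())


def token_agreement(table_html: str, page_text: str) -> float:
    """Fraction of table tokens that also occur in the whole-page text."""
    table_tokens = html_to_tokens(table_html)
    if not table_tokens:
        return 0.0
    page_vocab = set(re.findall(r"\w+", page_text.lower()))
    hits = sum(1 for tok in table_tokens if tok in page_vocab)
    return hits / len(table_tokens)


# Made-up sample: one cell ("Rcvenue") was misread on the cropped image.
sample_html = (
    "<table><tr><td>Revenue</td><td>2023</td></tr>"
    "<tr><td>Rcvenue</td><td>17</td></tr></table>"
)
sample_page_text = "Annual report. Revenue 2023 and 2022: 17 and 15."
score = token_agreement(sample_html, sample_page_text)
```

With real data, `table_html` would come from `metadata.text_as_html` and `page_text` from a whole-page tesseract or pdfminer pass; a score well below 1.0 would suggest the cropped-image OCR is diverging from the page-level text.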
Screenshots N/A
Environment Info
OS version: Linux-6.5.0-21-generic-x86_64-with-glibc2.38
Python version: 3.9.17
unstructured version: None
unstructured-inference version: 0.7.23
pytesseract version: 0.3.10
Torch version: 2.2.0
Detectron2 is not installed
PaddleOCR is not installed
Libmagic version: file-5.44 magic file from /etc/magic:/usr/share/misc/magic
LibreOffice version: LibreOffice 7.6.4.1 60(Build:1)
Additional context
This issue is related to issue #1875.