
bug: TesseractError: Estimating resolution as X

Open qued opened this issue 1 year ago • 1 comments

Describe the bug: The user gets a TesseractError when processing a particular document.

To Reproduce: The user made an API call with a particular image-based document.

Expected behavior: The document is processed successfully.

Environment Info: Running in the self-hosted open-source API. Unstructured version 0.12.3. Tesseract version 5.3.3.

Additional context: The user was able to process the same document successfully with Tesseract version 4.1.1.
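Since the behavior differs between Tesseract 4.1.1 and 5.3.3, it may help to confirm which binary the API process actually invokes. The following is a minimal sketch, not part of unstructured: `parse_tesseract_version` and `installed_tesseract_version` are hypothetical helper names, and the sketch assumes the first line of `tesseract --version` has the form `tesseract 5.3.3`.

```python
import re
import subprocess


def parse_tesseract_version(version_output):
    """Extract (major, minor, patch) from `tesseract --version` output.

    Assumes the output contains a token like "tesseract 5.3.3";
    returns None if no such token is found.
    """
    match = re.search(r"tesseract\s+v?(\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def installed_tesseract_version(binary="tesseract"):
    """Run the binary and parse its version; None if it cannot be run."""
    try:
        result = subprocess.run(
            [binary, "--version"],
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:
        # Binary not found or not executable on this machine.
        return None
    # Some builds print the version to stderr, so check both streams.
    return parse_tesseract_version(result.stdout + result.stderr)


if __name__ == "__main__":
    print(installed_tesseract_version())
```

Running this inside the same container or environment as the API shows whether the 5.x binary is the one on the `PATH`.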

Stack trace:

  File "/home/notebook-user/unstructured/partition/pdf.py", line 213, in partition_pdf
    return partition_pdf_or_image(
  File "/home/notebook-user/unstructured/partition/pdf.py", line 298, in partition_pdf_or_image
    elements = _partition_pdf_or_image_local(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf.py", line 494, in _partition_pdf_or_image_local
    final_document_layout = process_data_with_ocr(
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 82, in process_data_with_ocr
    merged_layouts = process_file_with_ocr(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 178, in process_file_with_ocr
    raise e
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 166, in process_file_with_ocr
    merged_page_layout = supplement_page_layout_with_ocr(
  File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
    return func(*args, **kwargs)
  File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 202, in supplement_page_layout_with_ocr
    ocr_layout = ocr_agent.get_layout_from_image(
  File "/home/notebook-user/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 48, in get_layout_from_image
    ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 591, in image_to_data
    return {
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 593, in <lambda>
    Output.DATAFRAME: lambda: get_pandas_output(
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 568, in get_pandas_output
    return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 347, in run_and_get_output
    run_tesseract(**kwargs)
  File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 279, in run_tesseract
    raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 252')
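
A note on reading the last line of the trace: the exception carries `proc.returncode` and the text Tesseract wrote to stderr. Assuming `proc` is a standard Python `subprocess` process object (the `run_tesseract` frame suggests this, but it is not confirmed here), a negative return code -N on POSIX means the child was terminated by signal N. In that case "Estimating resolution as 252" may simply be the last message printed before the process died, not the cause itself. The sketch below uses a hypothetical helper, `describe_tesseract_failure`, to pull both pieces out of the pair so they can be included in a report.

```python
import re
import signal


def describe_tesseract_failure(returncode, message):
    """Summarize a (returncode, message) pair like the one in the trace.

    Hypothetical helper for triage; not part of unstructured or pytesseract.
    """
    signal_name = None
    if returncode < 0:
        # subprocess reports "killed by signal N" as return code -N on POSIX.
        try:
            signal_name = signal.Signals(-returncode).name
        except ValueError:
            signal_name = "signal {}".format(-returncode)

    # The resolution Tesseract guessed for the image, if it printed one.
    match = re.search(r"Estimating resolution as (\d+)", message)
    resolution = int(match.group(1)) if match else None

    return {
        "returncode": returncode,
        "signal": signal_name,
        "estimated_resolution": resolution,
    }


summary = describe_tesseract_failure(-8, "Estimating resolution as 252")
print(summary)
```

Under these assumptions, the values from this issue map to an estimated resolution of 252 and signal 8, which is `SIGFPE` on Linux.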

qued, Apr 17 '24 17:04

Slack conversation: https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1713364225537139

We've previously encountered this error in #1920 and closed the issue with #1996. The user is running a version of unstructured with the fix merged, so presumably this is the same error showing up for a different reason.

qued, Apr 17 '24 17:04

@qued, @scanny: Is there any update on this issue?

esakes1, May 28 '24 08:05

@esakes1 Can you say more about what you're seeing and when? In particular, which specific error message are you getting (including the estimated resolution)?

And can you provide an example document with which we can reproduce?

scanny, May 28 '24 18:05