bug: TesseractError: Estimating resolution as X
Describe the bug
User gets a TesseractError when processing a particular document.
To Reproduce
The request was an API call with a particular image-based document.
Expected behavior
The document is processed successfully.
Environment Info
Running the self-hosted open-source API. Unstructured version 0.12.3. Tesseract version 5.3.3.
Additional context
The user was able to process the document successfully with Tesseract version 4.1.1.
Stack trace:
File "/home/notebook-user/unstructured/partition/pdf.py", line 213, in partition_pdf
return partition_pdf_or_image(
File "/home/notebook-user/unstructured/partition/pdf.py", line 298, in partition_pdf_or_image
elements = _partition_pdf_or_image_local(
File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
return func(*args, **kwargs)
File "/home/notebook-user/unstructured/partition/pdf.py", line 494, in _partition_pdf_or_image_local
final_document_layout = process_data_with_ocr(
File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 82, in process_data_with_ocr
merged_layouts = process_file_with_ocr(
File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
return func(*args, **kwargs)
File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 178, in process_file_with_ocr
raise e
File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 166, in process_file_with_ocr
merged_page_layout = supplement_page_layout_with_ocr(
File "/home/notebook-user/unstructured/utils.py", line 214, in wrapper
return func(*args, **kwargs)
File "/home/notebook-user/unstructured/partition/pdf_image/ocr.py", line 202, in supplement_page_layout_with_ocr
ocr_layout = ocr_agent.get_layout_from_image(
File "/home/notebook-user/unstructured/partition/utils/ocr_models/tesseract_ocr.py", line 48, in get_layout_from_image
ocr_df: pd.DataFrame = unstructured_pytesseract.image_to_data(
File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 591, in image_to_data
return {
File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 593, in <lambda>
Output.DATAFRAME: lambda: get_pandas_output(
File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 568, in get_pandas_output
return pd.read_csv(BytesIO(run_and_get_output(*args)), **kwargs)
File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 347, in run_and_get_output
run_tesseract(**kwargs)
File "/usr/local/lib/python3.10/site-packages/unstructured_pytesseract/pytesseract.py", line 279, in run_tesseract
raise TesseractError(proc.returncode, get_errors(error_string))
unstructured_pytesseract.pytesseract.TesseractError: (-8, 'Estimating resolution as 252')
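A note on reading the status code in the last line: the trace shows the first element of the error tuple coming from `proc.returncode`. Assuming the process is launched through Python's `subprocess` module on a POSIX system, a negative return code `-N` means the child was terminated by signal `N`. Under that assumption, the minimal sketch below decodes the value; `describe_returncode` is a hypothetical helper written for illustration, not part of unstructured or pytesseract.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Describe a subprocess return code in human-readable terms.

    Hypothetical helper for illustration only.
    """
    if returncode < 0:
        # subprocess reports "terminated by signal N" as -N on POSIX.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {returncode}"


print(describe_returncode(-8))
```

On Linux this reports signal 8 as `SIGFPE`. If the assumption holds, the "Estimating resolution as 252" text may simply be the last message Tesseract wrote to stderr before it crashed, rather than the cause of the failure.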
Slack conversation: https://unstructuredw-kbe4326.slack.com/archives/C044N0YV08G/p1713364225537139
We've previously encountered this error in #1920 and closed the issue with #1996. The user is running a version of unstructured with the fix merged, so presumably this is the same error showing up for a different reason.
@qued, @scanny: Is there any update on this issue?
@esakes1 Can you say more about what you're seeing and when? In particular, which specific error message are you getting (including the estimated resolution)?
And can you provide an example document with which we can reproduce?
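For anyone collecting the details requested above from logs, here is a rough sketch that pulls the status code and estimated resolution out of a logged error line. It assumes the line has the same shape as the final line of the stack trace in this issue; `parse_tesseract_error` and both regular expressions are illustrative and not part of unstructured.

```python
import re
from typing import Optional, Tuple

# Matches the tuple printed after "TesseractError:" in the logged line.
_ERROR_RE = re.compile(r"TesseractError: \((-?\d+), '(.*)'\)")
# Matches the resolution estimate inside the error message, if present.
_RESOLUTION_RE = re.compile(r"Estimating resolution as (\d+)")


def parse_tesseract_error(line: str) -> Optional[Tuple[int, str, Optional[int]]]:
    """Return (status, message, estimated_resolution), or None if no match.

    Hypothetical helper for illustration only.
    """
    match = _ERROR_RE.search(line)
    if match is None:
        return None
    status = int(match.group(1))
    message = match.group(2)
    res_match = _RESOLUTION_RE.search(message)
    resolution = int(res_match.group(1)) if res_match else None
    return (status, message, resolution)


line = (
    "unstructured_pytesseract.pytesseract.TesseractError: "
    "(-8, 'Estimating resolution as 252')"
)
print(parse_tesseract_error(line))
```

Run over a log file line by line, this would make it easy to report which status codes and resolution estimates appear for each failing document.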