bug/PaddleOCR language specification issue
Describe the bug
After updating to version 0.17.0, I am experiencing the same issue as #3400. When supplying languages=["en"] to partition_pdf with a strategy of either "auto" or "ocr_only", the OCR Agent is not passed through, which causes the following error to occur:
Traceback (most recent call last):
File "/<MY_DIR>/main.py", line 20, in <module>
elements = partition_pdf(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 581, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 816, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 774, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
elements = func(*args, **kwargs)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 228, in partition_pdf
return partition_pdf_or_image(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 379, in partition_pdf_or_image
elements = _partition_pdf_or_image_with_ocr(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 934, in _partition_pdf_or_image_with_ocr
page_elements = _partition_pdf_or_image_with_ocr_from_image(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 962, in _partition_pdf_or_image_with_ocr_from_image
ocr_agent = OCRAgent.get_agent(language=ocr_languages)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 34, in get_agent
return cls.get_instance(ocr_agent_cls_qname, language)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 49, in get_instance
return loaded_class(language)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 23, in __init__
self.agent = self.load_agent(language)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 45, in load_agent
paddle_ocr = PaddleOCR(
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 610, in __init__
lang, det_lang = parse_lang(params.lang)
File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 479, in parse_lang
lang in MODEL_URLS["OCR"][DEFAULT_OCR_MODEL_VERSION]["rec"]
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng
To Reproduce
Setup:
Run pip install "unstructured[pdf]"==0.17.0 paddlepaddle unstructured.paddleocr. I also had to run pip uninstall torch -y and then pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
import os

from unstructured.partition.pdf import partition_pdf
import unstructured.partition.utils.ocr_models.paddle_ocr as paddle_ocr_module
from unstructured_inference.inference.layoutelement import LayoutElements

paddle_ocr_module.LayoutElements = LayoutElements  # workaround for #3931
os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"  # found in an old issue
os.environ["DEFAULT_PADDLE_LANG"] = "en"  # found in an old issue

filename = "path_to_your_file.pdf"
elements = partition_pdf(
    filename=filename,
    strategy="ocr_only",
    languages=["en"],
    table_ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
)
Expected behavior
The script runs without errors and returns elements.
Environment Info
Broken Env:
OS version: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version: 3.10.12
unstructured version: 0.17.0
unstructured-inference version: 0.8.9
pytesseract is not installed
Torch version: 2.6.0+cpu
Detectron2 is not installed
PaddleOCR version: 2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed
Working Env:
OS version: Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version: 3.10.12
unstructured version: 0.16.25
unstructured-inference version: 0.8.9
pytesseract is not installed
Torch version: 2.6.0+cpu
Detectron2 is not installed
PaddleOCR version: 2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed
Additional context
From what I can tell, ocr_agent isn't being passed to _partition_pdf_or_image_with_ocr_from_image. The tesseract_to_paddle_language call only happens inside _partition_pdf_or_image_local > process_file_with_ocr > supplement_page_layout_with_ocr, which _partition_pdf_or_image_with_ocr never calls. As a result, languages=["en"] is converted to the Tesseract language format by ocr_languages = prepare_languages_for_tesseract(languages) and is never converted back, so the later call ocr_agent = OCRAgent.get_agent(language=ocr_languages) passes "eng" to PaddleOCR, which rejects it.
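To make the missing conversion step concrete, here is a minimal, self-contained sketch. It does not use unstructured's own code: the mapping table and the to_paddle_lang helper are hypothetical and for illustration only, and the set of accepted PaddleOCR codes is copied from the AssertionError above.

```python
# Language codes PaddleOCR accepted, copied from the AssertionError in the traceback.
PADDLE_LANGS = {
    "ch", "en", "korean", "japan", "chinese_cht", "ta", "te", "ka",
    "latin", "arabic", "cyrillic", "devanagari",
}

# Hypothetical mapping for illustration; the real table in unstructured may differ.
TESSERACT_TO_PADDLE = {"eng": "en", "kor": "korean", "jpn": "japan"}


def to_paddle_lang(tesseract_langs: str) -> str:
    """Convert a Tesseract language string (e.g. "eng" or "eng+kor") to one Paddle code.

    Only the first language is used, since this sketch assumes the Paddle
    agent is configured with a single language.
    """
    first = tesseract_langs.split("+")[0]
    paddle_lang = TESSERACT_TO_PADDLE.get(first, first)
    if paddle_lang not in PADDLE_LANGS:
        raise ValueError(f"unsupported PaddleOCR language: {paddle_lang!r}")
    return paddle_lang


# "eng" is what reaches PaddleOCR in the failing code path; it is not a valid
# Paddle code, but it becomes valid once the conversion step is applied.
print("eng" in PADDLE_LANGS)
print(to_paddle_lang("eng"))
```

In the failing path the equivalent of to_paddle_lang is never applied, so the Tesseract-style code reaches PaddleOCR unchanged.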
Hi @joshrbarcodefactory ,
Thanks for submitting this issue. It seems the problem actually comes from
os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle" # found in an old issue
0.17.0 refactors the OCR agent so that the agent module path is passed in as a kwarg (as in your code snippet) and is NOT read from the environment any more. For now, to work around it, please remove the line above from your code and unset the environment variable if it is set from another source.
That said, this does seem like a bug where the old way of specifying the agent gets mixed up with the new way, so we will look into that as well.
Cheers, Yao
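The workaround suggested in the reply could be sketched roughly as follows. This is only an illustration under stated assumptions: clear_legacy_ocr_env is a hypothetical helper, not part of unstructured, and it removes only the OCR_AGENT variable named in the reply.

```python
import os

# Environment variable the reply asks to stop using in 0.17.0.
LEGACY_OCR_ENV_VARS = ("OCR_AGENT",)


def clear_legacy_ocr_env(environ=os.environ):
    """Remove legacy OCR agent variables from `environ`; return the names removed.

    Hypothetical helper for illustration. Pass a plain dict to try it out
    without touching the real process environment.
    """
    removed = []
    for name in LEGACY_OCR_ENV_VARS:
        if environ.pop(name, None) is not None:
            removed.append(name)
    return removed


# Demonstration on a copy, so the real environment is left alone here.
env_copy = dict(os.environ)
env_copy["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"
print(clear_legacy_ocr_env(env_copy))
```

After calling clear_legacy_ocr_env() on the real environment, partition_pdf would be called with the ocr_agent and table_ocr_agent kwargs exactly as in the reproduction snippet.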