bug/PaddleOCR language specification issue

Open joshrbarcodefactory opened this issue 10 months ago • 1 comments

Describe the bug After updating to version 0.17.0, I am experiencing the same issue as #3400. When supplying languages=["en"] to partition_pdf with a strategy of either "auto" or "ocr_only", the OCR Agent is not passed through, which causes the following error to occur:

Traceback (most recent call last):
  File "/<MY_DIR>/main.py", line 20, in <module>
    elements = partition_pdf(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/documents/elements.py", line 581, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 816, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/file_utils/filetype.py", line 774, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/chunking/dispatch.py", line 74, in wrapper
    elements = func(*args, **kwargs)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 228, in partition_pdf
    return partition_pdf_or_image(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 379, in partition_pdf_or_image
    elements = _partition_pdf_or_image_with_ocr(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 934, in _partition_pdf_or_image_with_ocr
    page_elements = _partition_pdf_or_image_with_ocr_from_image(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/pdf.py", line 962, in _partition_pdf_or_image_with_ocr_from_image
    ocr_agent = OCRAgent.get_agent(language=ocr_languages)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 34, in get_agent
    return cls.get_instance(ocr_agent_cls_qname, language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/ocr_interface.py", line 49, in get_instance
    return loaded_class(language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 23, in __init__
    self.agent = self.load_agent(language)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured/partition/utils/ocr_models/paddle_ocr.py", line 45, in load_agent
    paddle_ocr = PaddleOCR(
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 610, in __init__
    lang, det_lang = parse_lang(params.lang)
  File "/<MY_DIR>/.venv/lib/python3.10/site-packages/unstructured_paddleocr/paddleocr.py", line 479, in parse_lang
    lang in MODEL_URLS["OCR"][DEFAULT_OCR_MODEL_VERSION]["rec"]
AssertionError: param lang must in dict_keys(['ch', 'en', 'korean', 'japan', 'chinese_cht', 'ta', 'te', 'ka', 'latin', 'arabic', 'cyrillic', 'devanagari']), but got eng

To Reproduce Setup: Run pip install "unstructured[pdf]"==0.17.0 paddlepaddle unstructured.paddleocr. I also had to run pip uninstall torch -y and then pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu

import os  # needed for the os.environ assignments below

from unstructured.partition.pdf import partition_pdf
import unstructured.partition.utils.ocr_models.paddle_ocr as paddle_ocr_module
from unstructured_inference.inference.layoutelement import LayoutElements

paddle_ocr_module.LayoutElements = LayoutElements # workaround for #3931

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle" # found in an old issue
os.environ["DEFAULT_PADDLE_LANG"] = "en" # found in an old issue
filename = "path_to_your_file.pdf"

elements = partition_pdf(
    filename=filename,
    strategy="ocr_only",
    languages=["en"],
    table_ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
    ocr_agent="unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle",
)

Expected behavior Script would run without errors and return elements.

Environment Info Broken Env:

OS version:  Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version:  3.10.12
unstructured version:  0.17.0
unstructured-inference version:  0.8.9
pytesseract is not installed
Torch version:  2.6.0+cpu
Detectron2 is not installed
PaddleOCR version:  2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed

Working Env:

OS version:  Linux-6.8.0-52-generic-x86_64-with-glibc2.35
Python version:  3.10.12
unstructured version:  0.16.25
unstructured-inference version:  0.8.9
pytesseract is not installed
Torch version:  2.6.0+cpu
Detectron2 is not installed
PaddleOCR version:  2.6.2
Libmagic version: file-5.41
magic file from /etc/magic:/usr/share/misc/magic
LibreOffice is not installed

Additional context From what I can tell, the ocr_agent isn't being passed to _partition_pdf_or_image_with_ocr_from_image. The tesseract_to_paddle_language call is only made inside _partition_pdf_or_image_local > process_file_with_ocr > supplement_page_layout_with_ocr, which _partition_pdf_or_image_with_ocr never calls. As a result, languages=["en"] is converted to the Tesseract form by ocr_languages = prepare_languages_for_tesseract(languages), so "en" becomes "eng" and is never mapped back to a PaddleOCR code. The later call ocr_agent = OCRAgent.get_agent(language=ocr_languages) then fails with the AssertionError shown above.
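To make the mismatch concrete, here is a minimal, self-contained sketch of the kind of translation step that appears to be skipped on the "ocr_only" path. It is not the library's code: the function name to_paddle_lang and the TESSERACT_TO_PADDLE table are hypothetical and only cover a few languages. The set of accepted PaddleOCR values is copied from the AssertionError in the traceback.

```python
# Language codes PaddleOCR accepts, copied from the AssertionError message.
PADDLE_LANGS = {
    "ch", "en", "korean", "japan", "chinese_cht", "ta", "te", "ka",
    "latin", "arabic", "cyrillic", "devanagari",
}

# Assumed, partial mapping for illustration only; the real table may differ.
TESSERACT_TO_PADDLE = {
    "eng": "en",
    "chi_sim": "ch",
    "chi_tra": "chinese_cht",
    "kor": "korean",
    "jpn": "japan",
}


def to_paddle_lang(tesseract_langs: str, default: str = "en") -> str:
    """Map a Tesseract-style string ("eng" or "eng+kor") to one PaddleOCR code."""
    # This sketch only looks at the first language in a "+"-joined string.
    first = tesseract_langs.split("+")[0]
    paddle = TESSERACT_TO_PADDLE.get(first, default)
    if paddle not in PADDLE_LANGS:
        raise ValueError(f"unsupported PaddleOCR language: {paddle!r}")
    return paddle


# "eng" is what reaches PaddleOCR in the traceback, and it is not accepted.
print("eng" in PADDLE_LANGS)   # False
# After translation the value is one PaddleOCR accepts.
print(to_paddle_lang("eng"))   # en
```

Without a step like this between prepare_languages_for_tesseract and OCRAgent.get_agent, the Tesseract code "eng" is handed to PaddleOCR unchanged, which matches the "but got eng" message.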

Mar 14 '25 18:03 joshrbarcodefactory

Hi @joshrbarcodefactory ,

Thanks for submitting this issue. It seems the problem actually comes from

os.environ["OCR_AGENT"] = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle" # found in an old issue

0.17.0 refactors the OCR agent so that the agent module path is passed in as a kwarg (as in your code snippet), and the OCR_AGENT environment variable is NOT used any more. For now, to get around it, please remove the line above from your code and unset the env var if it is set from another source.

That said, this does seem like a bug where the old way of specifying the agent gets mixed up with the new way, so we will look into that as well.
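A minimal sketch of the suggested workaround, assuming the only change needed is dropping OCR_AGENT from the environment and keeping the ocr_agent and table_ocr_agent kwargs from the original snippet. The helper name partition_with_paddle is hypothetical, and this thread does not confirm that the workaround fully resolves the error.

```python
import os

# Remove the legacy variable whether it was set in this process or inherited
# from the shell; pop() with a default does nothing if it is not set.
os.environ.pop("OCR_AGENT", None)

PADDLE_AGENT = "unstructured.partition.utils.ocr_models.paddle_ocr.OCRAgentPaddle"


def partition_with_paddle(filename: str):
    """Hypothetical helper: call partition_pdf with the agent given as kwargs only."""
    # Imported here so the environment cleanup above runs before the library loads.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(
        filename=filename,
        strategy="ocr_only",
        languages=["en"],
        ocr_agent=PADDLE_AGENT,
        table_ocr_agent=PADDLE_AGENT,
    )
```

If the variable is exported in a shell profile or container config, it also needs to be unset there, since the pop above only affects the current process.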

Cheers, Yao

Mar 18 '25 20:03 badGarnet