
MacOS uses Tesseract and not Tesseract-OCR

Open · avigoen opened this issue 7 months ago · 1 comment

Description of the bug

pymupdf/__init__.py in ?(tessdata)
  17818     # Unix-like systems:
  17819     cp = subprocess.run("whereis tesseract-ocr", shell=1, capture_output=1, check=0, text=True)
  17820     response = cp.stdout.strip().split()
  17821     if cp.returncode or len(response) != 2:  # if not 2 tokens: no tesseract-ocr
> 17822         raise RuntimeError("No tessdata specified and Tesseract is not installed")
  17823 
  17824     # search tessdata in folder structure
  17825     dirname = response[1]  # contains tesseract-ocr installation folder

RuntimeError: No tessdata specified and Tesseract is not installed

How to reproduce the bug

PyMuPDF installation command: uv add pymupdf

Issue:

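import pymupdf

doc = pymupdf.open("input.pdf")  # placeholder path; the report does not name the file
for page in doc:
    textPage = page.get_textpage_ocr()  # the RuntimeError is raised here
    print(textPage.extractText())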

Running the script above gives me the RuntimeError shown in the traceback.

On macOS, Tesseract is installed with brew install tesseract, and there is no package named tesseract-ocr, so the whereis tesseract-ocr lookup finds nothing.

Tesseract installation proof:

tesseract: /opt/homebrew/bin/tesseract
tesseract-ocr:
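To make the failure concrete: the quoted PyMuPDF code splits the whereis output on whitespace and expects exactly two tokens. The sketch below replays that check on the two output lines from this report. The names parse_whereis and locate_binary are hypothetical helpers for illustration, not PyMuPDF functions, and the shutil.which fallback is only one possible approach, not necessarily the fix that was shipped.

```python
import shutil


def parse_whereis(stdout: str) -> list[str]:
    """Split `whereis` output the same way the quoted PyMuPDF code does."""
    return stdout.strip().split()


def locate_binary(candidates=("tesseract-ocr", "tesseract")):
    """Hypothetical helper: return the first candidate found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


# Output lines copied from the report (Homebrew install on macOS).
missing = parse_whereis("tesseract-ocr:")
found = parse_whereis("tesseract: /opt/homebrew/bin/tesseract")

print(len(missing))  # one token, so the `len(response) != 2` check raises
print(len(found))    # two tokens, so the check would pass for this name
print(locate_binary())  # depends on what is installed on the current machine
```

Probing both names would cover Homebrew, where the binary is called tesseract, as well as systems where a tesseract-ocr entry exists.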

PyMuPDF version

1.26.1

Operating system

MacOS

Python version

3.12

avigoen, Jun 18 '25 05:06

You know that you can work around this by either passing the tessdata folder directly (the tessdata argument) or setting the TESSDATA_PREFIX environment variable before starting your script?

JorjMcKie, Jun 18 '25 12:06
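A minimal sketch of the workaround suggested in the comment above, for versions before the fix. It assumes the Homebrew tessdata folder is /opt/homebrew/share/tessdata, which the report does not state, so adjust the path to your install. The helper names resolve_tessdata and ocr_text are hypothetical; get_textpage_ocr and its tessdata parameter are the PyMuPDF calls involved.

```python
import os
from pathlib import Path

# Assumed Homebrew location on Apple Silicon; not confirmed by the report.
ASSUMED_TESSDATA = "/opt/homebrew/share/tessdata"


def resolve_tessdata(explicit=None):
    """Return an existing tessdata folder from the argument or TESSDATA_PREFIX, else None."""
    candidate = explicit or os.environ.get("TESSDATA_PREFIX")
    if candidate and Path(candidate).is_dir():
        return candidate
    return None


def ocr_text(pdf_path, tessdata):
    """Hypothetical helper: OCR every page, passing the tessdata folder explicitly."""
    import pymupdf  # imported lazily so resolve_tessdata works without PyMuPDF

    with pymupdf.open(pdf_path) as doc:
        return [page.get_textpage_ocr(tessdata=tessdata).extractText() for page in doc]


if __name__ == "__main__":
    folder = resolve_tessdata(ASSUMED_TESSDATA)
    if folder is None:
        print("tessdata folder not found; pass the correct path or set TESSDATA_PREFIX")
    else:
        print("using tessdata at", folder)
```

Passing the folder explicitly means the whereis lookup is never needed. Setting TESSDATA_PREFIX in the shell before launching the script should have the same effect.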

Fixed in PyMuPDF-1.26.4.