PyMuPDF
MacOS uses Tesseract and not Tesseract-OCR
Description of the bug
pymupdf/__init__.py in ?(tessdata)
17818 # Unix-like systems:
17819 cp = subprocess.run("whereis tesseract-ocr", shell=1, capture_output=1, check=0, text=True)
17820 response = cp.stdout.strip().split()
17821 if cp.returncode or len(response) != 2: # if not 2 tokens: no tesseract-ocr
> 17822 raise RuntimeError("No tessdata specified and Tesseract is not installed")
17823
17824 # search tessdata in folder structure
17825 dirname = response[1] # contains tesseract-ocr installation folder
RuntimeError: No tessdata specified and Tesseract is not installed
How to reproduce the bug
PyMuPDF installation command:
uv add pymupdf
Issue:
for page in doc:
    textPage = page.get_textpage_ocr()
    print(textPage.extract_text())
Running the script above raises the RuntimeError shown in the traceback.
On macOS, Tesseract is installed with brew install tesseract, which provides a tesseract executable. There is no tesseract-ocr package, so the whereis tesseract-ocr lookup finds nothing.
Tesseract Installation Proof:
tesseract: /opt/homebrew/bin/tesseract
tesseract-ocr:
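The two outputs above can be replayed against the token check from the traceback. This is only a sketch: passes_check is a hypothetical helper that mirrors the len(response) != 2 test, not a PyMuPDF function.

```python
def passes_check(whereis_stdout: str) -> bool:
    """Mirror the traceback's test: exactly two whitespace-separated tokens."""
    return len(whereis_stdout.strip().split()) == 2


# Output for the name Homebrew actually installs: label plus path, two tokens.
print(passes_check("tesseract: /opt/homebrew/bin/tesseract"))

# Output for the name PyMuPDF 1.26.1 asks for: label only, one token.
print(passes_check("tesseract-ocr:"))
```

The second call is the case that leads to "No tessdata specified and Tesseract is not installed", even though Tesseract is present under the other name.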
PyMuPDF version
1.26.1
Operating system
MacOS
Python version
3.12
You know that you can fix this by either directly providing the folder name of tessdata or setting the TESSDATA_PREFIX environment variable (before starting your script)?
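A minimal sketch of that workaround for versions before the fix. find_tessdata is a hypothetical helper, not part of PyMuPDF. It assumes the language data lives in share/tessdata next to the bin folder that holds the tesseract executable, which may not match every installation.

```python
import os
import shutil
from pathlib import Path


def find_tessdata(binary: str = "tesseract"):
    """Return a tessdata folder as a string, or None if none is found.

    Hypothetical helper; the share/tessdata layout is an assumption.
    """
    # An explicitly configured folder takes precedence.
    env = os.environ.get("TESSDATA_PREFIX")
    if env and Path(env).is_dir():
        return env

    # Otherwise look the executable up on PATH under its actual name.
    exe = shutil.which(binary)
    if exe is None:
        return None

    # Try the PATH location first, then the symlink target.
    for start in (Path(exe), Path(exe).resolve()):
        candidate = start.parent.parent / "share" / "tessdata"
        if candidate.is_dir():
            return str(candidate)
    return None


# Usage with PyMuPDF (commented out so the sketch runs without it;
# "input.pdf" is a placeholder file name):
#
#   import pymupdf
#   doc = pymupdf.open("input.pdf")
#   for page in doc:
#       textpage = page.get_textpage_ocr(tessdata=find_tessdata())
```

Passing the folder through the tessdata argument of get_textpage_ocr skips the failing lookup. Exporting TESSDATA_PREFIX in the shell before starting the script should have the same effect.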
Fixed in PyMuPDF-1.26.4.