PyMuPDF Cannot get Tessdata with Tesseract-OCR 5

Description of the bug

The pymupdf.get_tessdata() function raises an unexpected error when the installed version of Tesseract OCR is not 4.0 (tested on the latest Debian, with Tesseract 5).

>>> import pymupdf
>>> pymupdf.get_tessdata()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "<...>/venv/lib/python3.11/site-packages/pymupdf/__init__.py", line 18082, in get_tessdata
    for sub_response in response.iterdir():
                        ^^^^^^^^^^^^^^^^
AttributeError: 'list' object has no attribute 'iterdir'

>>> pymupdf.version
('1.24.9', '1.24.8', '20240724000001')

How to reproduce the bug

I haven't looked into the details yet, but I think the problem lies here: https://github.com/pymupdf/PyMuPDF/blob/eca70661ae29a75aa4150a4a77f9b8d4e81979cc/src/init.py#L18093-L18099

I have the latest Debian with Tesseract OCR 5.3.0, installed in /usr/share/tesseract-ocr/5/tessdata/. The function get_tessdata() expects it in /usr/share/tesseract-ocr/4.00/tessdata; otherwise it searches for it with whereis tesseract-ocr.

However, it then calls iterdir() on the subprocess response, which is a list of bytes objects rather than a path, and that raises the error.

>>> import subprocess
>>> cp = subprocess.run('whereis tesseract-ocr', shell=1, capture_output=1, check=0)
>>> cp
CompletedProcess(args='whereis tesseract-ocr', returncode=0, stdout=b'tesseract-ocr: /usr/share/tesseract-ocr\n', stderr=b'')
>>> response = cp.stdout.strip().split()
>>> response
[b'tesseract-ocr:', b'/usr/share/tesseract-ocr']
>>> type(response), type(response[0])
(<class 'list'>, <class 'bytes'>)
>>> 
>>> response.iterdir()
Traceback (most recent call last):
  File "<console>", line 1, in <module>
AttributeError: 'list' object has no attribute 'iterdir'
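
As a minimal sketch of the conversion that is missing here, the raw whereis output can be turned into pathlib.Path objects before anything calls iterdir(). The helper name parse_whereis is hypothetical and is not part of PyMuPDF; it also assumes that the reported paths contain no spaces.

```python
from pathlib import Path


def parse_whereis(stdout: bytes) -> list:
    """Turn raw `whereis` output into a list of Path objects.

    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    text = stdout.decode("utf-8").strip()
    # Output looks like "name: /path/one /path/two"; drop the "name:" label.
    _, _, locations = text.partition(":")
    # Assumption: paths contain no spaces, so whitespace splitting is enough.
    return [Path(p) for p in locations.split()]


paths = parse_whereis(b"tesseract-ocr: /usr/share/tesseract-ocr\n")
print(paths)
```

When whereis finds nothing it prints only the label, so the function returns an empty list instead of raising.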

I don't quite know the inner workings of Tesseract or PyMuPDF, but it seems that this function is looking for a sub-sub-folder whose name ends with tessdata, and should find it under the second element of response. So I guess something like this should work?

import pathlib
import subprocess

cp = subprocess.run('whereis tesseract-ocr', shell=True, capture_output=True, check=False)
response = cp.stdout.strip().split()
response_dir = pathlib.Path(response[1].decode("utf-8"))
# response_dir == PosixPath('/usr/share/tesseract-ocr')
tessdata = None
for sub_dir in response_dir.iterdir():
    if not sub_dir.is_dir():
        continue  # skip plain files: iterdir() would raise on them
    for sub_sub_dir in sub_dir.iterdir():
        if sub_sub_dir.name.endswith("tessdata"):
            tessdata = str(sub_sub_dir)
            break
    if tessdata:
        break  # stop at the first match instead of scanning every version
# tessdata == '/usr/share/tesseract-ocr/5/tessdata'

Yeah, I know I should set the TESSDATA_PREFIX environment variable anyway, but as the expected 4.0 version of Tesseract OCR is about six years old now and no longer seems to be in the Debian repos, I guess it wouldn't hurt to handle this case (unless 5.0 is unsupported)?

Thanks for developing PyMuPDF! :)

PyMuPDF version

1.24.9

Operating system

Linux

Python version

3.11

Aug 10 '24 13:08 rezemika

MuPDF contains Tesseract 4.0 code to perform the OCR - it is an integral part of the MuPDF binary.

The MuPDF team has stated that release 5.0 behavior is far less stable and predictable than MuPDF's purposes require - the details of this assessment are best discussed with the team directly, e.g. on this Discord channel.

So all that PyMuPDF's OCR actually needs is the tessdata (language support) folder. I cannot say whether a 5.0 tessdata has a format compatible with that of release 4.0. But I would definitely suggest using either the environment variable or the tessdata parameter.

Independently of the aforementioned, we should correct the behavior of the pymupdf function.
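
The suggestion above, to use either the environment variable or the tessdata parameter, could be sketched roughly as follows. The helper resolve_tessdata and its parameters are hypothetical and not part of PyMuPDF; only the TESSDATA_PREFIX variable name comes from this thread.

```python
import os
from pathlib import Path


def resolve_tessdata(explicit=None, environ=None):
    """Return the first existing directory among `explicit` and
    the TESSDATA_PREFIX environment variable, or None.

    Hypothetical helper for illustration; not part of PyMuPDF.
    """
    environ = os.environ if environ is None else environ
    # The explicit argument is checked first, then the environment variable.
    for candidate in (explicit, environ.get("TESSDATA_PREFIX")):
        if candidate and Path(candidate).is_dir():
            return str(candidate)
    return None


# None unless TESSDATA_PREFIX points at an existing directory.
print(resolve_tessdata())
```

The resulting string could then be handed to an OCR call through its tessdata argument, for example page.get_textpage_ocr(tessdata=...), which avoids relying on automatic detection altogether.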

Aug 11 '24 09:08 JorjMcKie

Oh my bad, thanks for these details!

Aug 12 '24 10:08 rezemika

No problem. I made the tesseract installation detector version-independent. But as I said: the MuPDF code is Tesseract 4.00, and I don't know what happens if it is confronted with a version 5 tessdata.
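
For readers wondering what a version-independent detector might look like, here is a minimal sketch. It is not the code that went into PyMuPDF; the function name find_tessdata and the assumed layout <root>/<version>/tessdata are illustrative only.

```python
from pathlib import Path


def find_tessdata(root):
    """Return the first '<root>/<version>/tessdata' directory, or None.

    Illustrative sketch; not the actual PyMuPDF implementation.
    """
    root = Path(root)
    if not root.is_dir():
        return None
    # Sort so the result is deterministic when several versions are installed.
    for version_dir in sorted(root.iterdir()):
        candidate = version_dir / "tessdata"
        if candidate.is_dir():
            return str(candidate)
    return None


print(find_tessdata("/usr/share/tesseract-ocr"))
```

Because the entries are sorted by name, a 4.00 folder would be picked before a 5 folder when both exist, which matches the Tesseract 4 code inside MuPDF.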

Aug 12 '24 10:08 JorjMcKie

Fixed in 1.24.10.

Sep 02 '24 16:09 julian-smith-artifex-com

[>] No problem. I made the tesseract installation detector version-independent. But as I said: the MuPDF code is Tesseract 4.00, and I don't know what happens if it is confronted with a version 5 tessdata.

Unfortunately, errors occur when using custom training data created in jTessBoxEditor with Tesseract 5, even though Tesseract 5 itself recognizes everything perfectly. I get the following errors in mutool draw: Error: LSTM requested, but not present!! Loading tesseract no best word!! no best word!! no best word!! no best word!! ....

I am asking for help. I have words, signs, etc. from training Tesseract, and according to the jTessBoxEditor logs they were added successfully. @JorjMcKie In version 5 of Tesseract, support for the Tesseract 4.0 way of training dictionaries, in which two files (LSTM and LSTMF) are created, has officially been discontinued.

Nov 15 '24 12:11 mstr11

@mstr11 Sorry, MuPDF is the host for PyMuPDF's Tesseract OCR support. The associated code in MuPDF is based on Tesseract version 4. I am afraid you need to contact the MuPDF team to discuss this problem.

Nov 17 '24 21:11 JorjMcKie

Thanks for the information. Unfortunately, the site does not accept the invitation to the chat; it has expired. I was only able to log in just now, and only through a VPN. Due to circumstances beyond my control, I won't be able to leave a message there, but I want to help the project.

Nov 18 '24 05:11 mstr11