
Searching math equations

Open karasjoh000 opened this issue 6 years ago • 2 comments

This is not an issue, just a feature request.

It would be awesome if math equations and formulas were converted to LaTeX expressions. Then, when highlighting or searching for a formula, you could search by its LaTeX or copy the LaTeX. I'm not sure how difficult this task is, but it would be very useful.

karasjoh000 avatar Feb 25 '20 03:02 karasjoh000

https://guillaumegenthial.github.io/image-to-latex.html

karasjoh000 avatar Feb 25 '20 03:02 karasjoh000

It is possible to use Tesseract with the pseudo-language "equ" for equation detection. You could run ocrmypdf -l eng+equ for that (English + "equation" language). I have not tried this before, and I don't think it attempts to translate equations to a LaTeX representation. I'd be curious to hear how it works.

It sounds really interesting but I don't think I'd be able to seriously consider a project of that size unless someone were prepared to sponsor it.

jbarlow83 avatar Feb 25 '20 07:02 jbarlow83

The equ language has been removed from Tesseract 4+: https://github.com/tesseract-ocr/tessdata_fast/issues/4#issuecomment-572740279
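
Since equ may or may not be present depending on the Tesseract version and installed traineddata, here is a rough sketch of checking for it before passing -l eng+equ to ocrmypdf. It assumes tesseract and ocrmypdf are on PATH and that tesseract --list-langs prints one language per line after a "List of available languages" header; the helper names (parse_tesseract_langs, installed_langs, build_ocrmypdf_command) and the file names are made up for illustration. It only builds the command and does not try to produce LaTeX.

```python
import re
import shutil
import subprocess

# Hypothetical helper: extract language names from `tesseract --list-langs` output.
# Assumes a "List of available languages ..." header followed by one name per line.
def parse_tesseract_langs(listing):
    langs = set()
    for line in listing.splitlines():
        line = line.strip()
        if not line or line.lower().startswith("list of"):
            continue
        # Keep only tokens that look like language names (e.g. eng, chi_sim, script/Latin).
        if re.fullmatch(r"[A-Za-z0-9_/\-]+", line):
            langs.add(line)
    return langs

# Hypothetical helper: ask the local Tesseract which languages it has.
# Returns an empty set if tesseract is not found on PATH.
def installed_langs():
    if shutil.which("tesseract") is None:
        return set()
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=False,
    )
    # Some versions write the listing to stderr, so look at both streams.
    return parse_tesseract_langs(proc.stdout + "\n" + proc.stderr)

# Hypothetical helper: build the ocrmypdf command, adding "equ" only if available.
def build_ocrmypdf_command(available, input_pdf, output_pdf):
    langs = ["eng"]
    if "equ" in available:
        langs.append("equ")
    return ["ocrmypdf", "-l", "+".join(langs), input_pdf, output_pdf]

if __name__ == "__main__":
    # input.pdf and output.pdf are placeholder file names.
    cmd = build_ocrmypdf_command(installed_langs(), "input.pdf", "output.pdf")
    print(" ".join(cmd))
```

On an install without equ, this falls back to plain -l eng rather than letting ocrmypdf fail on a missing language.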

kc9jud avatar Nov 30 '22 22:11 kc9jud

See also https://github.com/tesseract-ocr/tesseract/issues/3693

kc9jud avatar Nov 30 '22 22:11 kc9jud