tesseract.el icon indicating copy to clipboard operation
tesseract.el copied to clipboard

Emacs integretion for Tesseract OCR

  • Tesseract.el

This package aims to provide an integration of [[https://github.com/tesseract-ocr/tesseract][Tesseract OCR]] in Emacs. The packages is still in early development and by far not feature complete nor free of bugs. I would however appreciate if you test it and give feedback to me, either directly over GitHub or via:

  • Prerequisites

You need of cause =Tesseract= installed on your system with the language packages you need. For PDF support you also need:

  • =ImageMagick=
  • =pdfjam=
  • Installation

Either download tesseract.el to a directory in your load path and put

#+BEGIN_SRC emacs-lisp (require 'tesseract) #+END_SRC

in your Emacs configuration file or if you prefer =use-package=:

#+BEGIN_SRC emacs-lisp (use-package tesseract :quelpa (tesseract :fetcher github :repo SebastianMeisel/tesseract.el) :ensure t) #+END_SRC

** Customization

Tesseract.el provides no keybindings. I recommend the following configuration. =doc-view-scale-internally= should be set to nil. Else some features might not work as expected.

If you use a language other then English (which is the default) you can customize the default language.

#+BEGIN_SRC emacs-lisp (use-package tesseract :quelpa (tesseract :fetcher github :repo SebastianMeisel/tesseract.el) :ensure t :custom (doc-view-scale-internally nil) ;; (tesseract/default-language "deu") :bind (:map dired-mode-map ("T t" . tesseract/dired/marked-to-txt) ("T c" . tesseract/dired/combine-marked-to-pdf) :map doc-view-mode-map ("T p" . tesseract/doc-view/ocr-current-page) ("T a" . tesseract/doc-view/ocr-this-pdf) ("T s" . tesseract/doc-view/ocr-current-slice))) #+END_SRC

#+RESULTS: : t

  • Usage

There is on global command provided:

  • =tesseract-change-language= lets you change the language currently used by any Tesseract command. You will be provided by a selection of choises available on your system. You have to install additional language packages if needed via your system package manager.

At the moment the package provides integration in Dired and Doc-View.

** Dired integration For Dired there are two command:

  1. =tesseract/dired/marked-to-txt= (Bound to ~T t~ in the example configuration). As the name suggests it runs =Tesseract= on each file that is marked (if it is supported) and creates a text file with the same base name. Supported file types are:

  2. =tesseract/dired/combine-marked-to-pdf= (Bound to ~T c~ in the example configuration). It takes a number of image files an combines the into a PDF file with text layer.

#+BEGIN_QUOTE Tesseract uses the Leptonica library to read images in one of these formats:

  • PNG - requires libpng, libz
  • JPEG - requires libjpeg / libjpeg-turbo
  • TIFF - requires libtiff, libz
  • JPEG 2000 - requires libopenjp2
  • GIF - requires libgif (giflib)
  • WebP (including animated WebP) - requires libwebp
  • BMP - no library required~*~
  • PNM - no library required~~ ~ Except Leptonica~ --- https://github.com/tesseract-ocr/tessdoc/blob/main/InputFormats.md #+END_QUOTE

PDF files are also supported by =Tesseract.el=. They are converted by =ImageMagick=.

If you run =tesseract/dired/marked-to-txt= with the ~C-u~ prefix (e.g. ~C-u T~), for all select PDF files a text layer is created in the original PDF instead of a text file.

** DocView integration

In =DocView= you have two commands:

  1. tesseract/doc-view/ocr-current-page (~T p~): Run =Tesseract= on the current page only.
  2. tesseract/doc-view/ocr-current-slice (~T s~): Run =Tesseract= on the slice of the current page, that is currently selected. You can select a slice in DocView by pressing ~c m~ and selecting a region of the current page with the mouse. See: [[info:Emacs#Document View][Info: Document View]].
  3. tesseract/doc-view/ocr-this-pdf (~T a~): Run =Tesseract= on the whole PDF file only. This only works in a PDF files and other situations are not handled correctly at the moment.

Local Variables:

jinx-languages: "en_US"

End: