rocketdocket
rocketdocket copied to clipboard
The fastest, cleanest, most reproducible ways to OCR a document.
ROCKETDOCKET
OCRing a PDF requires a few steps if you'd like to do it in parallel.
Currently, we're taking in a single large PDF and producing a single text file for each page.
Decisions should be made about what to do with those text files. Put 'em in Elasticsearch? Combine them back into a giant PDF? Both? Neither?
Ghostscript
time gs\
-o file-%05d.png\
-sDEVICE=pngmono\
-dNumRenderingThreads=$(sysctl -n hw.ncpu)\
-dBandHeight=100\
-dBufferSpace=1000000000\
-dBandBufferSpace=500000000\
-sBandListStorage=memory\
-dBATCH\
-dNOPAUSE\
-dNOGC\
-r72\
original.pdf
Explanation
-o file-%05d.png: Outputs file names likefile-00405.png, one for each page.-dNumRenderingThreads=$(sysctl -n hw.ncpu): Makes a thread for every CPU (or virtual CPU) your computer claims to have.-sDEVICE=pngmono: The fastest output, in my testing, is with thepngmonoengine, which is a black-and-white PNG with no transparency. PNG is also the fastest for Tesseract to process.-dBandHeight=100 -dBufferSpace=1000000000 -dBandBufferSpace=500000000 -sBandListStorage=memory: Sets up a large amount of memory for creating PNGs.-dBATCH -dNOPAUSE -dNOGC: From the internet, things you can turn off to increase speed.-r72: Produces a 72-dpi PNG image. Smaller numbers here are faster to generate, but OCR quality decreases quite a bit as you descend below this.
Tesseract
time ls *.png | xargs -n 1 -P $(sysctl -n hw.ncpu) ./ocr.sh
Explanation
ls *.png: Produces output where each filename is on a single line.| xargs: Takes the output ofls *.pngand feeds it toxargs-n 1: Each line contains something we'd like to process, aka, a PNG file path.-P $(sysctl -n hw.ncpu):xargsshould use one process per CPU (or virtual CPU) your computer claims to have../ocr.sh: Runs whatever is inocr.shwith the name of the PNG file as the argument.