Always generates a 306-byte file
Hello, I'm testing this program to convert medical books in Brazilian Portuguese. The books are around 500-700 pages each, at good quality. I installed everything needed to run pypdfocr (a box dedicated to this :), tesseract 3.03 and the other requirements). When I run it [1], everything looks fine, but the output is a file with the suffix _ocr.pdf that is only 306 bytes. Its content [2] shows nothing useful.
What could be wrong?
1 - Generating OCR of MyBookInPortuguese.pdf - 227 MegaBytes
root@vagrant-ubuntu-trusty-64:/vagrant# pypdfocr -v -l por MyBookInPortuguese.pdf
Starting conversion of MyBookInPortuguese.pdf
Running pdfimages to figure out DPI...
Using 300 DPI
Detected color
gs -q -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=75 -r300 -sOutputFile="MyBookInPortuguese.pdf - 9ª Ed [ptbr+foto]_%d.jpg" "MyBookInPortuguese.pdf" -c quit
Skipping preprocess step
Checking tesseract version
tesseract -v
Created OCR'ed pdf as MyBookInPortuguese.pdf - 9ª Ed [ptbr+foto]_ocr.pdf
Cleaning up []
Cleaning up []
Cleaning up []
Cleaning up []
Cleaning up []
Completed conversion successfully to MyBookInPortuguese.pdf_ocr.pdf
2 - MyBookInPortuguese.pdf_ocr.pdf - 306 bytes
%PDF-1.3
1 0 obj > endobj
2 0 obj > endobj
3 0 obj > endobj
xref
0 4
0000000000 65535 f
0000000009 00000 n
0000000062 00000 n
0000000102 00000 n
trailer >
startxref
151
%%EOF
Hi. Can't do anything without a test case. Please upload a pdf so I can try to reproduce.
Did you install tesseract-data-por (the data files for the Portuguese language) on your distro?
At least on Arch Linux, the tesseract package supports only English by default. If you need other languages, you have to install tesseract-data-
For Portuguese language support on Arch Linux, you'll need to run the following command:
#pacman -S tesseract-data-por
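To confirm whether the Portuguese data is actually visible to tesseract, something like the following might help. It is only a rough sketch: it assumes the installed tesseract supports the `--list-langs` option, and the helper names `parse_langs` and `installed_langs` are made up for this example (they are not part of pypdfocr or tesseract).

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages (N):" header.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.add(line)
    return langs


def installed_langs(tesseract="tesseract"):
    """Run tesseract and return the set of language codes it reports."""
    # Some builds print the list to stderr rather than stdout,
    # so both streams are merged before parsing.
    proc = subprocess.run(
        [tesseract, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return parse_langs(proc.stdout)
```

Calling `installed_langs()` and checking whether `"por"` is in the result shows if the `-l por` option has any data to work with.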
Sorry for my poor English.