
Is always generating a file with 306 bytes

Open caitifbrito opened this issue 9 years ago • 2 comments

Hello, I'm testing this program to convert medical books in Brazilian Portuguese. The books are around 500-700 pages each, at good quality. I set up a dedicated box just for this :) and installed everything pypdfocr needs (tesseract 3.03 and the other requirements). When I run it [1], everything looks fine, but the result is a file with the suffix _ocr.pdf that is only 306 bytes. Its content [2] shows nothing useful.

What could be wrong?

1 - Generating OCR of MyBookInPortuguese.pdf - 227 MegaBytes

root@vagrant-ubuntu-trusty-64:/vagrant# pypdfocr -v -l por MyBookInPortuguese.pdf

Starting conversion of MyBookInPortuguese.pdf
Running pdfimages to figure out DPI...
Using 300 DPI
Detected color
gs -q -dNOPAUSE -sDEVICE=jpeg -dJPEGQ=75 -r300 -sOutputFile="MyBookInPortuguese.pdf - 9ª Ed [ptbr+foto]_%d.jpg" "MyBookInPortuguese.pdf" -c quit
Skipping preprocess step
Checking tesseract version
tesseract -v
Created OCR'ed pdf as MyBookInPortuguese.pdf - 9ª Ed [ptbr+foto]_ocr.pdf
Cleaning up []
Cleaning up []
Cleaning up []
Cleaning up []
Cleaning up []
Completed conversion successfully to MyBookInPortuguese.pdf_ocr.pdf

2 - MyBookInPortuguese.pdf_ocr.pdf - 306 bytes

%PDF-1.3
1 0 obj
>
endobj
2 0 obj
>
endobj
3 0 obj
>
endobj
xref
0 4
0000000000 65535 f 
0000000009 00000 n 
0000000062 00000 n 
0000000102 00000 n 
trailer
>
startxref
151
%%EOF
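A quick way to confirm that an output file like the one above has no pages is to scan it for page objects. This is only a rough sketch: the helper names `count_page_objects` and `looks_empty` are hypothetical, not part of pypdfocr, and the check assumes the PDF stores its page objects uncompressed (it will miss pages packed into compressed object streams).

```python
import re
from pathlib import Path

# Matches page objects ("/Type /Page") but not the page tree ("/Type /Pages").
PAGE_OBJECT = re.compile(rb"/Type\s*/Page(?![A-Za-z])")


def count_page_objects(pdf_path):
    """Count uncompressed page objects in a PDF file (rough heuristic)."""
    data = Path(pdf_path).read_bytes()
    return len(PAGE_OBJECT.findall(data))


def looks_empty(pdf_path):
    """Return True when the scan finds no page objects in the file."""
    return count_page_objects(pdf_path) == 0
```

Calling `looks_empty("MyBookInPortuguese.pdf_ocr.pdf")` after a run would flag a result with no pages, even though the log reports a successful conversion.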

caitifbrito, Feb 15 '17 02:02

Hi. Can't do anything without a test case. Please upload a pdf so I can try to reproduce.

virantha, Feb 15 '17 15:02

Did you install tesseract-data-por (the data files for Portuguese) on your distro?

At least on Arch Linux, the tesseract package supports only English by default. To OCR other languages, you need to install the corresponding tesseract-data- language package for your distro.

For Portuguese support on Arch Linux, run the following command:

# pacman -S tesseract-data-por
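To check whether the Portuguese data is already installed, you can ask tesseract which languages it knows about via its `--list-langs` option. Below is a minimal sketch in Python. The helper names `parse_langs` and `installed_langs` are hypothetical, and the parsing assumes the usual output shape: a "List of available languages" header followed by one language code per line.

```python
import subprocess


def parse_langs(listing):
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in listing.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages (N):" header.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_langs():
    """Run tesseract and return the language codes it reports."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some releases print the list on stderr
        universal_newlines=True,
    )
    return parse_langs(result.stdout)
```

If `"por" in installed_langs()` is false, the `-l por` run has no Portuguese data to work with. On Debian and Ubuntu the language packages follow a different naming scheme (`tesseract-ocr-por` for Portuguese).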

Sorry for my poor English.

DiegoAscanio, Mar 23 '17 00:03