langdata

Add Filipino lang

Open JohnHenryGaspay opened this issue 8 years ago • 7 comments

Would it be possible to add support for the Filipino language, which is used here in the Philippines?

JohnHenryGaspay avatar Aug 06 '17 13:08 JohnHenryGaspay

@theraysmith I've noticed that Filipino is missing from the list of language data.

JohnHenryGaspay avatar Aug 06 '17 13:08 JohnHenryGaspay

https://github.com/tesseract-ocr/tessdata/raw/master/best/fil.traineddata

amitdo avatar Aug 06 '17 13:08 amitdo

My training text corpus does not distinguish between fil and tgl, while they show up in ISO-639-2T as distinct. For some reason that I can't remember now, the language code has switched from tgl to fil in the "best" models that I pushed recently.

Does the fil language do what you want? If not, please try to explain why. You could also try Latin, which attempts to cover all Latin-based languages.

theraysmith avatar Aug 08 '17 00:08 theraysmith

@amitdo I've tried adding it to the language folders, but when I select fil as the language the app always shuts down.

JohnHenryGaspay avatar Aug 18 '17 09:08 JohnHenryGaspay

@theraysmith Yes, our national language here in the Philippines is Filipino (fil), and Tagalog (tgl) is the old name for it. I've tried Latin, but it's not working.

JohnHenryGaspay avatar Aug 18 '17 09:08 JohnHenryGaspay

I tested just now with both best/fil and tgl (4.00.00alpha traineddata files), and they work with tesseract built from the latest GitHub code.

 tesseract fil-test.png fil-test-best-fil --oem 1 --psm 6 -l best/fil --tessdata-dir ../

 tesseract fil-test.png fil-test-tgl --oem 1 --psm 6 -l tgl --tessdata-dir ../

Files attached. To me, best/fil seems more accurate. The test image is a snapshot I took from a tgl Wikipedia page.

fil-test-tgl.txt fil-test-best-fil.txt fil-test
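
One way to make "seems more accurate" measurable is to score each output file against a hand-typed reference transcription of the test image. The sketch below is only an illustration: the `similarity` helper is hypothetical, no reference transcription was posted in this thread, and a character-level ratio from Python's standard `difflib` is a rough proxy rather than a proper OCR metric.

```python
from difflib import SequenceMatcher


def similarity(ocr_text, reference_text):
    """Return a 0.0-1.0 character-level similarity between OCR output and a reference.

    Whitespace runs are collapsed first so that line-break differences
    between the two texts do not dominate the score.
    """
    a = " ".join(ocr_text.split())
    b = " ".join(reference_text.split())
    # autojunk=False disables difflib's heuristic that ignores very frequent
    # characters in long inputs, which would skew scores on full pages.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()
```

To compare the two models, read fil-test-best-fil.txt and fil-test-tgl.txt, call `similarity` on each with the same reference text, and prefer the model with the higher score.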

Shreeshrii avatar Aug 18 '17 09:08 Shreeshrii

> I've tried adding it to the language folders but when selecting fil as language the app always shut down.

You should try running Tesseract from the command-line.
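
If a wrapper app just shuts down, running the command-line tool and capturing its error output usually shows what went wrong. Here is a minimal sketch in Python that reuses the flags from the commands earlier in this thread. The helper names are hypothetical, and it assumes that `-l best/fil` resolves to `best/fil.traineddata` under the directory given to `--tessdata-dir`, as in the test above.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image, out_base, lang, tessdata_dir):
    """Return the argument list for one Tesseract run (same flags as in this thread)."""
    return [
        "tesseract", str(image), str(out_base),
        "--oem", "1",
        "--psm", "6",
        "-l", lang,
        "--tessdata-dir", str(tessdata_dir),
    ]


def has_traineddata(tessdata_dir, lang):
    """Check that <tessdata_dir>/<lang>.traineddata exists (lang may be 'best/fil')."""
    return (Path(tessdata_dir) / f"{lang}.traineddata").is_file()


def run_ocr(image, out_base, lang, tessdata_dir):
    """Run Tesseract and return (exit_code, stderr) so a failure is visible, not silent."""
    if shutil.which("tesseract") is None:
        return None, "tesseract executable not found on PATH"
    if not has_traineddata(tessdata_dir, lang):
        return None, f"missing {lang}.traineddata under {tessdata_dir}"
    result = subprocess.run(
        build_tesseract_cmd(image, out_base, lang, tessdata_dir),
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stderr
```

For example, `run_ocr("fil-test.png", "fil-test-best-fil", "best/fil", "../")` returns the exit code together with whatever Tesseract printed to stderr, which is the information that gets lost when an app simply closes.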

amitdo avatar Aug 18 '17 10:08 amitdo