langdata issues

Issue 1392: Vietnamese dictionaries

3

https://code.google.com/p/tesseract-ocr/issues/detail?id=1392

What steps will reproduce the problem?
1. Unpack vie.traineddata downloaded from Tesseract repository
2. Run dawg2wordlist on vie.freq-dawg & vie.word-dawg to recover original lists
3. Examine the content

What...

jimregan
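
The reproduction steps in issue 1392 can be sketched as a small Python helper that only builds the command lines and does not run them. The `combine_tessdata -u` and `dawg2wordlist UNICHARSET DAWG WORDLIST` invocations reflect how those tools are commonly called, but treat the exact argument order as an assumption and check it against your installed Tesseract version. The helper name `recovery_commands`, the `unpacked` directory, and the `.txt` output names are hypothetical.

```python
from pathlib import Path


def recovery_commands(traineddata: str, workdir: str = "unpacked") -> list[list[str]]:
    """Build (but do not run) the commands that recover word lists from a traineddata file.

    Assumption: combine_tessdata takes `-u <traineddata> <output prefix>` and
    dawg2wordlist takes `<unicharset> <dawg> <wordlist output>`.
    """
    lang = Path(traineddata).stem          # "vie" for "vie.traineddata"
    prefix = f"{workdir}/{lang}."          # components are written as <prefix><component>

    # Step 1: unpack every component of the traineddata file.
    commands = [["combine_tessdata", "-u", traineddata, prefix]]

    # Step 2: convert each DAWG back into a plain-text word list.
    for kind in ("freq-dawg", "word-dawg"):
        commands.append([
            "dawg2wordlist",
            f"{prefix}unicharset",
            f"{prefix}{kind}",
            f"{prefix}{kind}.txt",         # hypothetical output file name
        ])
    return commands


if __name__ == "__main__":
    for cmd in recovery_commands("vie.traineddata"):
        print(" ".join(cmd))
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` once the `unpacked` directory exists and the Tesseract training tools are on the `PATH`.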

urdu.wordlist

alonehoney

Language Request: Kurdish Sorani (Central Kurdish)

1

There's already a trained data file for the Latin dialect of the Kurdish language. Sorani dialect is the second most used dialect of the language and it'd be amazing to...

makwanbarzan

Hebrew issues

63

Here, I'm going to raise some issues related to Tesseract's Hebrew support. Dear participants who have interest in Arabic support, I suggest to raise Arabic issues in a separate 'issue',...

amitdo

Arabic Numbers

1

AhmadAlhati

Some characters missing from spa.training_text make Tesseract fail to recognize them

2

When running unicharset_extractor on the Spanish langdata, it warns that capital "Ñ", capital "É" and "«" are absent from the training text (while their counterparts, "ñ", "é" and "»", are...

diegodlh
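
The kind of gap described in the spa.training_text report can be checked independently of Tesseract. The following is a minimal sketch, not a reimplementation of unicharset_extractor: it simply lists which characters from a required set never occur in a training text. The function name `missing_characters` and the sample sentence are hypothetical; both inputs are normalized to NFC on the assumption that the training text may mix composed and decomposed accents.

```python
import unicodedata


def missing_characters(training_text: str, required: str) -> list[str]:
    """Return the characters of `required` that never appear in `training_text`.

    Both strings are normalized to NFC first, so that "ñ" written as
    n + combining tilde still counts as the single character "ñ".
    """
    text = unicodedata.normalize("NFC", training_text)
    wanted = unicodedata.normalize("NFC", required)
    present = set(text)
    # Preserve the order of `required` so the report is easy to read.
    return [ch for ch in wanted if ch not in present]


if __name__ == "__main__":
    # Hypothetical sample standing in for the real training text.
    sample = "el niño pequeño tomó café»"
    print(missing_characters(sample, "ÑñÉé«»"))
```

To run this on real data, read the file with `open("spa.training_text", encoding="utf-8").read()` and pass the result as `training_text`.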