
Please consider adding Mandarin language

Open iszhi opened this issue 2 years ago • 5 comments

I also have a lot of documents written in Mandarin. Can you add support for it too?

iszhi avatar Apr 07 '23 05:04 iszhi

I'm not against it at all, but it is not really doable for me, since I have zero knowledge of Mandarin. The NLP processors don't support it afaik, but tesseract (the tool doing the OCR) supports both Traditional and Simplified Chinese. I don't know if that would help?

For date recognition I would need a PR or at the very least all the info from here
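To illustrate the kind of info date recognition would need, here is a minimal Python sketch. It is not docspell code. The function name `find_chinese_dates` and the pattern are assumptions made up for this example. It covers only dates written with digits plus 年/月/日 (or 号/號 for the day) and does not handle dates written with Chinese numerals.

```python
import re
from datetime import date

# Hypothetical pattern: a 4-digit year, then month and day, each followed by
# its marker character: 年 (year), 月 (month), 日 / 号 / 號 (day).
_CN_DATE = re.compile(r"(\d{4})\s*年\s*(\d{1,2})\s*月\s*(\d{1,2})\s*[日号號]")


def find_chinese_dates(text: str) -> list[date]:
    """Return all valid calendar dates of the form 2023年4月7日 found in text."""
    found = []
    for year, month, day in _CN_DATE.findall(text):
        try:
            found.append(date(int(year), int(month), int(day)))
        except ValueError:
            # Matched the pattern but is not a real date (e.g. month 13); skip it.
            continue
    return found
```

For example, `find_chinese_dates("合同签订于2023年4月7日。")` would yield a single `date(2023, 4, 7)`. A real implementation would also need to cover the other common formats, which is why the full list of formats is needed.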

eikek avatar Apr 07 '23 08:04 eikek

@eikek Since the NLP processors don't support Mandarin, can you add it via tesseract? (PS: I don't know much about either NLP or tesseract.)

iszhi avatar Apr 07 '23 10:04 iszhi

I think tesseract has support for Simplified and Traditional Chinese. Which one is better? It is possible to add it to the docker image and add a language option to the UI.
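For reference, tesseract's language codes for the two scripts are `chi_sim` and `chi_tra`, and several languages can be combined with `+`. Below is a rough Python sketch of calling the tesseract CLI with both. The helper names `tesseract_command` and `ocr` are made up for this example and this is not how docspell invokes tesseract. It assumes the `tesseract` binary and both language data files are installed.

```python
import shutil
import subprocess

# Tesseract language codes for the two Chinese scripts.
CHINESE_LANGS = {"simplified": "chi_sim", "traditional": "chi_tra"}


def tesseract_command(image_path, scripts=("simplified", "traditional")):
    """Build the argv for OCR-ing image_path, writing the text to stdout."""
    langs = "+".join(CHINESE_LANGS[s] for s in scripts)
    return ["tesseract", image_path, "stdout", "-l", langs]


def ocr(image_path, scripts=("simplified", "traditional")):
    """Run tesseract and return the recognized text (hypothetical helper)."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract binary not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, scripts),
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return result.stdout
```

Passing only `("simplified",)` or only `("traditional",)` restricts OCR to a single script, which is roughly what a per-language option in the UI would select.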

eikek avatar Apr 10 '23 18:04 eikek

Simplified Chinese is used in mainland China, and Traditional Chinese is used in Taiwan and Hong Kong. Simplified Chinese has the larger user base, but if possible, I recommend installing both.

iszhi avatar Apr 10 '23 18:04 iszhi

Regarding the point that the NLP processors don't support it: Stanford CoreNLP supports (mainland) Chinese.

Stanford CoreNLP [backup download page]: "An integrated suite of natural language processing tools for English, Spanish, and (mainland) Chinese in Java, including tokenization, part-of-speech tagging, named entity recognition, parsing, and coreference."

kxu1988 avatar Mar 12 '24 14:03 kxu1988