
International Support

wbashir opened this issue 12 years ago · 10 comments

Any plans for international support? I guess I am trying to use the tokenizer to parse Arabic words.

wbashir avatar Jun 24 '13 20:06 wbashir

I'm always looking for people to contribute algorithms pertaining to non-English languages. In the fall I hope to really ramp up this effort, but it will involve getting new people involved with the project.

chrisumbel avatar Aug 15 '13 16:08 chrisumbel

I would also need international support and might contribute to this.

My problem is that the tokenizer skips all accented characters, so a quick fix for me is to update the regex it uses.
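
For concreteness, a minimal sketch of that quick fix, assuming a tokenizer built on a plain ASCII character class (the actual regex in natural may differ):

```js
// Sketch of the "widen the regex" fix described above; not natural's
// actual source. \u00C0-\u017F covers Latin-1 Supplement and Latin
// Extended-A, i.e. à, é, ç, ã, ß and friends.
function tokenize(text) {
  return text.match(/[a-zA-Z0-9\u00C0-\u017F]+/g) || [];
}

console.log(tokenize('Je préfère le café.'));
// -> [ 'Je', 'préfère', 'le', 'café' ]
```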

@chrisumbel, how would you rather fix this issue? Perhaps you have another implementation in mind; please let me know.

mef avatar Oct 15 '13 14:10 mef

My plans thus far were to ultimately break the modules up into language folders where applicable: something like lib/stemmers/en, lib/stemmers/jp, lib/stemmers/fr.

Since certain classes of algorithms, like string comparison/distance, aren't language-specific, they would remain as is.

Everything will still reside in the natural project. Make sense or is that silly?

chrisumbel avatar Nov 01 '13 12:11 chrisumbel

Indeed, FR support would be great. It could be a great kick-start for a chatterbot. I'd like to test it in a module for SARAH (http://sarah.encausse.net).

Is there a list of projects using natural?

JpEncausse avatar Jan 08 '14 13:01 JpEncausse

I'm also interested in international support. Brazilian Portuguese here... I need the tokenizer not to skip things like ã, ó, ê, ç and so on...

My knowledge of NLP is very limited (I'm not familiar with all these terminologies... I only get as far as "tokenizer", lol), so I would be happy to contribute to this project, but I would need some guidance on how to start... Like, what do I have to touch/modify to get this done?

A very basic tutorial for rookies would be nice. Like: "A stemmer is a thing that does this, a tokenizer does that, a classifier...."

Count me in to help grow this project

lfilho avatar Feb 24 '14 19:02 lfilho

@lfilho The tokenizer would be a pretty good place to start; a lot of other pieces rely on it.

Take a look here for a basic idea about tokenizers. In a nutshell, the goal is to take some text and produce a list of 'tokens', or words in most NLP cases.
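
A quick illustration with the stock WordTokenizer (usage adapted from the project README):

```js
var natural = require('natural');
var tokenizer = new natural.WordTokenizer();

// Text in, array of tokens out.
console.log(tokenizer.tokenize('your dog has fleas.'));
// -> [ 'your', 'dog', 'has', 'fleas' ]
```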

You can see here that we have a few tokenizers in different languages (although if you look here you'll see they may be under-covered by unit tests), so they might be a helpful reference when creating your tokenizer.

EDIT: I would start with the aggressive tokenizer; it doesn't require much modification since it's not super language-dependent. Also, there are some already built for other languages to give you an idea of the naming conventions we're using.
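
A usage sketch, assuming the per-language exports follow the AggressiveTokenizer naming pattern (check the tokenizers folder for the exact names):

```js
var natural = require('natural');

// Base aggressive tokenizer: splits on runs of non-word
// characters and drops punctuation.
var en = new natural.AggressiveTokenizer();
console.log(en.tokenize('my dog hates the postman!'));
// -> [ 'my', 'dog', 'hates', 'the', 'postman' ]

// Language variants expose the same interface, e.g. Spanish:
var es = new natural.AggressiveTokenizerEs();
console.log(es.tokenize('hola, ¿cómo estás?'));
// ideally -> [ 'hola', 'cómo', 'estás' ]
```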

Feel free to ask if you have any questions. -Ken

kkoch986 avatar Feb 24 '14 20:02 kkoch986

So there we go. I did the tokenizer. Since I'm here: I don't think the Spanish one is working; it suffers from the same problem I mentioned here with diacritic chars...

I'm also doing a new pull request shortly to add jasmine-node as a dev dependency.

lfilho avatar Feb 24 '14 21:02 lfilho

Hello, do you know how to stop the tokenizer from splitting words in foreign languages, so that fußball stays fußball and does not become fu s ball?

deemeetree avatar May 09 '14 11:05 deemeetree

@deemeetree answered in #152
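
For readers who don't want to chase the cross-reference, the general shape of the fix is the same regex widening discussed earlier in this thread (a sketch, not necessarily what #152 settled on):

```js
// ß is U+00DF, inside the \u00C0-\u017F accent range, so it survives:
function tokenizeDe(text) {
  return text.match(/[a-zA-Z0-9\u00C0-\u017F]+/g) || [];
}

console.log(tokenizeDe('fußball spielen'));
// -> [ 'fußball', 'spielen' ]
```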

kkoch986 avatar May 09 '14 13:05 kkoch986

I think that for multilingual support you need to separate logic from content. For the natural library this means that there are algorithms and there are configurations. For instance, most tokenizers depend on regular expressions to split a sentence: develop one tokenization algorithm (maybe more are needed) and provide the expressions per language in a separate content folder (or repo). When you create a tokenizer, you configure it with language-specific content/rules/etc.

Likewise, the Brill POS tagger is already separated into algorithm and transformation rules: in the brill_pos_tagger folder you find a lib folder with the algorithm and a data folder with rules for English and Dutch. Parsers can be done similarly.

This approach avoids creating a myriad of language-specific code files.
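
A minimal sketch of that split (names are hypothetical, not natural's actual structure):

```js
// One generic algorithm, configured with per-language content.
function Tokenizer(config) {
  this.pattern = new RegExp(config.tokenPattern, 'g');
}

Tokenizer.prototype.tokenize = function (text) {
  return text.match(this.pattern) || [];
};

// The language-specific part is pure data, e.g. loaded from a
// content folder; no per-language code files needed.
var configs = {
  en: { tokenPattern: '[a-zA-Z0-9]+' },
  pt: { tokenPattern: '[a-zA-Z0-9\\u00C0-\\u017F]+' }
};

var pt = new Tokenizer(configs.pt);
console.log(pt.tokenize('São Paulo é ótima'));
// -> [ 'São', 'Paulo', 'é', 'ótima' ]
```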

Hugo

Hugo-ter-Doest avatar Mar 18 '18 13:03 Hugo-ter-Doest