adapt Tokenizer Internationalization

We should test whether the EnglishTokenizer implementation is sufficient for French and, if not, add an additional tokenizer. EnglishTokenizer is based on the Porter stemmer.

Jan 08 '16 17:01 clusterfudge

For words such as "j'ajoute", I would like "ajoute" to be recognized as a word (a keyword, actually), but it doesn't work.

I think a French tokenizer would be pretty similar to the English one except for this apostrophe rule (which has exceptions, such as words like "aujourd'hui").
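To make the apostrophe rule concrete, here is a rough sketch of what such a split could look like. It is not adapt's tokenizer API: `tokenize_fr`, `ELIDED_PREFIXES`, and `ELISION_EXCEPTIONS` are hypothetical names, and both lists are assumed and far from exhaustive.

```python
import re

# Hypothetical exception list: the apostrophe is part of the word itself.
ELISION_EXCEPTIONS = {"aujourd'hui", "quelqu'un", "presqu'île"}

# Assumed (non-exhaustive) list of elided forms that attach to the next word.
ELIDED_PREFIXES = ("j", "l", "d", "m", "t", "s", "n", "c", "qu")

# Matches an elided prefix, a straight or curly apostrophe, then the rest.
_ELISION_RE = re.compile(
    r"^(%s)['’](.+)$" % "|".join(ELIDED_PREFIXES), re.IGNORECASE
)


def tokenize_fr(text):
    """Split on whitespace, then detach elided prefixes such as "j'"."""
    tokens = []
    for raw in text.split():
        # Strip surrounding punctuation; inner apostrophes are kept.
        word = raw.strip(".,;:!?\"()")
        if not word:
            continue
        # Leave known exceptions untouched.
        if word.lower().replace("’", "'") in ELISION_EXCEPTIONS:
            tokens.append(word)
            continue
        match = _ELISION_RE.match(word)
        if match:
            tokens.append(match.group(1) + "'")  # e.g. "j'"
            tokens.append(match.group(2))        # e.g. "ajoute"
        else:
            tokens.append(word)
    return tokens


print(tokenize_fr("j'ajoute une note aujourd'hui"))
```

With this split, "ajoute" becomes its own token and can be matched as a keyword, while "aujourd'hui" stays whole. A real implementation would need a much longer exception list.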

Mar 11 '16 23:03 gcrieloue-main

I know this is really old, but I'm curious whether this fits into the "normalize()" approach I've implemented for English. Essentially, I do a pre-pass on the text that converts things like "it's" to "it is", which simplifies parsing.
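A minimal sketch of that kind of pre-pass, for illustration only: `normalize_en` and the `CONTRACTIONS` table are hypothetical and are not the actual mycroft-core code.

```python
# Hypothetical contraction table; the real normalize() may use a different one.
CONTRACTIONS = {
    "it's": "it is",  # simplification: "it's" can also mean "it has"
    "don't": "do not",
    "i'm": "i am",
}


def normalize_en(text):
    """Lowercase the text and expand known contractions word by word."""
    words = [CONTRACTIONS.get(word, word) for word in text.lower().split()]
    return " ".join(words)


print(normalize_en("It's raining"))
```

The expansion happens before tokenizing, so the parser only ever sees the expanded forms.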

Does it make sense to do a French normalize() preprocessor that converts things like "j'amie" to "je amie"? This would live in: https://github.com/MycroftAI/mycroft-core/blob/dev/mycroft/util/lang/parse_fr.py#L1027

Mar 15 '18 17:03 penrods

Hello,

While "it's" and "it is" are both valid in English, sadly "je aime" is not valid in French.

(And btw, "amie" is not a verb; it means "friend".)

Mar 15 '18 17:03 gcrieloue-main

C'est la vie! There is a reason I shouldn't be the one implementing the French parsers. :)

Mar 16 '18 07:03 penrods

Tokenizer Internationalization - French