Tokenizer Internationalization - French

Open clusterfudge opened this issue 10 years ago • 4 comments

We should test whether the EnglishTokenizer implementation is sufficient for French and, if not, add a separate tokenizer. EnglishTokenizer is based on the Porter stemmer.

clusterfudge avatar Jan 08 '16 17:01 clusterfudge

For words such as "j'ajoute", I would like "ajoute" to be a word (a keyword actually) but it doesn't work.

I think a French tokenizer would be pretty similar to the English one except for this apostrophe (elision) rule, which has exceptions such as "aujourd'hui".
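As a rough illustration of the rule described above, here is a minimal sketch of a whitespace tokenizer that splits an elided clitic from the word that follows it. The function name `tokenize_fr`, the prefix list, and the exception set are all assumptions made for this example; none of them come from adapt, and both lists are far from exhaustive.

```python
# Hypothetical exception set: forms where the apostrophe belongs to the word
# itself and the token should stay whole. Illustrative only.
ELISION_EXCEPTIONS = {"aujourd'hui", "d'accord"}

# Assumed list of clitics that elide before a vowel (je, le/la, de, me, te,
# se, ne, ce, que). Not exhaustive.
ELIDED_PREFIXES = ("j", "l", "d", "m", "t", "s", "n", "c", "qu")


def tokenize_fr(text):
    """Split text on whitespace, then split elided clitics such as "j'"."""
    tokens = []
    # Lowercase and map the typographic apostrophe to the ASCII one.
    for raw in text.lower().replace("\u2019", "'").split():
        word = raw.strip(".,;:!?\"()")
        if not word:
            continue
        # Keep listed exceptions and apostrophe-free words untouched.
        if word in ELISION_EXCEPTIONS or "'" not in word:
            tokens.append(word)
            continue
        prefix, rest = word.split("'", 1)
        if prefix in ELIDED_PREFIXES and rest:
            # "j'ajoute" becomes "j'" and "ajoute".
            tokens.extend([prefix + "'", rest])
        else:
            tokens.append(word)
    return tokens
```

With this sketch, `tokenize_fr("J'ajoute du lait aujourd'hui")` yields `["j'", "ajoute", "du", "lait", "aujourd'hui"]`, so "ajoute" is available as a keyword while "aujourd'hui" stays intact.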

gcrieloue-main avatar Mar 11 '16 23:03 gcrieloue-main

I know this is really old, but I'm curious whether this fits into the "normalize()" approach I've implemented for English. Essentially, I do a pre-pass on the text that does things like convert "it's" to "it is", which simplifies parsing.

Does it make sense to do a French normalize() preprocessor that converts things like "j'amie" to "je amie"? This would live in: https://github.com/MycroftAI/mycroft-core/blob/dev/mycroft/util/lang/parse_fr.py#L1027
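For readers unfamiliar with the idea, the English pre-pass mentioned above could look roughly like the following. This is only a sketch: the name `normalize_en`, the lowercasing step, and the contraction table are assumptions for illustration and are not taken from the mycroft-core source.

```python
# Illustrative contraction table; not the table used by mycroft-core.
CONTRACTIONS = {
    "it's": "it is",
    "i'm": "i am",
    "don't": "do not",
    "can't": "can not",
}


def normalize_en(text):
    """Lowercase the text and expand known contractions word by word."""
    words = text.lower().split()
    # Words not in the table pass through unchanged.
    return " ".join(CONTRACTIONS.get(word, word) for word in words)
```

For example, `normalize_en("It's late")` returns `"it is late"`, which a downstream parser can handle without knowing about contractions.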

penrods avatar Mar 15 '18 17:03 penrods

Hello,

While "it's" and "it is" are both valid in English, sadly "je aime" is not valid in French.

(And by the way, "amie" is not a verb; it means "friend".)

gcrieloue-main avatar Mar 15 '18 17:03 gcrieloue-main

C'est la vie! There is a reason I shouldn't be the one implementing the French parsers. :)

penrods avatar Mar 16 '18 07:03 penrods