Tokenizer Internationalization - French
We should test whether the EnglishTokenizer implementation is sufficient for French and, if not, add an additional tokenizer. EnglishTokenizer is based on the Porter stemmer.
For words such as "j'ajoute", I would like "ajoute" to be a word (a keyword, actually), but it doesn't work.
I think a French tokenizer would be pretty similar to the English one except for this apostrophe (quote) rule, which has exceptions such as "aujourd'hui".
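To make the apostrophe rule concrete, here is a rough sketch of what such a split could look like. It is not the adapt EnglishTokenizer and does not use any adapt API; the function name `tokenize_fr`, the list of elided prefixes, and the exception list are all assumptions made up for illustration, and both lists are certainly incomplete.

```python
import re

# Hypothetical list of elided clitics that precede an apostrophe (not exhaustive).
ELIDED_PREFIXES = {"j", "l", "d", "m", "t", "s", "n", "c", "qu"}

# Hypothetical list of words whose apostrophe belongs to the word itself (not exhaustive).
APOSTROPHE_EXCEPTIONS = {"aujourd'hui"}

# A word is a run of letters, optionally joined by straight or curly apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def tokenize_fr(text):
    """Split text into lowercase tokens, detaching elided clitics such as "j'"."""
    tokens = []
    for match in WORD_RE.finditer(text.lower()):
        # Normalize curly apostrophes so lookups use a single form.
        word = match.group(0).replace("’", "'")
        if "'" not in word or word in APOSTROPHE_EXCEPTIONS:
            tokens.append(word)
            continue
        prefix, rest = word.split("'", 1)
        if prefix in ELIDED_PREFIXES:
            # Keep the clitic as its own token so the following word can match as a keyword.
            tokens.extend([prefix + "'", rest])
        else:
            tokens.append(word)
    return tokens
```

With this sketch, `tokenize_fr("J'ajoute un mot")` yields "ajoute" as a separate token, while "aujourd'hui" stays in one piece because it is in the exception list.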
I know this is really old, but I'm curious whether this fits into the "normalize()" approach I've implemented in English. Essentially, I do a pre-pass on the text that does things like convert "it's" to "it is", which simplifies parsing.
Does it make sense to do a French normalize() preprocessor that converts things like "j'amie" to "je amie"? This would live in: https://github.com/MycroftAI/mycroft-core/blob/dev/mycroft/util/lang/parse_fr.py#L1027
Hello,
While "it's" and "it is" are both valid in English, sadly "je aime" is not valid in French.
(And by the way, "amie" is not a verb; it means "friend".)
C'est la vie! There is a reason I shouldn't be the one implementing the French parsers. :)