stopwords feature
Hello,
I am thinking about adding stopword support to your tokenizer implementation.
Specifically, I want to 1/ use unicodesn, 2/ keep stopwords out of the index, and 3/ use the sqlite3 snippet() function on the original full text, with the matches correctly emphasized.
Since sqlite3 cannot "chain" tokenizers, I want to make unicodesn stopword-aware. For an example, see https://github.com/abhinav-upadhyay/apropos_replacement/commit/76b45695f2962921f70a946bef04129a670ec04d where they did something similar for the porter tokenizer.
Can you help me choose a strategy for implementing this in unicodesn that you would accept upstream?
One possibility would be to add a lowercase stopword list to each header file in libstemmer_c/src_c.
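To make the idea concrete, here is a minimal sketch of what such a per-language list and its lookup could look like. Everything in it is hypothetical: the names `kStopwordsEn`, `token_key`, `cmp_token` and `is_stopword` are mine, the word list is only a placeholder, and nothing here is taken from unicodesn or libstemmer_c. I assume the tokenizer would call the lookup on the lowercased token, given as a pointer plus a byte length, and simply move on to the next token when it matches. The start and end offsets of the tokens that are kept would stay untouched, so snippet() should still work on the original text.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical English stopword list (placeholder content).
 * It must stay sorted in strcmp() order, because bsearch() relies on it. */
static const char *const kStopwordsEn[] = {
    "a", "an", "and", "of", "the", "to"
};

/* A token as a tokenizer typically hands it out: a pointer and a byte
 * length, not necessarily NUL-terminated. */
struct token_key {
    const char *text;
    size_t len;
};

/* bsearch() comparator: compares the token against one list entry. */
static int cmp_token(const void *key, const void *elem)
{
    const struct token_key *k = key;
    const char *word = *(const char *const *)elem;
    int c = strncmp(k->text, word, k->len);
    if (c != 0)
        return c;
    /* The first k->len bytes match: equal only if the entry ends here. */
    return word[k->len] == '\0' ? 0 : -1;
}

/* Returns 1 if the (already lowercased) token is in the list, 0 otherwise. */
static int is_stopword(const char *token, size_t len)
{
    struct token_key key;
    assert(token != NULL);
    key.text = token;
    key.len = len;
    return bsearch(&key, kStopwordsEn,
                   sizeof kStopwordsEn / sizeof kStopwordsEn[0],
                   sizeof kStopwordsEn[0], cmp_token) != NULL;
}
```

Whether the check should run before or after stemming, and where the lists should live, is exactly what I would like your opinion on.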
Thanks for your help.