stopwords feature

Open jeromew opened this issue 12 years ago • 0 comments

Hello,

I am thinking about adding a stopwords feature inside your tokenizer implementation.

Specifically, I want to 1/ use unicodesn, 2/ keep stopwords out of the index, and 3/ use the sqlite3 snippet() function with the original full text correctly highlighted.

Since sqlite3 cannot chain tokenizers, I want to make unicodesn stopword-aware. For an example, see https://github.com/abhinav-upadhyay/apropos_replacement/commit/76b45695f2962921f70a946bef04129a670ec04d where they did something similar for the porter tokenizer.

Can you help me choose a strategy for implementing this in unicodesn that you would accept upstream?

One possibility would be to add a lowercase stopword list to each header file in libstemmer_c/src_c.
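
To make the idea concrete, here is a rough sketch of what I have in mind. It is not written against the unicodesn sources: every name in it (english_stopwords, is_stopword, inner_next, next_indexed, the token and cursor structs) is hypothetical, and inner_next is only a space-splitting stand-in for the real tokenizer. The point is that a stopword is skipped before it is handed to the index, while the tokens that are kept still carry byte offsets into the original text, which is what snippet() needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-language table: lowercase and sorted bytewise,
 * so that it can be searched with bsearch(). */
static const char *const english_stopwords[] = {
    "a", "an", "and", "in", "of", "the", "to"
};
#define N_STOPWORDS (sizeof english_stopwords / sizeof english_stopwords[0])

/* A token is a (pointer, length) pair; it is not NUL-terminated. */
struct key { const char *s; size_t n; };

static int cmp_key(const void *k, const void *elem)
{
    const struct key *key = k;
    const char *word = *(const char *const *)elem;
    int c = strncmp(key->s, word, key->n);
    if (c != 0)
        return c;
    /* First n bytes match: equal only if the table word ends here too. */
    return word[key->n] == '\0' ? 0 : -1;
}

static int is_stopword(const char *tok, size_t n)
{
    struct key key;
    assert(tok != NULL);
    key.s = tok;
    key.n = n;
    return bsearch(&key, english_stopwords, N_STOPWORDS,
                   sizeof english_stopwords[0], cmp_key) != NULL;
}

struct token  { const char *text; size_t len, start, end; int position; };
struct cursor { const char *input; size_t off; int position; };

/* Stand-in for the real tokenizer: splits on ASCII spaces and assumes
 * the input is already lowercase. Returns 0 at end of input. */
static int inner_next(struct cursor *c, struct token *t)
{
    while (c->input[c->off] == ' ')
        c->off++;
    if (c->input[c->off] == '\0')
        return 0;
    t->start = c->off;
    while (c->input[c->off] != '\0' && c->input[c->off] != ' ')
        c->off++;
    t->end  = c->off;
    t->text = c->input + t->start;
    t->len  = t->end - t->start;
    return 1;
}

/* Stopword-aware wrapper: drops stopwords but leaves the start/end
 * offsets of the remaining tokens untouched. Positions are counted
 * over kept tokens only, which is a choice made for this sketch. */
static int next_indexed(struct cursor *c, struct token *t)
{
    while (inner_next(c, t)) {
        if (!is_stopword(t->text, t->len)) {
            t->position = c->position++;
            return 1;
        }
    }
    return 0;
}
```

In unicodesn the check would have to run on the lowercased token, and whether it should run before or after stemming is part of what I am asking about.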

Thanks for your help

Nov 05 '13 19:11 jeromew