Enhance word splitting strategy

Open ManyTheFish opened this issue 3 years ago • 0 comments

Today, the word splitting strategy of the query tree is handled by the function split_best_frequency. This function splits a word into two sub-words, choosing among the possible pairs by looking at the frequency of the less frequent sub-word of each pair.

drawback

However, this frequency computation does not faithfully represent the frequency of the pair: the two sub-words can each be frequent individually without frequently appearing near each other in documents.
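
To make the drawback concrete, here is a minimal sketch of the "less frequent sub-word" heuristic described above. It is not the actual split_best_frequency code: the in-memory map standing in for word_docids, the function name split_by_min_frequency, and the counts used to exercise it are all assumptions made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the `word_docids` database:
/// maps a word to the number of documents containing it.
type WordCounts = HashMap<String, u64>;

/// Scores every two-way split of `word` by the document count of its
/// rarer half and returns the split with the highest score.
/// Splits where either half is unknown (count 0) are ignored.
fn split_by_min_frequency<'a>(
    word: &'a str,
    words: &WordCounts,
) -> Option<(&'a str, &'a str)> {
    let mut best: Option<(u64, &'a str, &'a str)> = None;
    // Skip index 0 so that both halves are always non-empty.
    for (i, _) in word.char_indices().skip(1) {
        let (left, right) = word.split_at(i);
        let left_count = words.get(left).copied().unwrap_or(0);
        let right_count = words.get(right).copied().unwrap_or(0);
        let score = left_count.min(right_count);
        if score > 0 && best.map_or(true, |(b, _, _)| score > b) {
            best = Some((score, left, right));
        }
    }
    best.map(|(_, left, right)| (left, right))
}
```

With made-up counts where "now" and "here" are each more common than "no" and "where", this sketch splits "nowhere" into "now" + "here", regardless of whether those two words ever occur next to each other in a document.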

possible enhancement

Inspired by the index.rs#word_documents_count method, a new method word_pair_frequency could be implemented in the trait search/query_tree.rs#Context using the word_pair_proximity_docids database instead of the word_docids one.
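
A rough sketch of what this could look like, under stated assumptions: the Context trait below is a simplified, hypothetical stand-in for the real one, the word_pair_frequency signature is a guess, and an in-memory map replaces the word_pair_proximity_docids database (the real lookup would also have to pick a proximity, which is left out here).

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the `Context` trait.
trait Context {
    /// Assumed meaning: number of documents in which `left` appears
    /// close to (here: right before) `right`.
    fn word_pair_frequency(&self, left: &str, right: &str) -> u64;
}

/// In-memory stand-in for the `word_pair_proximity_docids` database.
struct InMemoryContext {
    pair_counts: HashMap<(String, String), u64>,
}

impl Context for InMemoryContext {
    fn word_pair_frequency(&self, left: &str, right: &str) -> u64 {
        self.pair_counts
            .get(&(left.to_string(), right.to_string()))
            .copied()
            .unwrap_or(0)
    }
}

/// Scores every two-way split of `word` by how often the two halves
/// occur together, and returns the split with the highest score.
fn split_best_pair_frequency<'a>(
    ctx: &impl Context,
    word: &'a str,
) -> Option<(&'a str, &'a str)> {
    let mut best: Option<(u64, &'a str, &'a str)> = None;
    // Skip index 0 so that both halves are always non-empty.
    for (i, _) in word.char_indices().skip(1) {
        let (left, right) = word.split_at(i);
        let score = ctx.word_pair_frequency(left, right);
        if score > 0 && best.map_or(true, |(b, _, _)| score > b) {
            best = Some((score, left, right));
        }
    }
    best.map(|(_, left, right)| (left, right))
}
```

With made-up pair counts where "no where" co-occurs far more often than "now here", this version splits "nowhere" into "no" + "where", even if "now" and "here" are individually more frequent.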

small warnings

Files expected to be modified

ManyTheFish, Sep 12 '22 12:09