Enhance word splitting strategy

Open ManyTheFish opened this issue 3 years ago • 0 comments

Today, the word splitting strategy of the query tree is handled by the function split_best_frequency. This function splits a word into two sub-words by looking, for each possible pair, at the frequency of its less frequent sub-word.

drawback

However, this frequency computation doesn't faithfully represent the frequency of the pair: the two sub-words can each be frequent on their own without frequently appearing near each other in documents.
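
To make the drawback concrete, here is a minimal sketch of the min-frequency heuristic described above. It is not the actual split_best_frequency code: a HashMap stands in for the word_docids database, and the name split_by_min_frequency is invented for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the current strategy: score every split point by
/// the document count of its less frequent half and keep the highest score.
/// `word_counts` is an in-memory substitute for the `word_docids` database.
fn split_by_min_frequency<'a>(
    word_counts: &HashMap<&str, u64>,
    word: &'a str,
) -> Option<(&'a str, &'a str)> {
    let mut best: Option<(u64, &'a str, &'a str)> = None;
    // Skip the first boundary so that the left half is never empty.
    for (i, _) in word.char_indices().skip(1) {
        let (left, right) = word.split_at(i);
        let left_count = word_counts.get(left).copied().unwrap_or(0);
        let right_count = word_counts.get(right).copied().unwrap_or(0);
        // The pair is scored only from the two individual counts.
        let score = left_count.min(right_count);
        if score != 0 && best.map_or(true, |(old, _, _)| score > old) {
            best = Some((score, left, right));
        }
    }
    best.map(|(_, left, right)| (left, right))
}
```

With made-up counts such as no=900, where=500, now=800 and here=700, this sketch splits "nowhere" into "now" + "here", whether or not those two words ever appear next to each other in a document.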

possible enhancement

Inspired by the index.rs#word_documents_count method, a new method word_pair_frequency could be implemented in the trait search/query_tree.rs#Context using the word_pair_proximity_docids database instead of the word_docids one.
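
A rough sketch of what this could look like, under several assumptions: the Context trait below is a stripped-down hypothetical version (the real one has more methods and returns heed results), the signature of word_pair_frequency is a guess, and a HashMap replaces the word_pair_proximity_docids database. It also assumes that a proximity of 1 means the two sub-words are adjacent.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified version of the `Context` trait.
trait Context {
    /// Assumed signature: number of documents in which `left` is followed by
    /// `right` at the given proximity, or `None` if the pair is unknown.
    fn word_pair_frequency(&self, left: &str, right: &str, proximity: u8) -> Option<u64>;
}

/// In-memory test double standing in for `word_pair_proximity_docids`.
struct TestContext {
    pair_counts: HashMap<(String, String, u8), u64>,
}

impl Context for TestContext {
    fn word_pair_frequency(&self, left: &str, right: &str, proximity: u8) -> Option<u64> {
        let key = (left.to_string(), right.to_string(), proximity);
        self.pair_counts.get(&key).copied()
    }
}

/// Keep the split whose two halves most often appear side by side.
fn split_best_pair_frequency<'a>(
    ctx: &impl Context,
    word: &'a str,
) -> Option<(&'a str, &'a str)> {
    let mut best: Option<(u64, &'a str, &'a str)> = None;
    // Skip the first boundary so that the left half is never empty.
    for (i, _) in word.char_indices().skip(1) {
        let (left, right) = word.split_at(i);
        // Assumption: proximity 1 means the sub-words are adjacent.
        let freq = ctx.word_pair_frequency(left, right, 1).unwrap_or(0);
        if freq != 0 && best.map_or(true, |(old, _, _)| freq > old) {
            best = Some((freq, left, right));
        }
    }
    best.map(|(_, left, right)| (left, right))
}
```

Because the score comes from the pair itself, a split made of two individually common words that never occur together gets a frequency of 0 and is never selected.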

small warnings

the trait Context is implemented twice: once in search/query_tree.rs#QueryTreeBuilder, which is the real implementation, and once in the tests
the database word_pair_proximity_docids uses a different codec than word_docids, but a length decoder is already implemented in heed_codec/roaring_bitmap_length/

Files expected to be modified

milli/src/search/query_tree.rs

Sep 12 '22 12:09 ManyTheFish