milli
Enhance word splitting strategy
Today, the word splitting strategy of the query tree is handled by the function split_best_frequency.
This function splits a word into two sub-words by looking, for each possible pair, at the frequency of the less frequent sub-word.
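To make the current behaviour concrete, here is a simplified sketch of that strategy. It assumes an in-memory HashMap (the hypothetical `WordCounts` alias) in place of the word_docids database, and it assumes that the pair whose less frequent sub-word is the most frequent wins. The real function goes through the Context trait and returns a database result, so treat this as an illustration rather than the actual code.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the `word_docids` lookup:
/// maps a word to the number of documents containing it.
type WordCounts = HashMap<String, u64>;

/// Tries every split point of `word` and keeps the pair whose
/// less frequent sub-word has the highest document count.
fn split_best_frequency<'a>(counts: &WordCounts, word: &'a str) -> Option<(&'a str, &'a str)> {
    let mut best: Option<(u64, &'a str, &'a str)> = None;

    // Skip the first char index so that the left part is never empty.
    for (i, _) in word.char_indices().skip(1) {
        let (left, right) = word.split_at(i);

        // Unknown words count as zero documents.
        let left_freq = counts.get(left).copied().unwrap_or(0);
        let right_freq = counts.get(right).copied().unwrap_or(0);

        // The pair is scored by its less frequent sub-word only.
        let min_freq = left_freq.min(right_freq);
        if min_freq != 0 && best.map_or(true, |(old, _, _)| min_freq > old) {
            best = Some((min_freq, left, right));
        }
    }

    best.map(|(_, left, right)| (left, right))
}
```

With counts such as new = 50, york = 40 and ne = 5, the word newyork is split into new and york, because york (the less frequent of the two) still appears in more documents than any sub-word of the other candidate pairs.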
drawback
However, this frequency computation does not faithfully represent the frequency of the pair, because the two sub-words can each be frequent on their own without frequently appearing near each other in documents.
possible enhancement
Inspired by the index.rs#word_documents_count method, a new method word_pair_frequency could be implemented in the trait search/query_tree.rs#Context using the word_pair_proximity_docids database instead of the word_docids one.
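A minimal sketch of what this could look like, under several assumptions: the Context trait is reduced to two methods returning Option<u64> (no database error handling), the databases are replaced by in-memory maps in a hypothetical `InMemoryContext`, the word_pair_proximity_docids key is modelled as (word1, word2, proximity), and the pair is looked up at proximity 1. The helper `split_best_pair_frequency` is also hypothetical; it only shows how the split could be scored by the pair itself.

```rust
use std::collections::HashMap;

/// Reduced, hypothetical version of the `Context` trait.
trait Context {
    /// Number of documents containing `word`.
    fn word_documents_count(&self, word: &str) -> Option<u64>;
    /// Number of documents containing `left` followed by `right`
    /// at the given proximity (the proposed new method).
    fn word_pair_frequency(&self, left: &str, right: &str, proximity: u8) -> Option<u64>;
}

/// In-memory stand-in for the two databases; values are document
/// counts rather than encoded bitmaps.
struct InMemoryContext {
    word_docids: HashMap<String, u64>,
    word_pair_proximity_docids: HashMap<(String, String, u8), u64>,
}

impl Context for InMemoryContext {
    fn word_documents_count(&self, word: &str) -> Option<u64> {
        self.word_docids.get(word).copied()
    }

    fn word_pair_frequency(&self, left: &str, right: &str, proximity: u8) -> Option<u64> {
        let key = (left.to_string(), right.to_string(), proximity);
        self.word_pair_proximity_docids.get(&key).copied()
    }
}

/// Scores each candidate split by how often the two sub-words
/// appear next to each other, instead of by individual frequency.
fn split_best_pair_frequency<'a>(ctx: &impl Context, word: &'a str) -> Option<(&'a str, &'a str)> {
    let mut best: Option<(u64, &'a str, &'a str)> = None;

    for (i, _) in word.char_indices().skip(1) {
        let (left, right) = word.split_at(i);
        // Assumption: adjacent words are stored with proximity 1.
        let pair_freq = ctx.word_pair_frequency(left, right, 1).unwrap_or(0);
        if pair_freq != 0 && best.map_or(true, |(old, _, _)| pair_freq > old) {
            best = Some((pair_freq, left, right));
        }
    }

    best.map(|(_, left, right)| (left, right))
}
```

With this scoring, a word whose two halves are both common but never adjacent in any document is no longer split, which is the case the current strategy gets wrong.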
small warnings
- the trait Context is implemented twice: once in search/query_tree.rs#QueryTreeBuilder, which is the real implementation, and once in the tests
- the database word_pair_proximity_docids has a different codec than word_docids, but a length decoder is already implemented in heed_codec/roaring_bitmap_length/