milli icon indicating copy to clipboard operation
milli copied to clipboard

Harmless undefined behavior: prefix databases generation

Open ManyTheFish opened this issue 4 years ago • 0 comments

Prefix FST has no real "defined" threshold on the number of indexed words to choose if we need prefix databases. But there is an undefined behavior that adds that threshold, which is, by default, 1000 different indexed words in the index.

Why?

To choose if we create a prefix cache for a group of words that begin by the same characters we have a threshold that check if the number of words starting by the prefix is higher than 0.1% of the total number of words indexed in the index: https://github.com/meilisearch/milli/blob/cb45a10bcd249bddad7af55972d15a6cf4102e85/milli/src/update/words_prefixes_fst.rs#L52-L54

https://github.com/meilisearch/milli/blob/cb45a10bcd249bddad7af55972d15a6cf4102e85/milli/src/update/words_prefixes_fst.rs#L82-L85

But what happens if we have less than 1000 words indexed in our index?

 let threshold = 0.1/100; // 0.001
 let number_of_words = 999; 
 let min_number_of_words = (number_of_words as f64 * threshold) as size;
 // (999 * 0.001) = 0.999 -> as usize -> 0

the minimum number of words to index a prefix is 0 meaning that we would never enter in the if-condition: https://github.com/meilisearch/milli/blob/cb45a10bcd249bddad7af55972d15a6cf4102e85/milli/src/update/words_prefixes_fst.rs#L82-L85 because current_prefix_count is at least 1, and so, no prefix will be indexed.

I don't think that changing this condition like below is a good idea, because we would generate a prefix cache for all the words of the database.

 // There is enough words corresponding to this prefix to add it to the cache. 
- if current_prefix_count == min_number_of_words { 
+ if current_prefix_count >= min_number_of_words { 
     builder.insert(prefix)?; 
 } 

We may instead:

  • add another threshold on the minimum number of indexed words to choose if we create prefix databases.
  • replace the rate threshold with an absolute number threshold, if we have more than x words that have the same prefix we index the prefix

ManyTheFish avatar Aug 04 '21 13:08 ManyTheFish