Harmless undefined behavior: prefix databases generation
Prefix FST has no real "defined" threshold on the number of indexed words to choose if we need prefix databases. But there is an undefined behavior that adds that threshold, which is, by default, 1000 different indexed words in the index.
Why?
To choose if we create a prefix cache for a group of words that begin by the same characters we have a threshold that check if the number of words starting by the prefix is higher than 0.1% of the total number of words indexed in the index:
https://github.com/meilisearch/milli/blob/cb45a10bcd249bddad7af55972d15a6cf4102e85/milli/src/update/words_prefixes_fst.rs#L52-L54
https://github.com/meilisearch/milli/blob/cb45a10bcd249bddad7af55972d15a6cf4102e85/milli/src/update/words_prefixes_fst.rs#L82-L85
But what happens if we have less than 1000 words indexed in our index?
let threshold = 0.1/100; // 0.001
let number_of_words = 999;
let min_number_of_words = (number_of_words as f64 * threshold) as size;
// (999 * 0.001) = 0.999 -> as usize -> 0
the minimum number of words to index a prefix is 0 meaning that we would never enter in the if-condition:
https://github.com/meilisearch/milli/blob/cb45a10bcd249bddad7af55972d15a6cf4102e85/milli/src/update/words_prefixes_fst.rs#L82-L85
because current_prefix_count is at least 1, and so, no prefix will be indexed.
I don't think that changing this condition like below is a good idea, because we would generate a prefix cache for all the words of the database.
// There is enough words corresponding to this prefix to add it to the cache.
- if current_prefix_count == min_number_of_words {
+ if current_prefix_count >= min_number_of_words {
builder.insert(prefix)?;
}
We may instead:
- add another threshold on the minimum number of indexed words to choose if we create prefix databases.
- replace the rate threshold with an absolute number threshold, if we have more than
xwords that have the same prefix we index the prefix