hyperloglog.rs
hyperloglog.rs copied to clipboard
Incorrect `merge_sparse` for HLL++?
Please feel free to close this if my understanding is incorrect.
In the paper for HLL++, MERGE is defined as the following
Expects two sorted lists a and b, where the first is com-
pressed using a variable length and difference encoding.
Returns a list that is sorted and compressed in the same
way as a, and contains all elements from a and b, except
for entries where another element with the same index,
but higher q(w) value exists. This can be implemented
in a single linear pass over both lists
In the implementation of hyperloglogplus::merge_sparse, we are not comparing the values of q(w) and just insert duplicated indexes as-is. This might result incorrect values when Counting with sparse because we are potentially counting the same index multiple times.