
Serialization/deserialization of HyperLogLog objects leads to big error gap

Open Squalene opened this issue 2 years ago • 2 comments

Hi,

First of all, thank you very much for this implementation. While playing with the library, I found that serializing and deserializing a HyperLogLog object and then merging it into another one leads to a big drop in accuracy. Here is the code to reproduce:

Python: 3.9.16, HLL: 2.0.3

from HLL import HyperLogLog
import pickle
import random

random.seed(0)


def test_union_precision(serde=False):
    union_count = 1000
    candidate_values = [str(i) for i in range(100_000)]
    picked_values = set()
    agg_hll = HyperLogLog(p=8, seed=0)
    for _ in range(union_count):
        hll = HyperLogLog(p=8, seed=0)
        values = random.sample(candidate_values, k=random.randint(0, 100))
        picked_values.update(values)
        for v in values:
            hll.add(v)
        if serde:
            # Round-trip the per-batch sketch through pickle before merging.
            hll = pickle.loads(pickle.dumps(hll))
        agg_hll.merge(hll)

    # Ratio of estimated to exact cardinality; 1.0 would be a perfect estimate.
    return agg_hll.cardinality() / len(picked_values)


print(test_union_precision(serde=False), test_union_precision(serde=True))

gives

1.048  0.130

I have looked at the previously resolved issues, and my registers are indeed identical before and after serialization/deserialization. I therefore suspect the error is somewhere else, but I am not familiar enough with the codebase to find it.

Thank you in advance for your help

Squalene, Apr 17 '23 07:04

Thanks for reporting this. I suspect this is related to serialization/deserialization of the registers when the object is in sparse representation. I will investigate. As a temporary workaround, you can pass sparse=False to the HyperLogLog constructor, e.g. hll = HyperLogLog(p=8, seed=0, sparse=False).
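To check whether the workaround helps in a given setup, one option is to compare the merged estimate with and without the pickle round-trip for each constructor configuration. The following is only a sketch: `merged_estimate`, `serde_gap`, and the set-based `ExactSketch` stand-in are hypothetical names introduced for illustration, not part of the HLL library. The harness assumes nothing beyond the `add`, `merge`, and `cardinality` methods used in the reproduction above.

```python
import pickle


class ExactSketch:
    """Hypothetical stand-in exposing the same add/merge/cardinality surface."""

    def __init__(self):
        self._items = set()

    def add(self, value):
        self._items.add(value)

    def merge(self, other):
        self._items |= other._items

    def cardinality(self):
        return float(len(self._items))


def merged_estimate(make_sketch, batches, serde=False):
    """Build one sketch per batch, merge them all, and return the estimate."""
    agg = make_sketch()
    for batch in batches:
        sketch = make_sketch()
        for value in batch:
            sketch.add(value)
        if serde:
            # Same round-trip as in the reproduction script.
            sketch = pickle.loads(pickle.dumps(sketch))
        agg.merge(sketch)
    return agg.cardinality()


def serde_gap(make_sketch, batches):
    """Relative difference between the plain and the pickled merge paths."""
    plain = merged_estimate(make_sketch, batches, serde=False)
    pickled = merged_estimate(make_sketch, batches, serde=True)
    return abs(plain - pickled) / plain if plain else 0.0


# Overlapping batches of string values, similar in spirit to the reproduction.
batches = [[str(i) for i in range(start, start + 50)] for start in range(0, 1000, 25)]
print(serde_gap(ExactSketch, batches))
```

With the library installed, `make_sketch` could be `lambda: HyperLogLog(p=8, seed=0)` for the default case and `lambda: HyperLogLog(p=8, seed=0, sparse=False)` for the workaround. Because both paths see the same batches, a gap near zero with sparse=False and a large gap with the default would be consistent with the sparse-representation hypothesis, though it would not prove it.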

ascv, Apr 17 '23 20:04

This indeed solves the issue, thank you.

Squalene, Apr 18 '23 06:04

This should be fixed in #47.

ascv, Aug 04 '24 05:08