Serialization/deserialization of HyperLogLog objects leads to a large estimation error
Hi,
First of all, thank you very much for this implementation. While playing with the library, I found that serializing and deserializing a HyperLogLog object and then merging it into another one causes a large drop in accuracy. Here is the code to reproduce it:
Python: 3.9.16, HLL: 2.0.3
from HLL import HyperLogLog
import random
import pickle

random.seed(0)

def test_union_precision(serde=False):
    union_count = 1000
    candidate_values = [str(i) for i in range(100_000)]
    picked_values = set()  # exact set of values, used as ground truth
    agg_hll = HyperLogLog(p=8, seed=0)
    for _ in range(union_count):
        hll = HyperLogLog(p=8, seed=0)
        values = random.sample(candidate_values, k=random.randint(0, 100))
        picked_values.update(values)
        for v in values:
            hll.add(v)
        if serde:
            # Round-trip the per-batch sketch through pickle before merging.
            hll = pickle.loads(pickle.dumps(hll))
        agg_hll.merge(hll)
    # Ratio of estimated to true cardinality; 1.0 would be a perfect estimate.
    deviation = agg_hll.cardinality() / len(picked_values)
    return deviation

print(test_union_precision(serde=False), test_union_precision(serde=True))
gives
1.048 0.130
I have seen the previously resolved issues, and my registers are indeed identical before and after serialization/deserialization, so I suspect the error is somewhere else, but I am not familiar enough with the codebase to find it.
Thank you in advance for your help.
Thanks for reporting this. I suspect this is related to serialization/deserialization of the registers when the object is in sparse representation. I will investigate. As a temporary workaround, you can pass sparse=False to the HyperLogLog constructor, e.g. hll = HyperLogLog(p=8, seed=0, sparse=False).
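One note on the reproduction: the two calls to test_union_precision draw different random samples, because the generator state advances between them. That is unlikely to explain a gap this large, but feeding identical batches to both paths isolates the effect of serialization. Below is a minimal sketch of such a check. The helper name serde_merge_gap and its adapter arguments (new, add, merge, count) are hypothetical and not part of the HLL library. A plain Python set is used as a stand-in sketch so the snippet runs without the library.

```python
import pickle
import random


def serde_merge_gap(new, add, merge, count, batches):
    """Merge the same batches twice: directly, and after a pickle round trip.

    Returns (direct_count, roundtrip_count). The two values should be equal
    if serialization preserves the sketch state.
    """
    direct, roundtrip = new(), new()
    for batch in batches:
        part = new()
        for value in batch:
            add(part, value)
        # Round-trip first, so both merges see the same per-batch state.
        restored = pickle.loads(pickle.dumps(part))
        merge(direct, part)
        merge(roundtrip, restored)
    return count(direct), count(roundtrip)


# Same batch shape as the report: 1000 batches of 0 to 100 values each.
rng = random.Random(0)
candidates = [str(i) for i in range(100_000)]
batches = [rng.sample(candidates, k=rng.randint(0, 100)) for _ in range(1000)]

# Stand-in sketch (assumption): a plain set, which counts exactly.
direct, roundtrip = serde_merge_gap(set, set.add, set.update, len, batches)
print(direct == roundtrip)

# With the library from this thread (requires `from HLL import HyperLogLog`):
# serde_merge_gap(
#     lambda: HyperLogLog(p=8, seed=0, sparse=False),
#     lambda s, v: s.add(v),
#     lambda a, b: a.merge(b),
#     lambda s: s.cardinality(),
#     batches,
# )
```

With the set stand-in the two counts are equal by construction. With a HyperLogLog factory, any difference between the two returned values comes from the pickle round trip alone, since both aggregates are built from the same batches.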
This indeed solves the issue, thank you.
This should be fixed in #47.