PARQUET-41: Add bloom filters to parquet statistics
This is the PR for the parquet-mr part.
@spena @rdblue Could you help me review this patch? Thank you!
It's looking good, Ferd.
Here are some questions I have.
- Should we fall back from bloom filters when the bloom filter is not a good fit for the row group? Dictionary encoding does this.
- If a value is found in the dictionary, is there a way to skip the bloom hashing for better write performance, and then add the values to the bloom filter if the dictionary encoding falls back?
- Is there a way to calculate the # of expected entries instead of asking the user to pass a value?
Hi @spena, please see my inline comments below. Thank you! (Sorry for the delay, I am on holiday :<)
Should we fall back from bloom filters when the bloom filter is not a good fit for the row group? Dictionary encoding does this.
I have not added fallback support at this point. If it turns out to be useful, I think we could do it in a follow-up ticket.
If a value is found in the dictionary, is there a way to skip the bloom hashing for better write performance, and then add the values to the bloom filter if the dictionary encoding falls back?
The bloom filter is used to filter an entire row group, in the same way as min/max statistics. I am not very familiar with dictionary encoding in Parquet, but I think the bloom filter should be applied before dictionary encoding.
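To illustrate what "filter an entire row group in the same way as min/max statistics" can mean, here is a minimal sketch. It is not the code in this patch: the class `RowGroupBloomSketch`, the method `canSkipRowGroup`, and the hashing scheme (a splitmix64-style mixer with double hashing) are all hypothetical choices made for the example.

```java
import java.util.BitSet;

// Hypothetical illustration only; not the bloom filter implementation from this PR.
final class RowGroupBloomSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    RowGroupBloomSketch(int numBits, int numHashes) {
        if (numBits <= 0 || numHashes <= 0) {
            throw new IllegalArgumentException("numBits and numHashes must be positive");
        }
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // splitmix64-style finalizer, used here as a simple 64-bit mixing function.
    private static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
        z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
        return z ^ (z >>> 31);
    }

    // Double hashing: derive the i-th bit index from two 32-bit halves of one hash.
    private int index(long value, int i) {
        long h = mix(value);
        int h1 = (int) h;
        int h2 = (int) (h >>> 32);
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Called for every value written to the row group.
    void add(long value) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(value, i));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(long value) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(value, i))) {
                return false;
            }
        }
        return true;
    }

    // Row-group pruning for an equality predicate: check min/max first,
    // then consult the bloom filter for values that fall inside the range.
    static boolean canSkipRowGroup(RowGroupBloomSketch filter, long min, long max, long wanted) {
        if (wanted < min || wanted > max) {
            return true;
        }
        return !filter.mightContain(wanted);
    }
}
```

A bloom filter never reports a false negative, so skipping a row group on a negative answer is safe. A positive answer only means the row group still has to be read.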
Is there a way to calculate the # of expected entries instead of asking the user to pass a value?
I tried to think of a way to calculate it but could not come up with a good one. I also think nobody understands the data better than the person who uses it.
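For context on why the expected number of entries matters: with the textbook bloom filter formulas, the bit count and hash count both follow from the expected entries n and a target false-positive probability p. The sketch below uses those standard formulas only. The class and method names are hypothetical, and whether this patch sizes its filter this way is an assumption.

```java
// Hypothetical helper; uses the standard bloom filter sizing formulas.
final class BloomSizingSketch {
    private BloomSizingSketch() {}

    // Optimal number of bits: m = -n * ln(p) / (ln 2)^2
    static long optimalNumBits(long expectedEntries, double falsePositiveProbability) {
        if (expectedEntries <= 0) {
            throw new IllegalArgumentException("expectedEntries must be positive");
        }
        if (!(falsePositiveProbability > 0.0 && falsePositiveProbability < 1.0)) {
            throw new IllegalArgumentException("falsePositiveProbability must be in (0, 1)");
        }
        double ln2 = Math.log(2);
        return (long) Math.ceil(-expectedEntries * Math.log(falsePositiveProbability) / (ln2 * ln2));
    }

    // Optimal number of hash functions: k = (m / n) * ln 2, and at least 1.
    static int optimalNumHashes(long expectedEntries, long numBits) {
        if (expectedEntries <= 0 || numBits <= 0) {
            throw new IllegalArgumentException("expectedEntries and numBits must be positive");
        }
        double k = (double) numBits / expectedEntries * Math.log(2);
        return Math.max(1, (int) Math.round(k));
    }
}
```

For example, `optimalNumBits(1_000_000L, 0.01)` comes to roughly 9.6 bits per entry. If the real entry count is underestimated, the false-positive rate rises above the target. If it is overestimated, the filter wastes space. That is the trade-off behind asking the user for the value.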