parquet-java

PARQUET-41: Add bloom filters to parquet statistics

Open winningsix opened this issue 10 years ago • 3 comments

This is the PR for the parquet-mr part.

winningsix avatar Jun 17 '15 05:06 winningsix

@spena @rdblue Could you help me review this patch? Thank you!

winningsix avatar Jun 17 '15 05:06 winningsix

It's looking good, Ferd.

Here are some questions I have.

  • Should we fall back for bloom filters when the bloom filter is not a good fit for the row group? Dictionary encoding does this.
  • If a value is found in the dictionary, is there a way to skip the bloom hashing for better write performance, and then add the values to the bloom filter if dictionary encoding falls back?
  • Is there a way to calculate the # of expected entries instead of asking the user to pass a value?

spena avatar Jun 17 '15 17:06 spena

Hi @spena, please see my inline comments below. Thank you! (Sorry for the delay; I have been on holiday :<)

Should we fall back for bloom filters when the bloom filter is not a good fit for the row group? Dictionary encoding does this.

At this point, I have not added support for fallback. If it turns out to be useful, I think we could do it in a follow-up ticket.

If a value is found in the dictionary, is there a way to skip the bloom hashing for better write performance, and then add the values to the bloom filter if dictionary encoding falls back?

The bloom filter is used to filter out an entire row group, in the same way as min/max statistics. I am not very familiar with dictionary encoding in Parquet, but I think the bloom filter should be applied before dictionary encoding.
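
To illustrate what "filter an entire row group in the same way as min/max statistics" could look like, here is a minimal sketch. It is not the code in this patch: `RowGroupPruningSketch`, `RowGroupSummary`, and the hash mixing are all hypothetical, chosen only to show the idea that a reader can skip a row group when either the min/max range or the bloom filter rules the value out.

```java
import java.util.BitSet;

/** Hypothetical sketch; none of these names are parquet-java APIs. */
class RowGroupPruningSketch {

    /** Per-row-group summary: min/max statistics plus a small bloom filter. */
    static final class RowGroupSummary {
        private final int numBits;
        private final int numHashes;
        private final BitSet bits;
        private long min = Long.MAX_VALUE;
        private long max = Long.MIN_VALUE;

        RowGroupSummary(int numBits, int numHashes) {
            this.numBits = numBits;
            this.numHashes = numHashes;
            this.bits = new BitSet(numBits);
        }

        /** Called once per value while the row group is being written. */
        void add(long value) {
            min = Math.min(min, value);
            max = Math.max(max, value);
            long h = mix(value);
            int h1 = (int) h;
            int h2 = (int) (h >>> 32);
            // Double hashing: derive numHashes bit positions from two 32-bit halves.
            for (int i = 0; i < numHashes; i++) {
                bits.set(Math.floorMod(h1 + i * h2, numBits));
            }
        }

        /** Returns false only if the row group definitely does not contain the value. */
        boolean mightContain(long value) {
            // Cheap range check first, exactly like min/max statistics.
            if (value < min || value > max) {
                return false;
            }
            long h = mix(value);
            int h1 = (int) h;
            int h2 = (int) (h >>> 32);
            for (int i = 0; i < numHashes; i++) {
                if (!bits.get(Math.floorMod(h1 + i * h2, numBits))) {
                    return false; // one unset bit proves absence
                }
            }
            return true; // possibly present (may be a false positive)
        }
    }

    /** 64-bit mixing function (splitmix64 finalizer), an arbitrary choice for this sketch. */
    static long mix(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }
}
```

A reader would call `mightContain` once per row group for an equality predicate and skip the row group when it returns false. A true result only means the row group has to be read, since bloom filters can report false positives but never false negatives.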

Is there a way to calculate the # of expected entries instead of asking the user to pass a value?

I tried to think of a way to calculate it but did not come up with a good approach. In any case, I think nobody understands the data better than the person who uses it.
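
For context on why the expected number of entries matters, the standard bloom filter sizing formulas take it as an input: for n expected entries and a target false-positive probability p, the bit count is m = -n ln(p) / (ln 2)^2 and the number of hash functions is k = (m / n) ln 2. The sketch below just evaluates those formulas; `BloomSizingSketch` and its method names are hypothetical and are not taken from this patch.

```java
/** Hypothetical helper; not part of parquet-java or of this patch. */
final class BloomSizingSketch {
    private BloomSizingSketch() {}

    /** Bits needed for the given expected entry count and false-positive probability. */
    static long optimalNumBits(long expectedEntries, double fpp) {
        if (expectedEntries <= 0 || fpp <= 0.0 || fpp >= 1.0) {
            throw new IllegalArgumentException("need expectedEntries > 0 and 0 < fpp < 1");
        }
        double ln2 = Math.log(2.0);
        // m = -n * ln(p) / (ln 2)^2, rounded up to a whole bit.
        return (long) Math.ceil(-expectedEntries * Math.log(fpp) / (ln2 * ln2));
    }

    /** Number of hash functions that minimizes false positives for the given size. */
    static int optimalNumHashes(long expectedEntries, long numBits) {
        if (expectedEntries <= 0 || numBits <= 0) {
            throw new IllegalArgumentException("need expectedEntries > 0 and numBits > 0");
        }
        // k = (m / n) * ln 2, with at least one hash function.
        return Math.max(1, (int) Math.round((double) numBits / expectedEntries * Math.log(2.0)));
    }
}
```

For 1,000 expected entries at a 1% false-positive probability, this works out to roughly 9.6 bits per entry and 7 hash functions. If the user underestimates the entry count, the filter fills up and the real false-positive rate rises above the target, which is why the value is hard to pick automatically.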

winningsix avatar Jun 23 '15 07:06 winningsix