PARQUET-2237 Improve performance when filters in RowGroupFilter can match exactly
If we can make an exact judgment from the min/max statistics, we no longer need to load the dictionary from the filesystem and compare values one by one.
Similarly, the BloomFilter has to be loaded from the filesystem, which costs time and memory. If we can determine the existence/nonexistence of a value exactly from the min/max or dictionary filters, we can avoid the BloomFilter entirely and improve performance.
For example,
- When reading data greater than `x1` in a block: if the min/max statistics show that every value is greater than `x1`, we don't need to read the dictionary from the filesystem and compare one by one.
- If we already have the page dictionaries and have compared one by one, we don't need to read the BloomFilter from the filesystem and compare.
- some more cases...
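The cases above can be sketched as a three-way result instead of a plain boolean. This is a minimal illustration with hypothetical names (`MatchResult`, `evaluateGt` are not the actual parquet-mr API): when the min/max statistics alone decide a "greater than `x1`" predicate, neither the dictionary nor the BloomFilter needs to be read.

```java
// Hypothetical sketch of the short-circuit idea, not parquet-mr code.
public class RowGroupShortCircuit {
  public enum MatchResult { MUST_MATCH, MIGHT_MATCH, CANNOT_MATCH }

  // Evaluate "col > x1" using only the column chunk's min/max statistics.
  public static MatchResult evaluateGt(long min, long max, long x1) {
    if (min > x1) return MatchResult.MUST_MATCH;    // all values match: skip dictionary/bloom I/O
    if (max <= x1) return MatchResult.CANNOT_MATCH; // no value matches: drop the row group
    return MatchResult.MIGHT_MATCH;                 // inconclusive: only now is further I/O worth it
  }
}
```

Only the `MIGHT_MATCH` branch would ever fall through to the dictionary or BloomFilter checks.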
@wgtmac Thanks for the review! I will address your comments; I've updated the PR description to explain in more detail. I think we can do more to avoid reading the dictionary or BloomFilter from the filesystem in some places.
cc @gszadovszky @ggershinsky @shangxinli
It seems the code compatibility check error is caused by modifying the visitor return value... Can we remove that restriction? Or should I keep code compatibility and add a new flag to mark it? I will add more UTs later.
Unfortunately we cannot modify the signature of any public methods. My suggestion was to make the new enum serve as an internal state of the visitor (and probably use it to terminate evaluation early), then add a new method to return the final state. Does that work?
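A sketch of this suggestion, with hypothetical class and method names (not parquet-mr's actual visitor API): the public `Boolean`-returning method keeps its signature, the precise outcome is recorded in an internal enum, and a new method exposes it.

```java
// Hypothetical sketch: internal enum state plus a new accessor, signatures unchanged.
public class StatisticsVisitorSketch {
  public enum BlockMatch { MUST_MATCH, MIGHT_MATCH, CANNOT_MATCH }

  private BlockMatch state = BlockMatch.MIGHT_MATCH;

  // Signature stays Boolean for compatibility; true still means
  // "this block can be dropped", exactly as before.
  public Boolean visitGt(long min, long max, long x1) {
    if (min > x1) state = BlockMatch.MUST_MATCH;
    else if (max <= x1) state = BlockMatch.CANNOT_MATCH;
    else state = BlockMatch.MIGHT_MATCH;
    return state == BlockMatch.CANNOT_MATCH;
  }

  // New method: callers that need the exact state (e.g. to skip the
  // dictionary/BloomFilter checks, or to terminate early) read it here.
  public BlockMatch getMatchState() { return state; }
}
```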
+1, let's not modify the signature.
@wgtmac @shangxinli I thought of a way to avoid modifying the interface by distinguishing the results via Boolean object identity. Please take a look.
Emmmm, if this way is not suitable, I can record it in an internal variable of the filter instead and keep compatibility.
@wgtmac Thanks for your review, I will add UT later.
Sorry, the `Boolean` type has to be used here so that we can distinguish `BLOCK_MIGHT_MATCH` from `BLOCK_MUST_MATCH` (we compare by Java reference, not by value). For example:

```java
Boolean b1 = new Boolean(true); // deliberately not Boolean.TRUE: we need distinct instances
Boolean b2 = new Boolean(true); // (the constructor is deprecated, but that is the point here)
boolean b3 = true;
boolean b4 = true;
assert b1 != b2;      // distinct objects, so reference comparison tells them apart
assert b1.equals(b2); // while they are still equal by value
assert b3 == b4;      // primitives always compare by value
```
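A minimal sketch of how such sentinels could be used (names hypothetical): both sentinels unbox to `false` ("cannot drop the block"), so legacy callers that only check the boolean value behave exactly as before, while new callers use reference identity to learn whether the block is an exact match.

```java
// Hypothetical sentinel sketch, not the actual parquet-mr constants.
public class BlockMatchSentinels {
  @SuppressWarnings("deprecation") // fresh, reference-distinguishable instances are the point
  public static final Boolean BLOCK_MIGHT_MATCH = new Boolean(false);
  @SuppressWarnings("deprecation")
  public static final Boolean BLOCK_MUST_MATCH = new Boolean(false);

  // Legacy behaviour: only the boolean value matters.
  public static boolean canDrop(Boolean result) {
    return result;
  }

  // New behaviour: reference identity carries the extra information.
  public static boolean mustMatch(Boolean result) {
    return result == BLOCK_MUST_MATCH;
  }
}
```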
cc @zhongyujiang Not sure if you are interested in reviewing this PR.
@gszadovszky @shangxinli If you have time, please also take a look, thanks~
@wgtmac @gszadovszky I have a proposal to automatically build the BloomFilter with a more precise size. I created a jira, https://issues.apache.org/jira/browse/PARQUET-2254, and I hope to get your opinions, thank you.
Currently the usage is to specify a size and then build the BloomFilter. In general scenarios, the number of distinct values is not actually known in advance. If the BloomFilter could be sized automatically according to the data, the file size could be reduced and read efficiency improved. My idea: the user specifies a maximum BloomFilter size, and we build multiple BloomFilters of different sizes at the same time. We can use the largest BloomFilter as a counting tool (if a value is not already present when inserting, a counter is incremented; this may be imprecise, but it is good enough). Then, when the file is finally written, we choose the BloomFilter of the most appropriate size.
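An illustrative sketch of this proposal (all names hypothetical, not parquet-mr code): feed values into several candidate filters, use the largest one as a rough distinct-value counter, and at finalization keep the smallest candidate that still gives enough bits per distinct value.

```java
import java.util.BitSet;

// Toy two-hash bloom filter plus the size-selection idea. Illustration only.
public class AdaptiveBloomSketch {
  static class SimpleBloom {
    final BitSet bits;
    final int sizeBits;
    long roughNdv = 0; // +1 whenever an inserted value was not already present (imprecise on purpose)

    SimpleBloom(int sizeBits) { this.sizeBits = sizeBits; this.bits = new BitSet(sizeBits); }

    void put(long value) {
      int h1 = Math.floorMod(Long.hashCode(value), sizeBits);
      int h2 = Math.floorMod(Long.hashCode(value * 31 + 17), sizeBits);
      if (!(bits.get(h1) && bits.get(h2))) roughNdv++; // counting tool: value looks new
      bits.set(h1);
      bits.set(h2);
    }
  }

  // candidateSizes must be ascending; the largest (bounded by the user-set
  // maximum) doubles as the counting tool.
  public static int chooseSizeBits(long[] values, int[] candidateSizes, int bitsPerDistinct) {
    SimpleBloom[] candidates = new SimpleBloom[candidateSizes.length];
    for (int i = 0; i < candidates.length; i++) candidates[i] = new SimpleBloom(candidateSizes[i]);
    for (long v : values) {
      for (SimpleBloom b : candidates) b.put(v);
    }
    long ndvEstimate = candidates[candidates.length - 1].roughNdv;
    for (SimpleBloom b : candidates) {
      if (b.sizeBits >= (long) bitsPerDistinct * ndvEstimate) return b.sizeBits;
    }
    return candidateSizes[candidateSizes.length - 1]; // fall back to the maximum
  }
}
```

The memory cost is bounded by the sum of the candidate sizes, which is dominated by the user-set maximum when sizes grow geometrically.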
Thanks @yabola for coming up with this idea. Let's continue the discussion about the BloomFilter building idea in the jira.
Meanwhile, I've been thinking about the actual problem as well. Currently, for row group filtering we check the min/max values first, which is correct since this is the fastest thing to do. Then the dictionary, and then the bloom filter. The ordering of the latter two is not obvious to me in every scenario. At the time of filtering we have not started reading the actual row group, so there is no I/O advantage in reading the dictionary first. Furthermore, searching in the bloom filter is much faster than in the dictionary, and the size of the bloom filter is probably smaller than the size of the dictionary. Though, it would require some measurements to prove whether it is a good idea to get the bloom filter before the dictionary. What do you think?
What I did in production is to issue async I/Os for the dictionaries (if all data pages in that column chunk are dictionary-encoded and the dictionary is not big) and the bloom filters of selected row groups in advance. The reason is to eliminate blocking I/O when pushing down the predicates. However, the parquet specs only record the offset to the bloom filter, so I also added the length of each bloom filter in the key-value metadata section (probably a good reason to add it to the specs as well?).
It is a good idea to adjust the filter order and prefer the lighter filters first.
But I have some concerns (not sure if they are correct):
In Parquet, a dictionary is only generated for low-cardinality data (see `parquet.dictionary.page.size`, 1 MB), while a BloomFilter is usually used on high-cardinality columns(?). So ideally only one of the two will be present(?).
And ideally, more radically, we could consult only one of the two (not check both). If there is a BloomFilter and the predicate is `=` or `in`, use only the BloomFilter (whether it matches or not); otherwise use the dictionary.
But this lacks practical scenarios, and I am not sure which is the better choice; I need to think more about it.
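The "consult only one of the two" idea could be sketched like this (hypothetical names): for `=`/`in` predicates a bloom filter already answers the "might this row group contain the value" question, so a dictionary lookup after it is redundant; for other predicates the bloom filter cannot help at all and the dictionary is the only option.

```java
// Hypothetical decision sketch, not parquet-mr's actual filter dispatch.
public class SecondaryFilterChooser {
  public enum Choice { BLOOM, DICTIONARY, NONE }

  public static Choice choose(boolean hasBloom, boolean hasDictionary, boolean isEqOrIn) {
    if (hasBloom && isEqOrIn) return Choice.BLOOM;  // decisive for =/in, match or not
    if (hasDictionary) return Choice.DICTIONARY;    // bloom is useless for range predicates
    return Choice.NONE;                             // fall back to min/max only
  }
}
```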