parquet-format icon indicating copy to clipboard operation
parquet-format copied to clipboard

PARQUET-2249: Introduce IEEE 754 total order & NaN-counts

Open JFinis opened this issue 6 months ago • 7 comments

This commit is a combination of the following PRs:

  • Introduce IEEE 754 total order https://github.com/apache/parquet-format/pull/221
  • Add nan_count to handle NaNs in statistics https://github.com/apache/parquet-format/pull/196

Both these PRs try to solve the same problems; read the description of the respective PRs for explanation.

This PR is the result of an extended discussion in which it was repeatedly brought up that another possible solution to the problem could be the combination of the two approaches. Please refer to this discussion on the mailing list and in the two PRs for details. the mailing list discussion can be found here: https://lists.apache.org/thread/lzh0dvrvnsy8kvflvl61nfbn6f9js81s

The contents of this PR are basically a straightforward combination of the two approaches:

  • IEEE total order is introduced as a new order for floating point types
  • nan_count and nan_counts fields are added

Legacy writers may not write nan_count(s) fields, so readers have to handle them being absent. Also, legacy writers may have included NaNs into min/max bounds, so readers also have to handle that.

As there are no legacy writers writing IEEE total order, nan_count(s) are defined to be mandatory if this order is used, so readers can assume their presense when this order is used.

This commit removes nan_pages from the ColumnIndex, which the nan_counts PR mandated. We don't need them anymore, as IEEE total order solves this: As this is a new order and there are thus no legacy writers and readers, we have the freedom to define that for only-NaN pages using this order, we can actually write NaNs into min & max bounds in this case, and readers can assume that isNaN(min) signals an only-NaN page.

Note: If this PR is chosen to be adopted, I will update the commit message to state the problem without just pointing onto the other PRs. But until then, I rather link to them instead of repeating myself, in case this PR isn't chosen anyway.

Rationale for this change

What changes are included in this PR?

Do these changes have PoC implementations?

JFinis avatar Aug 09 '25 12:08 JFinis

@tustvold @crepererum @westonpace I would especially like to get your opions on this, as you were proponents of total order for simplicity. Does this PR still cover what you wanted to get out of this?

@mapleFU @etseidl @gszadovszky @pitrou FYI, as you were part of the initial discussion on the previous PRs.

JFinis avatar Aug 13 '25 09:08 JFinis

@tustvold

Am I understanding correctly that the change to add nan_counts means that NaNs are now excluded from the min/max statistics? I believe this creates a new problem for engines that use total ordering as the sign of the NaNs is not known, and therefore there are scenarios where predicates can't be pushed down where they could be in the absence of nan_counts.

No, the predicates still can be pushed down, regardless of the semantics of your engine.

If your engine uses total ordering with predicate col > c you skip a page when the statistics indicate col_max <= c && nan_count == 0. If your engine uses partial ordering you simply check if col_max <= c.

My 2 cents is that there probably is no perfect solution here, adding nan_counts is worse for some engines and better for some others

It's not worse for any engine as far as I can see, a strict improvement for everyone in filtering capabilities.

I don't really follow why NaNs are special in this regard, any extreme value has the same effect

NaNs aren't extreme values for engines without total ordering, they are simply unordered.

orlp avatar Aug 13 '25 10:08 orlp

I seem to remember from a previous discussion around the need to have signed NaNs to allow for predicate pushdown, but I can't remember the example.

Regardless my broader point still stands that we are adding a non-trivial amount of complexity, there is a lot of subtlety both at read and write time, to yield an improvement that to be brutally honest I am not really sure will benefit all that many real workloads. It's all tradeoffs, if people want to and are motivated to proceed with the hybrid approach I don't feel strongly, but I personally prefer the simpler option that is more likely to be implemented correctly across the many many parquet implementations. The proposal here is strictly more complex than the existing specification which implementations still implement incorrectly.

tustvold avatar Aug 13 '25 10:08 tustvold

I seem to remember from a previous discussion around the need to have signed NaNs to allow for predicate pushdown, but I can't remember the example.

Signed NaN counts are only useful to avoid treating a -NaN as a false positive for the col > c case for total ordering engines. To give a concrete example, if the page contains [-NaN, -3, 42] and the predicate is col > 100 then this page could be skipped by such an engine with statistics total_min = -NaN, total_max = 42 but not with min = -3, max = 42, nan_count = 1, because that singular nan could be positive NaN which would match +NaN > 100. Thus if signed nan counts are unavailable any NaN regardless of sign will have to be treated as an extremal value by those engines.

But only having a single NaN count doesn't block predicate pushdown whatsoever, and since you're fine with NaN poisoning anyway I don't think you'd care about the above -NaN edge case.

orlp avatar Aug 13 '25 10:08 orlp

I seem to remember from a previous discussion around the need to have signed NaNs to allow for predicate pushdown, but I can't remember the example.

Regardless my broader point still stands that we are adding a non-trivial amount of complexity, there is a lot of subtlety both at read and write time, to yield an improvement that to be brutally honest I am not really sure will benefit all that many real workloads. It's all tradeoffs, if people want to and are motivated to proceed with the hybrid approach I don't feel strongly, but I personally prefer the simpler option that is more likely to be implemented correctly across the many many parquet implementations. The proposal here is strictly more complex than the existing specification which implementations still implement incorrectly.

Gut feeling wise, I agree with your assessment. I feel like I personally can implement this correctly in our engine, but I have now spent years of my life thinking about this, so I guess I'm not representative for the average Parquet maintainer. I also feel that this has such a high chance of being implemented incorrectly that the more straightforward solution of just IEEE total ordering would be preferrable. My gut feels like the risk of added implementation complexity and thus chance for wrong implementations is higher than the advantage of not having "NaN poisoning" of statistics.

JFinis avatar Aug 13 '25 10:08 JFinis

I have filed a ticket to track trying to implement this in the arrow-rs implementation of Parquet, which will hopefully help us quantify how complicated this proposal is to implement:

  • https://github.com/apache/arrow-rs/issues/8156

alamb avatar Aug 15 '25 13:08 alamb

@tustvold @crepererum @westonpace I would especially like to get your opions on this, as you were proponents of total order for simplicity. Does this PR still cover what you wanted to get out of this?

@mapleFU @etseidl @gszadovszky @pitrou FYI, as you were part of the initial discussion on the previous PRs.

I think we have waited enough time for feedback. Do you want to start a vote with two proposals on the dev ML, @JFinis?

wgtmac avatar Sep 04 '25 07:09 wgtmac