hive icon indicating copy to clipboard operation
hive copied to clipboard

HIVE-28302: Let SUM UDF return NULL when all rows have non-numeric texts

Open okumin opened this issue 1 year ago • 1 comments

What changes were proposed in this pull request?

Make SUM return not 0.0 but NULL when both of the following conditions are satisfied.

  • The input type is a STRING like type
  • All values are non-numeric and it is impossible to successfully cast to DOUBLE

https://issues.apache.org/jira/browse/HIVE-28302

Why are the changes needed?

I understand the SQL standard requires a UDAF to return NULL when all rows are NULL. Actually, other UDAFs such as AVG behave like that.

We can see some more discussions in https://github.com/apache/hive/pull/5091

Does this PR introduce any user-facing change?

Yes. But I believe the current behavior is not an intentional one but a kind of bug.

Is the change a dependency upgrade?

No.

How was this patch tested?

Updated integration tests.

okumin avatar Jun 06 '24 12:06 okumin

In the test cbo_aggregate_reduce_functions_rule.q aggregate functions under test (sum, count, stddev*, etc) are wrapped into a round function call. I understand the reason however I think it can cause some noise: if an expression like

ROUND(SUM(c_numeric), 3)

returns invalid result it can have multiple reasons: sum is wrong or round is wrong or both. In theory if both are wrong as a side effect the expression itself can return a good result in some cases hence a potential bug in both function implementations remains hidden.

WDYT?

kasakrisz avatar Jul 05 '24 07:07 kasakrisz

@kasakrisz I don't have any objections. The original intention is human-readability and floating-point error mitigation. I tried to remove ROUND.

To be fair, we observe difference between CBO and non-CBO.

< +228.0  228.0   0.0     0.0     193.0   193.0   28.5    28.5    NULL    NULL    32.166666666666664      32.166666666666664      47.34448225506326       47.34448225506326       NULL    NULL    53.52387836802893       53.52387836802893       50.61338050075578  50.61338050075578       NULL    NULL    58.63247109466449       58.63247109466449       2241.5  2241.5  NULL    NULL    2864.805555555555       2864.805555555555       2561.714285714286       2561.714285714286       NULL    NULL    3437.7666666666664 3437.7666666666664      228.0   228.0   NULL    NULL    193.0   193.0   8       8       8       0       8       6
---
> +228.0  228.0   0.0     0.0     193.0   193.0   28.5    28.5    NULL    NULL    32.166666666666664      32.166666666666664      47.34448225506326       47.34448225506326       NULL    NULL    53.52387836802894       53.52387836802894       50.61338050075578  50.61338050075578       NULL    NULL    58.632471094664496      58.632471094664496      2241.5  2241.5  NULL    NULL    2864.805555555556       2864.805555555556       2561.714285714286       2561.714285714286       NULL    NULL    3437.7666666666673 3437.7666666666673      228.0   228.0   NULL    NULL    193.0   193.0   8       8       8       0       8       6

For better visibility,

< 53.52387836802893
< 53.52387836802893
---
> 53.52387836802894
> 53.52387836802894
71,72c71,72
< 58.63247109466449
< 58.63247109466449
---
> 58.632471094664496
> 58.632471094664496
77,78c77,78
< 2864.805555555555
< 2864.805555555555
---
> 2864.805555555556
> 2864.805555555556
83,84c83,84
< 3437.7666666666664
< 3437.7666666666664
---
> 3437.7666666666673
> 3437.7666666666673

I presume they are acceptable as floating-point errors.

okumin avatar Jul 07 '24 13:07 okumin

Thanks!

okumin avatar Jul 22 '24 08:07 okumin