HIVE-28302: Let SUM UDF return NULL when all rows have non-numeric texts
What changes were proposed in this pull request?
Make SUM return not 0.0 but NULL when both of the following conditions are satisfied.
- The input type is a STRING like type
- All values are non-numeric and it is impossible to successfully cast to DOUBLE
https://issues.apache.org/jira/browse/HIVE-28302
Why are the changes needed?
I understand the SQL standard requires a UDAF to return NULL when all rows are NULL. Actually, other UDAFs such as AVG behave like that.
We can see some more discussions in https://github.com/apache/hive/pull/5091
Does this PR introduce any user-facing change?
Yes. But I believe the current behavior is not an intentional one but a kind of bug.
Is the change a dependency upgrade?
No.
How was this patch tested?
Updated integration tests.
In the test cbo_aggregate_reduce_functions_rule.q aggregate functions under test (sum, count, stddev*, etc) are wrapped into a round function call. I understand the reason however I think it can cause some noise: if an expression like
ROUND(SUM(c_numeric), 3)
returns invalid result it can have multiple reasons: sum is wrong or round is wrong or both.
In theory if both are wrong as a side effect the expression itself can return a good result in some cases hence a potential bug in both function implementations remains hidden.
WDYT?
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
@kasakrisz I don't have any objections. The original intention is human-readability and floating-point error mitigation. I tried to remove ROUND.
To be fair, we observe difference between CBO and non-CBO.
< +228.0 228.0 0.0 0.0 193.0 193.0 28.5 28.5 NULL NULL 32.166666666666664 32.166666666666664 47.34448225506326 47.34448225506326 NULL NULL 53.52387836802893 53.52387836802893 50.61338050075578 50.61338050075578 NULL NULL 58.63247109466449 58.63247109466449 2241.5 2241.5 NULL NULL 2864.805555555555 2864.805555555555 2561.714285714286 2561.714285714286 NULL NULL 3437.7666666666664 3437.7666666666664 228.0 228.0 NULL NULL 193.0 193.0 8 8 8 0 8 6
---
> +228.0 228.0 0.0 0.0 193.0 193.0 28.5 28.5 NULL NULL 32.166666666666664 32.166666666666664 47.34448225506326 47.34448225506326 NULL NULL 53.52387836802894 53.52387836802894 50.61338050075578 50.61338050075578 NULL NULL 58.632471094664496 58.632471094664496 2241.5 2241.5 NULL NULL 2864.805555555556 2864.805555555556 2561.714285714286 2561.714285714286 NULL NULL 3437.7666666666673 3437.7666666666673 228.0 228.0 NULL NULL 193.0 193.0 8 8 8 0 8 6
For better visibility,
< 53.52387836802893
< 53.52387836802893
---
> 53.52387836802894
> 53.52387836802894
71,72c71,72
< 58.63247109466449
< 58.63247109466449
---
> 58.632471094664496
> 58.632471094664496
77,78c77,78
< 2864.805555555555
< 2864.805555555555
---
> 2864.805555555556
> 2864.805555555556
83,84c83,84
< 3437.7666666666664
< 3437.7666666666664
---
> 3437.7666666666673
> 3437.7666666666673
I presume they are acceptable as floating-point errors.
Thanks!