HIVE-25351: stddev(), stddev_pop() with CBO enable returning null
What changes were proposed in this pull request?
when cbo is enable
STDDEV_SAMP(x) → SQRT( CASE WHEN ((SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) < 0) THEN 0 THEN (SUM(x * x) - SUM(x) * SUM(x) / COUNT(x)) / CASE COUNT(x) WHEN 1 THEN NULL ELSE COUNT(x) - 1 END)
Why are the changes needed?
When the column type is double or decimal then SUM(x * x) and SUM(x) * SUM(x) / COUNT(x) values difference is coming negative value (very very small difference in fractional value) which is leading to NaN when applied SQRT(For more detailed info, see https://github.com/apache/hive/pull/4812)
How was this patch tested?
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
No data about Coverage
No data about Duplication
Hi, @vineetgarg02, could you please review my merge request? I would appreciate your feedback on the change. thank you!
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.
hi @yangjiandan, it seems that this PR went under the radar. Could you please rebase
STDDEV_SAMP(x)
hi @yangjiandan, do you plan to continue with the PR?
Hi, @deniskuzZ sorry for the late response! Yes, I do plan to continue working on this PR and will update soon. Thanks for your patient!
@deniskuzZ current master branch may have problem. when I run the following cmd under master branch with any code changing, it throws exception. my command:
mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=stddev_variance_with_double_decimal_test.q -Dtest.output.overwrite=true
Exception:
[ERROR] Failures:
[ERROR] TestMiniLlapLocalCliDriver.testCliDriver:62 Client execution failed with error code = 40000
running
SELECT account_id,
STDDEV(interest_paid) AS sdev
FROM (
SELECT 10 AS account_id, CAST(23.79 AS DOUBLE) interest_paid
UNION ALL SELECT 10, 23.79
UNION ALL SELECT 10, 23.79
UNION ALL SELECT 11, 64.34
UNION ALL SELECT 11, 64.34
UNION ALL SELECT 11, 64.34
) tempDataset
GROUP BY account_id
fname=stddev_variance_with_double_decimal_test.q
See ./ql/target/tmp/log/hive.log or ./itests/qtest/target/tmp/log/hive.log, or check ./ql/target/surefire-reports or ./itests/qtest/target/surefire-reports/ for specific test cases logs.
org.apache.hadoop.hive.ql.parse.SemanticException: Line 0:-1 Not yet supported place for UDAF 'sum'
at org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.validateUDF(TypeCheckProcFactory.java:821)
at org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.getXpathOrFuncExprNodeDesc(TypeCheckProcFactory.java:968)
at org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory$DefaultExprProcessor.process(TypeCheckProcFactory.java:1481)
at org.apache.hadoop.hive.ql.lib.CostLessRuleDispatcher.dispatch(CostLessRuleDispatcher.java:66)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatchAndReturn(DefaultGraphWalker.java:105)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.dispatch(DefaultGraphWalker.java:89)
at org.apache.hadoop.hive.ql.lib.ExpressionWalker.walk(ExpressionWalker.java:101)
at org.apache.hadoop.hive.ql.lib.DefaultGraphWalker.startWalking(DefaultGraphWalker.java:120)
at org.apache.hadoop.hive.ql.parse.type.TypeCheckProcFactory.genExprNode(TypeCheckProcFactory.java:231)
at org.apache.hadoop.hive.ql.parse.type.ExprNodeTypeCheck.genExprNode(ExprNodeTypeCheck.java:49)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genAllExprNodeDesc(SemanticAnalyzer.java:13609)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:13564)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:13532)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genExprNodeDesc(SemanticAnalyzer.java:13526)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapGroupByOperator(SemanticAnalyzer.java:5863)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genGroupByPlanMapAggrNoSkew(SemanticAnalyzer.java:6805)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:11485)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12441)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:12307)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.genOPTree(CalcitePlanner.java:653)
at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.analyzeInternal(SemanticAnalyzer.java:13180)
at org.apache.hadoop.hive.ql.parse.CalcitePlanner.analyzeInternal(CalcitePlanner.java:479)
at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:336)
at org.apache.hadoop.hive.ql.Compiler.analyze(Compiler.java:224)
at org.apache.hadoop.hive.ql.Compiler.compile(Compiler.java:109)
at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:498)
at org.apache.hadoop.hive.ql.Driver.compileInternal(Driver.java:450)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:414)
at org.apache.hadoop.hive.ql.Driver.compileAndRespond(Driver.java:408)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.compileAndRespond(ReExecDriver.java:126)
at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:234)
at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:257)
at org.apache.hadoop.hive.cli.CliDriver.processCmd1(CliDriver.java:201)
at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:127)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:425)
at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:356)
at org.apache.hadoop.hive.ql.QTestUtil.executeClientInternal(QTestUtil.java:746)
at org.apache.hadoop.hive.ql.QTestUtil.executeClient(QTestUtil.java:716)
at org.apache.hadoop.hive.cli.control.CoreCliDriver.runTest(CoreCliDriver.java:116)
at org.apache.hadoop.hive.cli.control.CliAdapter.runTest(CliAdapter.java:157)
at org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver(TestMiniLlapLocalCliDriver.java:62)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.apache.hadoop.hive.cli.control.CliAdapter$2$1.evaluate(CliAdapter.java:135)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.junit.runners.Suite.runChild(Suite.java:128)
at org.junit.runners.Suite.runChild(Suite.java:27)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.apache.hadoop.hive.cli.control.CliAdapter$1$1.evaluate(CliAdapter.java:95)
at org.junit.rules.RunRules.evaluate(RunRules.java:20)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:316)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:240)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:214)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:155)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:385)
at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:507)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:495)
[INFO]
[ERROR] Tests run: 1, Failures: 1, Errors: 0, Skipped: 0
hi @yangjiandan, sorry I missed the notification. I think the issue is related to your local changes (-1 Not yet supported place for UDAF 'sum') since master is green.
Could you please check under debug if we generate a valid SQL for STDDEV?
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
Is this PR ready to be merged in branch-4.1&master branch?
Is this PR ready to be merged in branch-4.1&master branch?
@zhangbutao, first we need to merge it in master, and since I refactored the PR, we need another +1. Similar change was already merged into Calcite: https://github.com/apache/calcite/pull/4448
hi @kasakrisz, could you please +1?
https://github.com/apache/calcite/pull/4448 is already merged. Once it is released, we could try replacing the whole HiveAggregateReduceFunctionsRule
hi @kasakrisz, could you please +1? apache/calcite#4448 is already merged. Once it is released, we could try replacing the whole
HiveAggregateReduceFunctionsRule
Currently we have vectorized implementation of the Sum function only but not the Sum0 function.
https://github.com/apache/hive/blob/master/ql/src/gen/vectorization/UDAFTemplates/VectorUDAFSum.txt
VectorUDAFSum0 is required to use AggregateReduceFunctionsRule shipped with Calcite
https://github.com/apache/calcite/blob/e536d3949674cbbc0fd1162b3fccb211ca75789b/core/src/main/java/org/apache/calcite/rel/rules/AggregateReduceFunctionsRule.java#L289-L292
In HiveAggregateReduceFunctionsRule we actually do the opposite: convert sum0 to sum in order to exploit vectorization
https://github.com/apache/hive/blob/1a7fbd7183554f9ace10fbf7d066677eff0447a7/ql/src/java/org/apache/hadoop/hive/ql/optimizer/calcite/rules/HiveAggregateReduceFunctionsRule.java#L213-L215