HIVE-28503: Wrong results(NULL) when string concat operation with || operator for ORC file format when vectorization enabled
What changes were proposed in this pull request?
Set the output vector's noNulls flag correctly when all input values are NOT NULL. The change is in the StringGroupConcatColCol.evaluate() method:
if (inV1.noNulls && !inV2.noNulls) {
  // if any one input has NULL, then output should be NULL.
  outV.noNulls = false; // setting this flag false as all values in this are NULLs
  --- code ---
} else if (!inV1.noNulls && inV2.noNulls) {
  // if any one input has NULL, then output should be NULL.
  // propagate nulls
  outV.noNulls = false; // setting this flag false as all values in this are NULLs
} else if (!inV1.noNulls && !inV2.noNulls) {
  // if two inputs are NULL, then output should be NULL.
  // propagate nulls
  outV.noNulls = false; // setting this flag false as all values in this are NULLs
  --- code ---
} else {
  // there are no nulls in either input vector
  outV.noNulls = true; // this has to be set true, as there are no NULL values, this check is missed currently.
  // perform data operation
  --- code ---
}
Why are the changes needed?
During a concat() operation in the StringGroupConcatColCol class, if the input batch vectors contain a mix of NULL and NOT NULL values, the NULL-related flags of the output vector are not set correctly. Each value in a vector has its own flag indicating whether it is NULL or NOT NULL, but the flag for the output vector as a whole (outV.noNulls) is not set correctly.
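The following is a minimal, self-contained sketch of how a stale vector-wide flag could produce NULL results. It does not use the Hive classes: `NullFlagSketch`, `Col`, `concat`, `of`, and `secondBatchResult` are hypothetical stand-ins, and the sketch assumes the output column is reused between batches without being reset, which this description does not establish for the real execution path.

```java
import java.util.Arrays;

/** Simplified illustration only; NOT the real Hive BytesColumnVector or StringGroupConcatColCol. */
final class NullFlagSketch {

    /** Hypothetical minimal column: per-row values, per-row null flags, one vector-wide flag. */
    static final class Col {
        final String[] vals;
        final boolean[] isNull;
        boolean noNulls = true; // starts out true, as the vector-wide flag does by default

        Col(int n) {
            vals = new String[n];
            isNull = new boolean[n];
        }

        /** A reader consults isNull[i] only when the vector-wide flag says nulls may exist. */
        String read(int i) {
            return (!noNulls && isNull[i]) ? null : vals[i];
        }
    }

    /** Builds a column from literal values; a null literal marks that row as NULL. */
    static Col of(String... v) {
        Col c = new Col(v.length);
        for (int i = 0; i < v.length; i++) {
            c.vals[i] = v[i];
            if (v[i] == null) {
                c.isNull[i] = true;
                c.noNulls = false;
            }
        }
        return c;
    }

    /**
     * Concatenates a and b into out. With fixNoNullsFlag == false the no-null branch
     * leaves out.noNulls untouched, which is the behaviour described above as the bug.
     */
    static void concat(Col a, Col b, Col out, int n, boolean fixNoNullsFlag) {
        if (a.noNulls && b.noNulls) {
            if (fixNoNullsFlag) {
                out.noNulls = true; // the proposed change
            }
            for (int i = 0; i < n; i++) {
                out.vals[i] = a.vals[i] + b.vals[i];
            }
        } else {
            out.noNulls = false; // at least one input may contain NULLs
            for (int i = 0; i < n; i++) {
                boolean rowIsNull = (!a.noNulls && a.isNull[i]) || (!b.noNulls && b.isNull[i]);
                out.isNull[i] = rowIsNull;
                out.vals[i] = rowIsNull ? null : a.vals[i] + b.vals[i];
            }
        }
    }

    /** Evaluates a batch containing a NULL, then a NULL-free batch, reusing one output column. */
    static String[] secondBatchResult(boolean fixNoNullsFlag) {
        Col out = new Col(2);
        concat(of("a", null), of("x", "y"), out, 2, fixNoNullsFlag);
        concat(of("c", "d"), of("z", "w"), out, 2, fixNoNullsFlag);
        return new String[] { out.read(0), out.read(1) };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(secondBatchResult(false))); // [cz, null]
        System.out.println(Arrays.toString(secondBatchResult(true)));  // [cz, dw]
    }
}
```

In this sketch the second batch has no NULL inputs, yet without the flag assignment the reader still sees `noNulls == false` together with the `isNull[1]` entry left over from the first batch, and so reports row 1 as NULL. An alternative under the same assumptions would be to clear `isNull[i]` for every row written instead of touching the vector-wide flag.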
Does this PR introduce any user-facing change?
No
Is the change a dependency upgrade?
No
How was this patch tested?
Existing Q files. A new test could not be added because the issue was reproducible only on a cluster with a larger number of input records. The fix was tested and verified on a cluster.
Quality Gate passed
Issues: 0 New issues, 0 Accepted issues
Measures: 0 Security Hotspots, 0.0% Coverage on New Code, 0.0% Duplication on New Code
ColumnVector initialises noNulls as true. I assume you tested this with sample data from a specific customer case.
I wonder whether the original query involves other vector expressions, and whether one of the earlier steps sets noNulls to false.
About testing that path: there is a test in ql/src/test/org/apache/hadoop/hive/ql/exec/vector/expressions/TestVectorStringExpressions.java that appears to cover that code path: testColConcatCol
// no nulls, not repeating
batch = makeStringBatch2In1Out();
batch.cols[0].noNulls = true;
batch.cols[1].noNulls = true;
expr.evaluate(batch);
outCol = (BytesColumnVector) batch.cols[2];
...
Assert.assertTrue(outCol.noNulls);
Assert.assertFalse(outCol.isRepeating);
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the [email protected] list if the patch is in need of reviews.