HIVE-29012: NPE in ExplainTask when protologging posthook is enabled
What changes were proposed in this pull request?
Check HIVE-29012 for the stacktrace.
Why are the changes needed?
When protologging hook is enabled, it stores the query plan for a query as well. The conf object in ExplainTask is null for the "Hive Hook Proto Log Writer" thread.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Will see CI output + local single node setup/cluster testing
Based on the debugging, the Task#initialize() method creates conf object in HiveServer2-Handler-Pool thread and when the post hook for protologging is triggered, it created a new thread (Hive Hook Proto Log Writer) which get the query plan and at that time, ExplainTask#outputPlan() has conf as null which causes NPE.
https://github.com/apache/hive/blob/af59f241a07fb9960412529be71c6bfa1fcd042e/ql/src/java/org/apache/hadoop/hive/ql/exec/ExplainTask.java#L1006
@zabetak, can you please help with the review as the changes are around HIVE-22173's code.
Just info: HIVE-27240, was very similar to this issue based on stacktrace, which shows the need for UT :-)
Hi @zabetak, apologies for late reply, just wanted to let you know that I'm working on this PR and found some new issues like HIVE-29053 along the way when coming up with UT.
Writing UT for this PR is not straightforward and need your opinion on this. There are 2 things to note:
- In case of NPE, its just log the error and continue as intended, which means using
TestHiveHookEventProtoPartialBuilderorTestHiveProtoLoggingHookwon't cut it IMO (tried with mockito as well, didn't work). - Writes/flush of proto data are not consistent based on ScheduledThreadPoolExecutor https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/hooks/HiveProtoLoggingHook.java#L277, one way to force flush is to stop the HS2.
What I/m thinking is, writing a itest/hive-unit integration test as below code. Because of [2], I'm not able to make qfile work as the proto data is written post test completition.
CREATE TABLE tbl1 (id INT, name STRING);
INSERT INTO tbl1 VALUES (1, 'Alice'), (2, 'Bob');
SELECT * FROM tbl1;
msck repair table sys.proto_hive_query_data;
select count(*) from sys.proto_hive_query_data where lower(trim(regexp_replace(get_json_object(otherinfo[1].value, '$.queryText'), '\n|\\\s+', ' '))) = "select * from tbl1";
Because of NPE the QUERY Object will not be part of proto data so without fix the output should be 0, with fix it should be 1.
The end to end approach (2.) seems a bit of an overkill. Even if the NPE is caught and logged you still have some missing entries (e.g., queryPlan) in the otherInfo map. You could probably base your test on constructing a simple ExplainWork object and reason about missing entries in the info map. Would that work as a unit test?
The end to end approach (2.) seems a bit of an overkill. Even if the NPE is caught and logged you still have some missing entries (e.g.,
queryPlan) in theotherInfomap. You could probably base your test on constructing a simpleExplainWorkobject and reason about missing entries in the info map. Would that work as a unit test?
Thanks for the suggestions/help @zabetak. I have come up with a test case highlighting the problem observed in local cluster. Can you please review and confirm if its up to the standards?
Test failure org.apache.hadoop.hive.metastore.security.TestZookeeperTokenStoreSSLEnabled, Seems irrelevant to change.
Quality Gate passed
Issues
0 New issues
0 Accepted issues
Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code
@zabetak , I have addressed the review comments. can you please take a look?
Thanks for the PR @Aggarwal-Raghav !