HIVE-28441: NPE in ORC tables when hive.orc.splits.include.file.footer is enabled
What changes were proposed in this pull request?
See HIVE-28441 for the steps to reproduce this issue and the stack trace.
Why are the changes needed?
A NullPointerException is thrown when reading ORC tables with hive.orc.splits.include.file.footer enabled.
Does this PR introduce any user-facing change?
NO
Is the change a dependency upgrade?
NO
How was this patch tested?
Using a q file included in the commits:
mvn clean test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=orc_footer_enabled.q -pl itests/qtest -Pitests -Dtest.output.overwrite=true
mvn clean test -Dtest=TestMiniTezCliDriver -Dqfile=orc_footer_enabled.q -pl itests/qtest -Pitests -Dtest.output.overwrite=true
As per my understanding:
- One of the benefits of enabling hive.orc.splits.include.file.footer is fewer fs calls, as explained in HIVE-15038. In the ORC code, extractFileTail https://github.com/apache/orc/blob/7878691befc66ecc372ff41715cbdff97ec7aafd/java/core/src/java/org/apache/orc/impl/ReaderImpl.java#L569 makes an fs call to create the OrcTail. With the config enabled, that call was optimized away and the OrcTail object was created in OrcSplit.java instead https://github.com/apache/hive/blob/d0d5d6d7d11b3eece0d0bc17b429cb30dec5dc79/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSplit.java#L230
- With HIVE-15665, when hive.orc.splits.include.file.footer is enabled, the OrcTail is required to have serializedTail present (passing null or an empty BufferChunk won't help, as it will throw an NPE) https://github.com/apache/hive/blob/d0d5d6d7d11b3eece0d0bc17b429cb30dec5dc79/llap-server/src/java/org/apache/hadoop/hive/llap/io/encoded/OrcEncodedDataReader.java#L669
- A possible fix is to "somehow" get the serializedTail while creating the OrcTail in OrcSplit.java, without making an additional fs call. Otherwise we need to revert HIVE-15038, which forces the orcReader in OrcEncodedDataReader.java to perform extractFileTail, and that result does have the serializedTail.
- I have gone with reverting HIVE-15038. Looking forward to suggestions on this.
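To make the trade-off above concrete, here is a minimal, self-contained Java sketch of the failure mode and of the fallback. It is only an illustration under assumptions: `FooterFallbackSketch`, `Tail`, `readTailFromFile`, `cacheableSize` and `resolveTail` are hypothetical names, not Hive or ORC classes, and the "fs call" is simulated with a counter.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch; none of these types are Hive or ORC classes. */
class FooterFallbackSketch {

    /** Stand-in for a file tail: parsed metadata plus, optionally, the raw serialized bytes. */
    static final class Tail {
        final long numberOfRows;
        final ByteBuffer serializedTail; // may be null when rebuilt from split metadata only

        Tail(long numberOfRows, ByteBuffer serializedTail) {
            this.numberOfRows = numberOfRows;
            this.serializedTail = serializedTail;
        }
    }

    /** Counts simulated filesystem reads so the cost of the fallback is visible. */
    static final AtomicInteger fsCalls = new AtomicInteger();

    /** Simulates reading the tail from the file, which always yields the serialized bytes. */
    static Tail readTailFromFile(long numberOfRows) {
        fsCalls.incrementAndGet();
        ByteBuffer raw = ByteBuffer.allocate(Long.BYTES).putLong(numberOfRows);
        raw.flip();
        return new Tail(numberOfRows, raw);
    }

    /** A consumer that needs the serialized form, for example to cache it. */
    static int cacheableSize(Tail tail) {
        return tail.serializedTail.remaining(); // NPE if serializedTail is null
    }

    /** Use the split-provided tail only if it is complete; otherwise pay for one read. */
    static Tail resolveTail(Tail fromSplit, long numberOfRows) {
        if (fromSplit != null && fromSplit.serializedTail != null) {
            return fromSplit;
        }
        return readTailFromFile(numberOfRows);
    }

    public static void main(String[] args) {
        Tail incomplete = new Tail(100L, null);
        Tail usable = resolveTail(incomplete, 100L);
        System.out.println(cacheableSize(usable) + " bytes, fs calls: " + fsCalls.get());
    }
}
```

The fallback branch corresponds to the extra fs call that the config was meant to avoid; the first branch is the case where the split already carries everything the reader needs.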
@pgaref, can you please provide your insights on this?
Quality Gate passed
Issues: 0 new issues, 0 accepted issues
Measures: 0 security hotspots, 0.0% coverage on new code, 0.0% duplication on new code
@Aggarwal-Raghav, is there still some benefit of hive.orc.splits.include.file.footer without HIVE-15038?
@deniskuzZ, thanks for looking into this. I think in Tez on Yarn, we can still prevent an additional fs call with this config.
@zhangbutao / @deniskuzZ, can you please suggest a next step that would help here?
@Aggarwal-Raghav I did some code debugging and found that the Tez Application Master has already initialized the OrcTail when creating the ORC splits. Can we pass the OrcTail from the Tez AM to the Tez task? If so, that would solve this issue.
Here is some related code; maybe we can debug further to explore a better way to fix the issue:
- Tez AM related code: creates the ORC split with the OrcTail
https://github.com/apache/hive/blob/13dfae1c0a7d4540f4bc5edc50bc922f0cfc83e8/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1672-L1673
https://github.com/apache/hive/blob/13dfae1c0a7d4540f4bc5edc50bc922f0cfc83e8/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcInputFormat.java#L1497-L1498
- Tez task related code: gets the ORC split (but it cannot get the OrcTail at the moment; we need to think about how to make it available here)
https://github.com/apache/hive/blob/13dfae1c0a7d4540f4bc5edc50bc922f0cfc83e8/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L216-L223
https://github.com/apache/hive/blob/13dfae1c0a7d4540f4bc5edc50bc922f0cfc83e8/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSplit.java#L204-L205
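A rough sketch of the idea of carrying the tail from the AM side to the task side: the split serializes the raw tail bytes along with its other fields, and the receiving side keeps those raw bytes rather than only a parsed form. This is an assumption-laden illustration, not Hive's OrcSplit: `TailCarryingSplitSketch` and its methods are hypothetical, and only `java.io` and `java.nio` are used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;

/** Hypothetical sketch; this is not Hive's OrcSplit. */
class TailCarryingSplitSketch {
    private String path = "";
    private ByteBuffer serializedTail; // null when no footer was attached

    TailCarryingSplitSketch() {}

    TailCarryingSplitSketch(String path, ByteBuffer serializedTail) {
        this.path = path;
        this.serializedTail = serializedTail;
    }

    /** Sender side: write a presence flag, then the length-prefixed raw tail bytes. */
    void write(DataOutput out) throws IOException {
        out.writeUTF(path);
        boolean hasTail = serializedTail != null;
        out.writeBoolean(hasTail);
        if (hasTail) {
            ByteBuffer copy = serializedTail.duplicate(); // leave the caller's position untouched
            byte[] bytes = new byte[copy.remaining()];
            copy.get(bytes);
            out.writeInt(bytes.length);
            out.write(bytes);
        }
    }

    /** Receiver side: keep the raw bytes so a consumer that needs them finds them. */
    void readFields(DataInput in) throws IOException {
        path = in.readUTF();
        serializedTail = null;
        if (in.readBoolean()) {
            byte[] bytes = new byte[in.readInt()];
            in.readFully(bytes);
            serializedTail = ByteBuffer.wrap(bytes);
        }
    }

    String getPath() {
        return path;
    }

    ByteBuffer getSerializedTail() {
        return serializedTail == null ? null : serializedTail.duplicate();
    }

    /** Simulates the AM-to-task hop with an in-memory byte array. */
    static TailCarryingSplitSketch roundTrip(TailCarryingSplitSketch split) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            split.write(new DataOutputStream(buffer));
            TailCarryingSplitSketch copy = new TailCarryingSplitSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ByteBuffer tail = ByteBuffer.wrap(new byte[] {1, 2, 3, 4});
        TailCarryingSplitSketch received = roundTrip(new TailCarryingSplitSketch("/tmp/example.orc", tail));
        System.out.println(received.getPath() + " carries " + received.getSerializedTail().remaining() + " tail bytes");
    }
}
```

Whether the real split can hand those bytes to the reader in the form OrcEncodedDataReader expects is exactly the open question in this thread; the sketch only shows that the bytes themselves can survive the hop without another fs call.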
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Feel free to reach out on the dev@hive.apache.org list if the patch is in need of reviews.