Performance optimization: Move all LittleEndianDataInputStream functionality into ByteBufferInputStream
I broke up https://github.com/apache/parquet-mr/pull/953 into more digestible pieces. This new PR is the lowest level set of changes. By themselves, these additions to ByteBufferInputStream don't yield much improvement, so future PRs will include modifications to other source files that take advantage of this new functionality.
The complete set of changes (including subsequent PRs) is for performance optimization. In benchmarking with Trino, we find query performance to improve from 5% to 15%, depending on the query, and that includes all the I/O time from S3.
All of LittleEndianDataInputStream functionality is moved into ByteBufferInputStream, without changing any pre-existing interfaces or functionality. These changes yield the following benefits:
- Elimination of extra layers of abstraction and method call overhead
- Enable the use of intrinsics for readInt, readLong, etc.
- Availability of faster access methods like readFully and skipFully, without the need for helper functions
This PR also marks LittleEndianDataInputStream as deprecated.
Context: I've been working on improving Parquet reading performance in Trino, mostly by profiling while running performance benchmarks and TPCDS queries. This PR is a subset of the changes I made that have more than doubled the performance of a lot of TPCDS queries (wall clock time, including the S3 access time). If you are kind enough to accept these changes, I look forward to offering further performance improvements.
I added some tests for this new code.
Is the '5% to 15%' gain from this change or along with other changes? If it is later, can you share the point to other changes? Like to see the overall changes before committing.
That improvement comes from a larget set of changes. I have a design doc that goes over all those changes plus some more that make it possible to get even more performance improvements.
https://docs.google.com/document/d/1fBGpF_LgtfaeHnPD5CFEIpA2Ga_lTITmFdFIcO9Af-g/edit?usp=sharing
@sunchao Can you have a review?
Is this mostly a refactoring PR? I also don't see LittleEndianDataInputStream being marked as deprecated.
Is this mostly a refactoring PR? I also don't see
LittleEndianDataInputStreambeing marked as deprecated.
I initially marked LittleEndianDataInputStream as deprecated. But I seem to recall that I was advised by Ryan Blue to do that in a later PR.
@theosib-amazon Thanks again for your contribution! I see the comments are generally around duplicating code, refactoring, and making code maintainable. If you have a measurement of improvement on this change alone, it would help the reviewers