
PARQUET-2196: Support LZ4_RAW codec

Open wgtmac opened this issue 3 years ago • 7 comments

This PR implements the LZ4_RAW codec which was introduced by parquet format v2.9.0. Since there is a lot of common logic between the LZ4_RAW and SNAPPY codecs, this patch moves it into NonBlockedCompressor and NonBlockedDecompressor and makes each specific codec extend them.

Added a TestLz4RawCodec test to make sure the new codec itself is correct.
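The shared-base-class refactoring described above can be sketched as follows. This is a minimal, hypothetical illustration of the pattern (common buffering logic in an abstract base, codec-specific work in a subclass); the class and method names here are illustrative and do not reproduce the actual parquet-java API. A trivial identity "codec" stands in for LZ4_RAW/SNAPPY.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative analogue of NonBlockedCompressor: the base class owns the
// generic driver logic, subclasses supply only the codec-specific hooks.
abstract class NonBlockedCompressorSketch {
    // Codec-specific: worst-case output size for a given input size.
    protected abstract int maxCompressedLength(int uncompressedSize);

    // Codec-specific: compress 'in' into 'out', return bytes written.
    protected abstract int compress(ByteBuffer in, ByteBuffer out);

    // Shared driver: size the output buffer once, then delegate.
    byte[] compressAll(byte[] input) {
        ByteBuffer out = ByteBuffer.allocate(maxCompressedLength(input.length));
        int written = compress(ByteBuffer.wrap(input), out);
        return Arrays.copyOf(out.array(), written);
    }
}

// Trivial "identity" codec, just to exercise the pattern.
class IdentityCodec extends NonBlockedCompressorSketch {
    protected int maxCompressedLength(int n) { return n; }
    protected int compress(ByteBuffer in, ByteBuffer out) {
        int n = in.remaining();
        out.put(in);
        return n;
    }
}

public class Demo {
    public static void main(String[] args) {
        byte[] data = "hello parquet".getBytes();
        byte[] roundTrip = new IdentityCodec().compressAll(data);
        System.out.println(Arrays.equals(data, roundTrip)); // prints "true"
    }
}
```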

wgtmac avatar Sep 27 '22 09:09 wgtmac

@pitrou @shangxinli Can you please take a look? Thanks in advance!

wgtmac avatar Sep 27 '22 09:09 wgtmac

cc @lidavidm @emkornfield : could you give this a look?

pitrou avatar Sep 27 '22 10:09 pitrou

@wgtmac Did you try to read an actual file produced by Parquet C++?

Note you can find such files in https://github.com/apache/parquet-testing/

Yes, I have tried that. I will add some parquet files for compatibility test as well.

wgtmac avatar Sep 27 '22 12:09 wgtmac

Thanks Gang for contributing! Are there any benchmark numbers? Any comparison with ZSTD? These are non-blocking questions for review and merging.

shangxinli avatar Sep 27 '22 14:09 shangxinli

Nice implementation! For the tests, can you add more coverage for interop with lz4?

shangxinli avatar Sep 27 '22 19:09 shangxinli

@shangxinli Thanks for the review.

I haven't been able to run the benchmark, since the Hadoop Lz4Codec is implemented via JNI and I cannot simply run it on my Mac M1 laptop.

Generally this is the same LZ4 algorithm as the one shipped by Hadoop. The key difference is that LZ4_RAW removes the redundant block header introduced by the Hadoop Lz4Codec, and its implementation is pure Java.

I will add the interop test in a follow-up commit.

wgtmac avatar Sep 28 '22 16:09 wgtmac
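The framing difference described above can be sketched roughly as follows. As an assumption for illustration: Hadoop's block-compression stream prefixes each compressed chunk with two 4-byte big-endian integers (original length, then compressed length), whereas LZ4_RAW stores only the raw LZ4 block, since Parquet's page header already carries the sizes. The payload below is a stand-in byte array, not real LZ4-compressed data.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class FramingDemo {
    // Hadoop-style framing: [origLen (4B)][compLen (4B)][payload].
    static byte[] hadoopFrame(byte[] payload, int originalLength) {
        return ByteBuffer.allocate(8 + payload.length)
                .putInt(originalLength)   // uncompressed size
                .putInt(payload.length)   // compressed chunk size
                .put(payload)
                .array();
    }

    // LZ4_RAW-style: the payload only; sizes come from the Parquet page header.
    static byte[] stripHadoopFrame(byte[] framed) {
        ByteBuffer buf = ByteBuffer.wrap(framed);
        buf.getInt();                 // original length (redundant in Parquet)
        int compLen = buf.getInt();   // compressed length (also redundant)
        byte[] raw = new byte[compLen];
        buf.get(raw);
        return raw;
    }

    public static void main(String[] args) {
        byte[] payload = {1, 2, 3, 4};
        byte[] framed = hadoopFrame(payload, 10);
        System.out.println(framed.length);                                    // prints "12"
        System.out.println(Arrays.equals(payload, stripHadoopFrame(framed))); // prints "true"
    }
}
```

The 8 bytes per block are what the comment above calls the "redundant block header": dropping them makes the stored bytes a plain LZ4 block that any LZ4 implementation can decode directly.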

The interop test has been added. Please take a look again. Thanks! @shangxinli @pitrou

wgtmac avatar Sep 29 '22 15:09 wgtmac

Looks good. The only concern is that we checked in binary files directly, which would be hard to maintain in the future. Can you generate the parquet file using ParquetWriter?

shangxinli avatar Oct 17 '22 17:10 shangxinli

> Looks good. The only concern is that we checked in binary files directly, which would be hard to maintain in the future. Can you generate the parquet file using ParquetWriter?

@shangxinli The parquet file was created by the Parquet C++ writer from Apache Arrow, which aims to test interoperability. I can generate the parquet file using the Java writer, but then it would no longer meet that requirement.

wgtmac avatar Oct 18 '22 01:10 wgtmac

Hm... any opinion on this @ggershinsky ?

shangxinli avatar Oct 18 '22 02:10 shangxinli

FWIW, Arrow/Parquet C++ checks out https://github.com/apache/parquet-testing when running parquet tests (instead of checking binary files into the main repo).

As an aside, I think we do need better interop testing between parquet implementations. It wouldn't be trivial, but it would be really nice to have compatibility tests.

emkornfield avatar Oct 18 '22 05:10 emkornfield

Yep, we test encryption interop using binary files in the parquet-testing repo. @wgtmac Please have a look at this code: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestEncryptionOptions.java#L130

ggershinsky avatar Oct 18 '22 07:10 ggershinsky

> Yep, we test encryption interop using binary files in the parquet-testing repo. @wgtmac Please have a look at this code: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestEncryptionOptions.java#L130

@ggershinsky @emkornfield Yes, I did that in the previous commit: https://github.com/apache/parquet-mr/pull/1000/commits/6e02cb20eb580e80a793b02d187237770868b335#diff-61de8043597197e5da4b42ebfe2d123b9c42e56fcb520e293c1a10ba55fa93cc but switched to a local file to avoid an unstable test.

@shangxinli Should I keep the resource file approach or switch to the download approach?

wgtmac avatar Oct 18 '22 09:10 wgtmac

It would probably be easier to use a git submodule (I wonder why that's not the approach being used in parquet-mr), but in any case reference files should go into the parquet-testing repo, not be privately-maintained in parquet-mr.

pitrou avatar Oct 18 '22 09:10 pitrou

> Yep, we test encryption interop using binary files in the parquet-testing repo. @wgtmac Please have a look at this code: https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/test/java/org/apache/parquet/hadoop/TestEncryptionOptions.java#L130
>
> @ggershinsky @emkornfield Yes, I did that in the previous commit: 6e02cb2#diff-61de8043597197e5da4b42ebfe2d123b9c42e56fcb520e293c1a10ba55fa93cc but switched to a local file to avoid an unstable test.

What was the source of the instability? (Encryption interop has been running with files from the parquet-testing module for ~3 years now; no problems as far as I remember.)

ggershinsky avatar Oct 19 '22 07:10 ggershinsky

I have changed the interop test approach to follow the encryption tests: it now downloads test files from the parquet-testing repo and verifies the data decompressed and decoded from them.

In addition, the codec test has been made generic so that it covers both LZ4_RAW and SNAPPY. It should be easy to add more codec types.

Please take a look when you have time and let me know if there is any feedback, thanks! @pitrou @emkornfield @ggershinsky @shangxinli

wgtmac avatar Oct 24 '22 04:10 wgtmac
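A codec-generic round-trip check of the kind described above can be sketched as follows. This is a hedged illustration, not the actual test from the PR: `java.util.zip`'s Deflater/Inflater stand in for the real Parquet codecs (LZ4_RAW, SNAPPY), which live behind parquet-java's codec factory, and the method names here are invented for the example.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class RoundTripDemo {
    // Stand-in "codec": compress with java.util.zip's Deflater.
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        // Worst case for tiny inputs is larger than the input itself.
        byte[] buf = new byte[input.length * 2 + 64];
        int n = deflater.deflate(buf);
        deflater.end();
        return Arrays.copyOf(buf, n);
    }

    // Like LZ4_RAW in Parquet, the caller supplies the uncompressed size
    // (in Parquet it comes from the page header, not the byte stream).
    static byte[] decompress(byte[] compressed, int originalLength)
            throws DataFormatException {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        byte[] out = new byte[originalLength];
        inflater.inflate(out);
        inflater.end();
        return out;
    }

    // Generic round-trip check: works for any compress/decompress pair.
    static boolean roundTrips(byte[] data) throws DataFormatException {
        return Arrays.equals(data, decompress(compress(data), data.length));
    }

    public static void main(String[] args) throws DataFormatException {
        byte[] data = "repeat repeat repeat repeat".getBytes();
        System.out.println(roundTrips(data)); // prints "true"
    }
}
```

The point of making the check generic is that adding a new codec only requires plugging in a new compress/decompress pair; the round-trip assertion itself stays unchanged.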

@lwhite1 Could you perhaps review this? I can't really comment on the details of the Java code, myself.

pitrou avatar Oct 25 '22 09:10 pitrou

LGTM, let's see if there are comments from others, otherwise we can merge.

shangxinli avatar Nov 02 '22 14:11 shangxinli