[SUPPORT] Hudi ParquetDecodingException caused by gzip stream CRC failure
I am experiencing an issue with Parquet gzip decoding. The table is in COW format, and the problem appears after hundreds of commits; it is reproducible.
Compression codec configuration: "hoodie.parquet.compression.codec": "GZIP".
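For reference, here is a minimal sketch of where this property plugs in. My actual pipeline uses the Hudi Java write client, so the PySpark form below is only illustrative; the table name, key fields, and paths are placeholders.

```python
from pyspark.sql import SparkSession

# Illustrative only: a PySpark upsert that sets the same codec property.
# Table name, key fields, and paths are placeholders, not the real pipeline.
spark = SparkSession.builder.appName("hudi-gzip-example").getOrCreate()
df = spark.createDataFrame([(1, "9e107d9d372bb6826bd81d3542a419d6", 1700000000)],
                           ["id", "md5_text", "ts"])

hudi_options = {
    "hoodie.table.name": "example_table",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.parquet.compression.codec": "GZIP",  # the setting in question
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("hdfs://xxx/hudi/example_table"))
```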
Environment Description

- Hudi version : 0.14.1
- Spark version : 3.5.1
- Hive version : 3.1.3
- Hadoop version : 3.2.4
- Storage (HDFS/S3/GCS..) : hdfs
- Running on Docker? (yes/no) : no
Reading the Parquet file with pyarrow fails in the same way. This is the script used to inspect and read the file:

```python
import pyarrow.parquet as pq

# Specify the path to your Parquet file
parquet_file = 'your_file.parquet'

# Read the metadata
metadata = pq.read_metadata(parquet_file)

# Print overall metadata
print(metadata)

# Iterate through the row groups and print compression information
for i in range(metadata.num_row_groups):
    row_group_metadata = metadata.row_group(i)
    print(f'Row Group {i}:')
    for j in range(row_group_metadata.num_columns):
        column_metadata = row_group_metadata.column(j)
        print(f'  Column {j}:')
        print(f'    Compression: {column_metadata.compression}')

# Reading the full table fails with the same decoding error
pq.read_table(parquet_file)
```
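To narrow down where the read fails, the same file can also be read one row group at a time for the suspect column (md5_text, named in the stacktrace below). This is just an isolation sketch; the local file path is a placeholder for a copy of the base file.

```python
import pyarrow.parquet as pq

# Placeholder path to the base file copied out of HDFS.
pf = pq.ParquetFile("769177a1-b678-4ae7-99cf-6ffa5702f7be-0_0-0-0_20250526012739454.parquet")

# Read each row group of the failing column individually; only the corrupted
# row group(s) should raise, which pinpoints the bad page(s).
for i in range(pf.metadata.num_row_groups):
    try:
        pf.read_row_group(i, columns=["md5_text"])
        print(f"row group {i}: OK")
    except Exception as exc:
        print(f"row group {i}: FAILED -> {exc}")
```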
Stacktrace

```
at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.handleUpsertPartition(BaseJavaCommitActionExecutor.java:248)
at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.handleInsertPartition(BaseJavaCommitActionExecutor.java:254)
at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.lambda$execute$0(BaseJavaCommitActionExecutor.java:121)
at java.base/java.util.LinkedHashMap.forEach(LinkedHashMap.java:986)
at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.execute(BaseJavaCommitActionExecutor.java:117)
at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.execute(BaseJavaCommitActionExecutor.java:69)
at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:63)
at org.apache.hudi.table.action.commit.JavaInsertCommitActionExecutor.execute(JavaInsertCommitActionExecutor.java:46)
at org.apache.hudi.table.HoodieJavaCopyOnWriteTable.insert(HoodieJavaCopyOnWriteTable.java:109)
at org.apache.hudi.table.HoodieJavaCopyOnWriteTable.insert(HoodieJavaCopyOnWriteTable.java:85)
at org.apache.hudi.client.HoodieJavaWriteClient.insert(HoodieJavaWriteClient.java:137)
... 9 common frames omitted
Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file
at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.handleUpdateInternal(BaseJavaCommitActionExecutor.java:277)
at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.handleUpdate(BaseJavaCommitActionExecutor.java:268)
at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.handleUpsertPartition(BaseJavaCommitActionExecutor.java:241)
... 21 common frames omitted
Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file
at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:75)
at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:147)
... 24 common frames omitted
Caused by: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file
at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:54)
at org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39)
at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:67)
... 25 common frames omitted
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1224817 in block 14 in file hdfs://xxx/769177a1-b678-4ae7-99cf-6ffa5702f7be-0_0-0-0_20250526012739454.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
... 27 common frames omitted
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read page Page [bytes.size=960924, valueCount=78373, uncompressedSize=960924] in col [md5_text] optional binary md5_text (UTF8)
at org.apache.parquet.column.impl.ColumnReaderImpl.readPageV1(ColumnReaderImpl.java:599)
at org.apache.parquet.column.impl.ColumnReaderImpl.access$300(ColumnReaderImpl.java:57)
at org.apache.parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:536)
at org.apache.parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:533)
at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:95)
at org.apache.parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:533)
at org.apache.parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:525)
at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:638)
at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:353)
at org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:80)
at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:75)
at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
... 29 common frames omitted
Caused by: java.io.IOException: gzip stream CRC failure
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.executeTrailerState(BuiltInGzipDecompressor.java:371)
at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.decompress(BuiltInGzipDecompressor.java:227)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:208)
at java.base/java.io.DataInputStream.readFully(DataInputStream.java:179)
at org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:263)
at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:214)
at org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:223)
at org.apache.parquet.column.impl.ColumnReaderImpl.readPageV1(ColumnReaderImpl.java:592)
... 46 common frames omitted
```
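In plain terms, "gzip stream CRC failure" means the checksum stored in the gzip trailer of the compressed page does not match the checksum of the bytes that were just decompressed, i.e. the page bytes on disk differ from what was originally written. A standalone illustration with Python's gzip module (not the Hadoop BuiltInGzipDecompressor, but the same trailer check):

```python
import gzip

payload = b"9e107d9d372bb6826bd81d3542a419d6" * 32   # arbitrary md5-like text
blob = bytearray(gzip.compress(payload))

# The gzip trailer is 4 bytes of CRC32 followed by 4 bytes of length;
# flipping one bit in the stored CRC simulates on-disk corruption.
blob[-8] ^= 0x01

try:
    gzip.decompress(bytes(blob))
except OSError as exc:   # gzip.BadGzipFile on Python 3.8+
    print(exc)           # -> "CRC check failed ..."
```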
@ligou525 Is the parquet file corrupted? Can other tools read this file correctly, e.g. parquet-tools?
@cshuo Thanks for your response! As you can see in my first post, pyarrow cannot read the parquet file either. I suspect the md5_text column may somehow conflict with the gzip compression, causing the CRC failures. Should I try changing the parquet compression codec?
Hi @ligou525
I don't believe gzip compression is causing this issue; it's more likely that data corruption has occurred. Could you please share a reproducible code sample so we can replicate the issue?
@rangareddy The program uses Debezium to read the CDC data and then writes it to Hudi through the Java client. The write logic is simple: insert/upsert/delete records are committed separately. The task has been running normally since changing the parquet compression codec to snappy: hoodie.parquet.compression.codec=snappy. So is there a bug with gzip?
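In case it helps anyone hitting the same thing, the codec change can be confirmed on the newly written base files with the same pyarrow inspection as above; the file path here is a placeholder.

```python
import pyarrow.parquet as pq

# Placeholder path to a base file written after switching the codec.
meta = pq.read_metadata("new_file_slice.parquet")
codecs = {meta.row_group(i).column(j).compression
          for i in range(meta.num_row_groups)
          for j in range(meta.row_group(i).num_columns)}
print(codecs)  # expected {'SNAPPY'} once hoodie.parquet.compression.codec=snappy takes effect
```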
Hi @ligou525
It's great to hear that Snappy compression solved your problem. I couldn't find any specific JIRA tickets about gzip-related data corruption, so if the application fails again, I'd suggest checking the logs to understand the root cause of the write issue.
Let me know if there's anything else I can do to help.