
[SUPPORT] HUDI ParquetDecodingException caused by gzip stream CRC failure

Open ligou525 opened this issue 8 months ago • 4 comments


I am experiencing an issue related to Parquet gzip decoding. The table is in COW format, and the issue occurs after hundreds of commits; it is reproducible.

Compression codec configuration: `"hoodie.parquet.compression.codec": "GZIP"`.
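
For reference, here is a minimal sketch of how this codec option can be passed when writing a COW table through the Spark datasource. The reporter actually writes through the Hudi Java client, and the table name, key fields, data, and path below are illustrative placeholders, not values from the issue; only the codec setting is taken from the report.

```python
from pyspark.sql import SparkSession

# Minimal sketch, assuming the matching hudi-spark bundle jar is on the Spark
# classpath. Everything except the codec option is a placeholder.
spark = SparkSession.builder.appName("hudi-gzip-sketch").getOrCreate()
df = spark.createDataFrame([(1, "a", 1000)], ["id", "name", "ts"])  # placeholder data

hudi_options = {
    "hoodie.table.name": "my_table",                         # placeholder
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.recordkey.field": "id",         # placeholder
    "hoodie.datasource.write.precombine.field": "ts",        # placeholder
    "hoodie.parquet.compression.codec": "GZIP",              # setting from the issue
}

(df.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("hdfs://namenode/tmp/my_table"))                   # placeholder path
```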

Environment Description

  • Hudi version : 0.14.1

  • Spark version : 3.5.1

  • Hive version : 3.1.3

  • Hadoop version : 3.2.4

  • Storage (HDFS/S3/GCS..) : hdfs

  • Running on Docker? (yes/no) : no

The result of reading the Parquet data with pyarrow:

```python
import pyarrow.parquet as pq

# Specify the path to your Parquet file
parquet_file = 'your_file.parquet'

# Read the metadata
metadata = pq.read_metadata(parquet_file)

# Print overall metadata
print(metadata)

# Iterate through the row groups and print compression information
for i in range(metadata.num_row_groups):
    row_group_metadata = metadata.row_group(i)
    print(f'Row Group {i}:')
    for j in range(row_group_metadata.num_columns):
        column_metadata = row_group_metadata.column(j)
        print(f'  Column {j}:')
        print(f'    Compression: {column_metadata.compression}')

# Attempt a full read (this is where the decoding error surfaces)
pq.read_table(parquet_file)
```

(Screenshots attached: pyarrow output showing the read failure.)

Stacktrace

```
	at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.handleUpsertPartition(BaseJavaCommitActionExecutor.java:248)
	at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.handleInsertPartition(BaseJavaCommitActionExecutor.java:254)
	at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.lambda$execute$0(BaseJavaCommitActionExecutor.java:121)
	at java.base/java.util.LinkedHashMap.forEach(LinkedHashMap.java:986)
	at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.execute(BaseJavaCommitActionExecutor.java:117)
	at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.execute(BaseJavaCommitActionExecutor.java:69)
	at org.apache.hudi.table.action.commit.BaseWriteHelper.write(BaseWriteHelper.java:63)
	at org.apache.hudi.table.action.commit.JavaInsertCommitActionExecutor.execute(JavaInsertCommitActionExecutor.java:46)
	at org.apache.hudi.table.HoodieJavaCopyOnWriteTable.insert(HoodieJavaCopyOnWriteTable.java:109)
	at org.apache.hudi.table.HoodieJavaCopyOnWriteTable.insert(HoodieJavaCopyOnWriteTable.java:85)
	at org.apache.hudi.client.HoodieJavaWriteClient.insert(HoodieJavaWriteClient.java:137)
	... 9 common frames omitted
Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file 
	at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:149)
	at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.handleUpdateInternal(BaseJavaCommitActionExecutor.java:277)
	at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.handleUpdate(BaseJavaCommitActionExecutor.java:268)
	at org.apache.hudi.table.action.commit.BaseJavaCommitActionExecutor.handleUpsertPartition(BaseJavaCommitActionExecutor.java:241)
	... 21 common frames omitted
Caused by: org.apache.hudi.exception.HoodieException: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file 
	at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:75)
	at org.apache.hudi.table.action.commit.HoodieMergeHelper.runMerge(HoodieMergeHelper.java:147)
	... 24 common frames omitted
Caused by: org.apache.hudi.exception.HoodieException: unable to read next record from parquet file 
	at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:54)
	at org.apache.hudi.common.util.collection.MappingIterator.hasNext(MappingIterator.java:39)
	at org.apache.hudi.common.util.queue.SimpleExecutor.execute(SimpleExecutor.java:67)
	... 25 common frames omitted
Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1224817 in block 14 in file hdfs://xxx/769177a1-b678-4ae7-99cf-6ffa5702f7be-0_0-0-0_20250526012739454.parquet
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:251)
	at org.apache.parquet.hadoop.ParquetReader.read(ParquetReader.java:132)
	at org.apache.hudi.common.util.ParquetReaderIterator.hasNext(ParquetReaderIterator.java:49)
	... 27 common frames omitted
Caused by: org.apache.parquet.io.ParquetDecodingException: could not read page Page [bytes.size=960924, valueCount=78373, uncompressedSize=960924] in col [md5_text] optional binary md5_text (UTF8)
	at org.apache.parquet.column.impl.ColumnReaderImpl.readPageV1(ColumnReaderImpl.java:599)
	at org.apache.parquet.column.impl.ColumnReaderImpl.access$300(ColumnReaderImpl.java:57)
	at org.apache.parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:536)
	at org.apache.parquet.column.impl.ColumnReaderImpl$3.visit(ColumnReaderImpl.java:533)
	at org.apache.parquet.column.page.DataPageV1.accept(DataPageV1.java:95)
	at org.apache.parquet.column.impl.ColumnReaderImpl.readPage(ColumnReaderImpl.java:533)
	at org.apache.parquet.column.impl.ColumnReaderImpl.checkRead(ColumnReaderImpl.java:525)
	at org.apache.parquet.column.impl.ColumnReaderImpl.consume(ColumnReaderImpl.java:638)
	at org.apache.parquet.column.impl.ColumnReaderImpl.<init>(ColumnReaderImpl.java:353)
	at org.apache.parquet.column.impl.ColumnReadStoreImpl.newMemColumnReader(ColumnReadStoreImpl.java:80)
	at org.apache.parquet.column.impl.ColumnReadStoreImpl.getColumnReader(ColumnReadStoreImpl.java:75)
	at org.apache.parquet.io.RecordReaderImplementation.<init>(RecordReaderImplementation.java:271)
	at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:147)
	at org.apache.parquet.io.MessageColumnIO$1.visit(MessageColumnIO.java:109)
	at org.apache.parquet.filter2.compat.FilterCompat$NoOpFilter.accept(FilterCompat.java:165)
	at org.apache.parquet.io.MessageColumnIO.getRecordReader(MessageColumnIO.java:109)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.checkRead(InternalParquetRecordReader.java:137)
	at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:222)
	... 29 common frames omitted
Caused by: java.io.IOException: gzip stream CRC failure
	at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.executeTrailerState(BuiltInGzipDecompressor.java:371)
	at org.apache.hadoop.io.compress.zlib.BuiltInGzipDecompressor.decompress(BuiltInGzipDecompressor.java:227)
	at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:111)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:105)
	at java.base/java.io.DataInputStream.readFully(DataInputStream.java:208)
	at java.base/java.io.DataInputStream.readFully(DataInputStream.java:179)
	at org.apache.parquet.bytes.BytesInput$StreamBytesInput.toByteArray(BytesInput.java:263)
	at org.apache.parquet.bytes.BytesInput.toByteBuffer(BytesInput.java:214)
	at org.apache.parquet.bytes.BytesInput.toInputStream(BytesInput.java:223)
	at org.apache.parquet.column.impl.ColumnReaderImpl.readPageV1(ColumnReaderImpl.java:592)
	... 46 common frames omitted
```

ligou525 commented on May 26, 2025

@ligou525 Is the parquet file corrupted? Can other tools read this file correctly, e.g., parquet-tools?

cshuo commented on May 27, 2025

@cshuo Thanks for your response! As you can see in my first post, pyarrow cannot read the parquet file either. I guess the md5_text column may be conflicting with the gzip compression rules, causing the CRC failures. Should I try changing the Parquet compression codec?

ligou525 commented on May 29, 2025

Hi @ligou525

I don't believe gzip compression is causing this issue; it's more likely that data corruption has occurred. Could you please share a reproducible code sample so we can replicate the issue?

rangareddy commented on May 30, 2025

@rangareddy The program uses Debezium to read the CDC data and then writes to Hudi via the Java client. The write logic is simple: the insert/upsert/delete records are committed separately. The task has been running normally since changing the Parquet compression codec to Snappy (hoodie.parquet.compression.codec=snappy). So is there any bug with gzip?
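
In configuration terms, the workaround described here comes down to a one-line change relative to the illustrative sketch earlier in the thread (again, not the reporter's actual code):

```python
# Workaround reported above: only the codec value changes; every other write
# option stays the same as in the earlier (illustrative) sketch.
hudi_options = {
    # ... same options as before ...
    "hoodie.parquet.compression.codec": "snappy",   # was "GZIP"
}
```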

ligou525 commented on Jun 17, 2025

Hi @ligou525

It's great to know that Snappy compression solved your problem. Since I couldn't find any specific JIRA tickets about Gzip-related data loss, I'd suggest that if the application fails again, we check the logs to understand the root cause of the data write issue.

Let me know if there's anything else I can do to help.

rangareddy commented on Aug 26, 2025