[SUPPORT] Poor Upsert Performance on COW table due to indexing
Hello, I am having performance issues when attempting to upsert data into a Hudi COW table. With the specs below, it takes longer than 4 hours to finish upserting (if it ever does finish). In the screenshots below, you can see that most of the time is spent in the index scan. I have tried disabling hoodie.bloom.index.prune.by.ranges because our record key is random. I've also tried upserting with the SIMPLE index type and did not see any performance improvement. Is there anything else I can do to improve performance?
Specs:
- Table size : 13.6TB (compressed in S3)
- Number of partitions : 1135 (hoodie.datasource.hive_sync.partition_fields=year,month)
- Upsert dataset size : 68 million records, 6GB compressed
- Index type : Default (Bloom)
- Number of nodes : 30
- Node type : r6g.8xlarge
- Average record size : ~40 bytes (file size / num records: 10MB / 250,000 records)
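For reference, here is roughly how the configurations I tried are set on the writer (a sketch; the option names are real Hudi configs, but the table name and path are placeholders):

```python
# Hudi writer options tried so far (names are real Hudi configs;
# "my_cow_table" is a placeholder for illustration).
hudi_options = {
    "hoodie.table.name": "my_cow_table",              # placeholder
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    # Random record keys make min/max range pruning useless, so disable it:
    "hoodie.bloom.index.prune.by.ranges": "false",
    # Alternative attempt: join-based index instead of Bloom filters.
    "hoodie.index.type": "SIMPLE",
}

# Applied via the Spark datasource writer, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```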
Environment Description
- Hudi version : 0.9.0
- Spark version : 2.4.8
- EMR version: 5.34.0
- Hive version : 2.38.0
- Hadoop version : Amazon 2.10.1
- Storage (HDFS/S3/GCS..) : S3
I'd recommend upgrading to the latest version of Hudi and using the bucket index. Alternatively, on 0.9.0 you can try the HBase index and the MOR table format.
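A rough sketch of both suggestions (the config names are real Hudi options; the bucket count, ZooKeeper hosts, and index table name are illustrative placeholders you'd need to tune for your setup):

```python
# (a) Bucket index -- requires Hudi 0.11+. Avoids index lookups entirely
# by hashing the record key to a fixed bucket (file group) per partition.
bucket_index_options = {
    "hoodie.index.type": "BUCKET",
    "hoodie.index.bucket.engine": "SIMPLE",
    "hoodie.bucket.index.num.buckets": "256",  # placeholder; size to per-partition volume
}

# (b) HBase index -- available in 0.9.0. Stores key -> file-group mappings
# in an external HBase table, so no data files are scanned at lookup time.
hbase_index_options = {
    "hoodie.index.type": "HBASE",
    "hoodie.index.hbase.zkquorum": "zk1,zk2,zk3",  # placeholder ZK hosts
    "hoodie.index.hbase.zkport": "2181",
    "hoodie.index.hbase.table": "hudi_index",      # placeholder HBase table
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}
```

Note that changing the bucket count later requires rewriting the table, so it's worth sizing it up front.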
@scxwhite can you point me at some documentation on implementing bucket or hbase indexes?
You can see how to use these indexes in the official documentation. If you want to know more about the bucket index, take a look at this document.
You can enable clustering to increase file sizes. By default, files are sized at 120MB, but you can batch small files into larger ones (e.g., 500MB), so the number of files to be looked up during the index scan is reduced.
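A minimal sketch of inline clustering with a ~500MB target file size (config names are real Hudi options; the commit interval and thresholds are illustrative and should be tuned):

```python
# Inline clustering: periodically rewrite small files into larger ones
# so fewer files need to be scanned during the Bloom index lookup.
clustering_options = {
    "hoodie.clustering.inline": "true",
    # Illustrative cadence: run a clustering plan every 4 commits.
    "hoodie.clustering.inline.max.commits": "4",
    # Files smaller than this are candidates for rewriting (here: 120MB).
    "hoodie.clustering.plan.strategy.small.file.limit": str(120 * 1024 * 1024),
    # Target size for rewritten files: ~500MB instead of the 120MB default.
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(500 * 1024 * 1024),
}
```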
Hi, you can try https://hudi.apache.org/blog/2023/11/01/record-level-index/#metadata-table. This stores the record keys in the metadata table. But I am not sure whether this index can be applied to COW tables.
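If you go this route, enabling the record-level index from the linked post looks roughly like this (requires upgrading to Hudi 0.14.0+; the option names are real configs, shown here as a sketch):

```python
# Record-level index: per-key -> file-group mapping stored in the
# metadata table, avoiding Bloom filter scans of data files.
rli_options = {
    "hoodie.metadata.enable": "true",               # the index lives in the metadata table
    "hoodie.metadata.record.index.enable": "true",  # build the record-key index
    "hoodie.index.type": "RECORD_INDEX",            # use it during upserts
}
```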