[SUPPORT] Poor Upsert Performance on COW table due to indexing
Hello, I am having performance issues when attempting to upsert data into a Hudi COW table. With the specs below, it takes longer than 4 hours to finish upserting (if it ever does finish). In the screenshots below, you can see that most of the time is spent in the index scan. I have tried disabling hoodie.bloom.index.prune.by.ranges because our record key is random. I've also tried upserting with the SIMPLE index type and did not see any performance improvement. Is there anything else I can do to improve performance?
Specs:
- Table size : 13.6TB (compressed in S3)
- Number of partitions : 1135 (hoodie.datasource.hive_sync.partition_fields=year,month)
- Upsert dataset size : 68 million records, 6GB compressed
- Index type : Default (Bloom)
- Number of nodes : 30
- Node type : r6g.8xlarge
- Average record size : ~40 bytes (file size / num records: 10MB / 250,000 records)
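For reference, here is roughly how the configurations I tried are set on the writer (a sketch; the option names are real Hudi configs, but the table name and path are placeholders):

```python
# Hudi writer options tried so far (names are real Hudi configs;
# "my_cow_table" is a placeholder for illustration).
hudi_options = {
    "hoodie.table.name": "my_cow_table",              # placeholder
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    # Random record keys make min/max range pruning useless, so disable it:
    "hoodie.bloom.index.prune.by.ranges": "false",
    # Alternative attempt: join-based index instead of Bloom filters.
    "hoodie.index.type": "SIMPLE",
}

# Applied via the Spark datasource writer, e.g.:
# df.write.format("hudi").options(**hudi_options).mode("append").save(base_path)
```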
Environment Description
- Hudi version : 0.9.0
- Spark version : 2.4.8
- EMR version: 5.34.0
- Hive version : 2.38.0
- Hadoop version : Amazon 2.10.1
- Storage (HDFS/S3/GCS..) : S3
I'd recommend upgrading to the latest version of Hudi and using the bucket index. Alternatively, on 0.9.0 you can try the HBase index and the MOR table format.
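A rough sketch of both suggestions (the config names are real Hudi options; the bucket count, ZooKeeper hosts, and index table name are illustrative placeholders you'd need to tune for your setup):

```python
# (a) Bucket index -- requires Hudi 0.11+. Avoids index lookups entirely
# by hashing the record key to a fixed bucket (file group) per partition.
bucket_index_options = {
    "hoodie.index.type": "BUCKET",
    "hoodie.index.bucket.engine": "SIMPLE",
    "hoodie.bucket.index.num.buckets": "256",  # placeholder; size to per-partition volume
}

# (b) HBase index -- available in 0.9.0. Stores key -> file-group mappings
# in an external HBase table, so no data files are scanned at lookup time.
hbase_index_options = {
    "hoodie.index.type": "HBASE",
    "hoodie.index.hbase.zkquorum": "zk1,zk2,zk3",  # placeholder ZK hosts
    "hoodie.index.hbase.zkport": "2181",
    "hoodie.index.hbase.table": "hudi_index",      # placeholder HBase table
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
}
```

Note that changing the bucket count later requires rewriting the table, so it's worth sizing it up front.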
@scxwhite can you point me at some documentation on implementing bucket or hbase indexes?
You can see how to use these indexes in the official documentation. If you want to know more about the bucket index, take a look at this document.
You can enable clustering to increase file sizes. By default, files are sized at 120MB, but you can batch small files into larger ones (e.g., 500MB), so the number of files to be looked up during the index scan is reduced.
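A minimal sketch of inline clustering with a ~500MB target file size (config names are real Hudi options; the commit interval and thresholds are illustrative and should be tuned):

```python
# Inline clustering: periodically rewrite small files into larger ones
# so fewer files need to be scanned during the Bloom index lookup.
clustering_options = {
    "hoodie.clustering.inline": "true",
    # Illustrative cadence: run a clustering plan every 4 commits.
    "hoodie.clustering.inline.max.commits": "4",
    # Files smaller than this are candidates for rewriting (here: 120MB).
    "hoodie.clustering.plan.strategy.small.file.limit": str(120 * 1024 * 1024),
    # Target size for rewritten files: ~500MB instead of the 120MB default.
    "hoodie.clustering.plan.strategy.target.file.max.bytes": str(500 * 1024 * 1024),
}
```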
Hi, you can try https://hudi.apache.org/blog/2023/11/01/record-level-index/#metadata-table. This stores the record keys in the metadata table. But I am not sure whether this index can be applied to COW tables.
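If you go this route, enabling the record-level index from the linked post looks roughly like this (requires upgrading to Hudi 0.14.0+; the option names are real configs, shown here as a sketch):

```python
# Record-level index: per-key -> file-group mapping stored in the
# metadata table, avoiding Bloom filter scans of data files.
rli_options = {
    "hoodie.metadata.enable": "true",               # the index lives in the metadata table
    "hoodie.metadata.record.index.enable": "true",  # build the record-key index
    "hoodie.index.type": "RECORD_INDEX",            # use it during upserts
}
```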