
[SUPPORT] Poor Upsert Performance on COW table due to indexing

Open jtm437 opened this issue 3 years ago • 5 comments

Hello, I am having performance issues when attempting to upsert data into a Hudi COW table. With the below specs it is taking longer than 4 hours to finish upserting (if it ever does finish). In the screenshots below, you can see that it is taking a long time doing the index scan. I have tried disabling hoodie.bloom.index.prune.by.ranges because our record key is random. I've also tried upserting using the "Simple" index type and did not see any performance improvements. Is there anything else I can do to improve the performance?

image image

Specs:

  • Table size: 13.6 TB (compressed in S3)
  • Number of partitions: 1135 (hoodie.datasource.hive_sync.partition_fields=year,month)
  • Upsert dataset size: 68 million records, 6 GB compressed
  • Index type: Default (Bloom)
  • Number of nodes: 30
  • Node type: r6g.8xlarge
  • Average record size: ~40 bytes (calculated as file size / number of records: 10 MB / 250,000 records)

Environment Description

  • Hudi version : 0.9.0
  • Spark version : 2.4.8
  • EMR version: 5.34.0
  • Hive version : 2.38.0
  • Hadoop version : Amazon 2.10.1
  • Storage (HDFS/S3/GCS..) : S3
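For reference, the two attempts described above can be written as write options. This is a minimal sketch: the function names are illustrative and not from the original report, and only the hoodie.* keys are Hudi settings. The actual job configuration was not posted, so the real job may have set more than this.

```python
# Sketch of the two index setups mentioned in the report.
# Function names are hypothetical; the hoodie.* keys are Hudi write configs.

def bloom_without_range_pruning():
    """Default Bloom index with min/max key-range pruning turned off
    (range pruning does not help when record keys are random)."""
    return {
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.index.type": "BLOOM",
        "hoodie.bloom.index.prune.by.ranges": "false",
    }


def simple_index():
    """Simple index: joins incoming keys against keys read from the table's files."""
    return {
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.index.type": "SIMPLE",
    }
```

With PySpark these dictionaries would typically be passed as `df.write.format("hudi").options(**opts)`, alongside the table name, record key, and precombine settings, which are omitted here.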

jtm437 avatar Sep 16 '22 00:09 jtm437

I recommend upgrading to the latest version of Hudi and using the bucket index. Alternatively, on version 0.9.0, try the HBase index with a MOR table.
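As a rough sketch of what these two suggestions look like as write options (the function names, the bucket count, and the ZooKeeper/HBase values are placeholders, not values from this thread; the bucket index is only available in Hudi releases newer than 0.9.0):

```python
# Sketch only: helper names and all argument values are hypothetical.
# The hoodie.* keys are Hudi write configs.

def bucket_index_options(record_key_field, num_buckets):
    """Bucket index: records are hashed on a key field into a fixed
    number of buckets per partition, so no index lookup scan is needed."""
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return {
        "hoodie.index.type": "BUCKET",
        "hoodie.bucket.index.num.buckets": str(num_buckets),
        "hoodie.bucket.index.hash.field": record_key_field,
    }


def hbase_index_options(zk_quorum, zk_port, hbase_table):
    """HBase index on a MOR table: the key-to-file mapping lives in an
    external HBase table instead of being derived from the data files."""
    return {
        "hoodie.datasource.write.table.type": "MERGE_ON_READ",
        "hoodie.index.type": "HBASE",
        "hoodie.index.hbase.zkquorum": zk_quorum,
        "hoodie.index.hbase.zkport": str(zk_port),
        "hoodie.index.hbase.table": hbase_table,
    }
```

Because a bucket layout determines which file a key belongs to, existing data written under another index would likely need to be rewritten before switching; treat that as something to verify against the Hudi documentation for your target version.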

scxwhite avatar Sep 29 '22 10:09 scxwhite

@scxwhite can you point me at some documentation on implementing bucket or hbase indexes?

jtm437 avatar Sep 29 '22 10:09 jtm437

You can see how to use these indexes in the official documentation. If you want to know more about the bucket index, take a look at this document.

scxwhite avatar Sep 30 '22 02:09 scxwhite

You can enable clustering to increase the file sizes. By default, the target file size is 120 MB, but you can try batching small files into larger ones (500 MB) so that the number of files to look up during the index lookup is reduced.
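To make the effect concrete, here is a back-of-the-envelope sketch. It assumes every file is exactly the stated size, which a real table will not satisfy, and it uses decimal units. The helper names and the clustering values (trigger every 4 commits, 100 MB small-file limit) are illustrative choices, not recommendations from this thread.

```python
# Sketch under stated assumptions: uniform file sizes, decimal units.
MB = 10**6
TABLE_BYTES = 13_600 * 10**9  # 13.6 TB, from the issue description


def estimated_file_count(table_bytes, file_bytes):
    """Ceiling division: how many files of file_bytes hold table_bytes."""
    return -(-table_bytes // file_bytes)


def inline_clustering_options(target_file_mb=500, small_file_mb=100, every_n_commits=4):
    """Inline clustering that rewrites files below small_file_mb into
    files of up to target_file_mb (binary megabytes for the byte values)."""
    return {
        "hoodie.clustering.inline": "true",
        "hoodie.clustering.inline.max.commits": str(every_n_commits),
        "hoodie.clustering.plan.strategy.small.file.limit": str(small_file_mb * 1024 * 1024),
        "hoodie.clustering.plan.strategy.target.file.max.bytes": str(target_file_mb * 1024 * 1024),
    }


files_at_10mb = estimated_file_count(TABLE_BYTES, 10 * MB)    # file size quoted in the report
files_at_120mb = estimated_file_count(TABLE_BYTES, 120 * MB)
files_at_500mb = estimated_file_count(TABLE_BYTES, 500 * MB)
```

Under these assumptions, the 10 MB files quoted in the report would mean on the order of a million files for the index to consider, compared with tens of thousands at 500 MB.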

nsivabalan avatar Oct 22 '22 23:10 nsivabalan

Hi, you can try https://hudi.apache.org/blog/2023/11/01/record-level-index/#metadata-table. It stores the record keys in the metadata table. But I am not sure whether this index can be applied to COW tables.
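A minimal sketch of the options involved, assuming a Hudi release that ships the record-level index (the linked blog post is from the 0.14 time frame, so it is not available in 0.9.0). The function name is illustrative.

```python
# Sketch only: the function name is hypothetical; the hoodie.* keys are
# the settings for the record-level index described in the linked post.

def record_level_index_options():
    """Keep a record-key-to-file mapping in the metadata table and use it
    as the write-side index instead of scanning Bloom filters."""
    return {
        "hoodie.metadata.enable": "true",
        "hoodie.metadata.record.index.enable": "true",
        "hoodie.index.type": "RECORD_INDEX",
    }
```

The record index partition has to be built for an existing table before lookups can use it, so the first write after enabling it may be slow on a table of this size.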

bibhu107 avatar May 02 '24 11:05 bibhu107