
[SUPPORT] RLI index slowing down

Open manishgaurav84 opened this issue 1 year ago • 5 comments


Describe the problem you faced

A MongoDB table is synced to S3 by an AWS DMS CDC pipeline running as a Glue job. After a few runs, the job execution time increases by about 50%.

Table stats:

  1. Number of records at Initial Run --> 530 M
  2. Avg Number of records at Incremental Runs --> 5M inserts, 20K updates, 0 deletes
  3. Hudi jars used: hudi-spark3.3-bundle_2.12-0.14.0.jar, hudi-aws-0.14.0.jar, httpclient-4.5.14.jar, spark-avro_2.12-3.5.0.jar

To Reproduce

Steps to reproduce the behavior:

Hudi table configuration:

    'hoodie.table.name': 'appsflyerevents',
    'hoodie.datasource.write.precombine.field': 'upsert_ts',
    'hoodie.datasource.write.recordkey.field': 'oid__id',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'appsflyerevents',
    'hoodie.datasource.hive_sync.database': 'origin',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_month',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.partitionpath.field': 'creation_month',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
    'hoodie.cleaner.fileversions.retained': 1,
    'hoodie.upsert.shuffle.parallelism': 152,
    'hoodie.index.type': 'RECORD_INDEX',
    'hoodie.metadata.record.index.enable': 'true',
    'hoodie.metadata.record.index.growth.factor': 10,
    'hoodie.metadata.record.index.max.filegroup.count': 20000,
    'hoodie.metadata.record.index.min.filegroup.count': 1000,
    'hoodie.metadata.record.index.max.filegroup.size': 536870912,
    'hoodie.metadata.enable': 'true',
    'hoodie.parquet.small.file.limit': -1,
    'hoodie.metadata.clean.async': 'true',
    'hoodie.metadata.keep.min.commits': '4',
    'hoodie.metadata.keep.max.commits': '5',
    'hoodie.datasource.meta.sync.glue.metadata_file_listing': 'true'
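For reference, a minimal sketch of how options like these are typically passed to a Hudi write from a PySpark/Glue job. The DataFrame `df`, the SparkSession, and the S3 target path are assumptions not shown in the issue, and only a subset of the options above is repeated here:

```python
# Hypothetical sketch (subset of the reported options); not the poster's actual job.
hudi_options = {
    "hoodie.table.name": "appsflyerevents",
    "hoodie.datasource.write.recordkey.field": "oid__id",
    "hoodie.datasource.write.precombine.field": "upsert_ts",
    "hoodie.datasource.write.partitionpath.field": "creation_month",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.record.index.enable": "true",
}

# The write itself would look roughly like this (commented out since it
# needs a running SparkSession and a real S3 path, both assumptions):
# (df.write.format("hudi")
#    .options(**hudi_options)
#    .mode("append")
#    .save("s3://your-bucket/origin/appsflyerevents"))  # path is hypothetical
```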

Expected behavior

The execution time should remain consistent and is not expected to increase significantly.

Environment Description

  • Hudi version : 0.14

  • Spark version : 3.3

  • Hive version : NA

  • Hadoop version : NA

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : No

Additional context

Please find the Spark UI files attached.


manishgaurav84 avatar May 16 '24 05:05 manishgaurav84

Spark UI files : Uploading DOC-20240516-WA0005.zip…

manishgaurav84 avatar May 16 '24 05:05 manishgaurav84

@manishgaurav84 Not sure why I couldn't download the event logs. Can you ping me on Slack and share them there as well?

ad1happy2go avatar May 17 '24 10:05 ad1happy2go

@ad1happy2go I have shared the logs via Slack message.

manishgaurav84 avatar May 17 '24 12:05 manishgaurav84

Have you tried the async way, i.e. building the index with HoodieIndexer?


spark-submit \
    --class org.apache.hudi.utilities.HoodieIndexer \
    --properties-file spark-config.properties \
    --packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0,org.apache.hadoop:hadoop-aws:3.3.2' \
    --master 'local[*]' \
    --executor-memory 1g \
    /Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
    --mode scheduleAndExecute \
    --base-path 's3a://huditest/hudidb/table_name=bronze_orders' \
    --table-name bronze_orders \
    --index-types RECORD_INDEX \
    --hoodie-conf "hoodie.metadata.enable=true" \
    --hoodie-conf "hoodie.metadata.record.index.enable=true" \
    --hoodie-conf "hoodie.metadata.index.async=true" \
    --hoodie-conf "hoodie.write.concurrency.mode=optimistic_concurrency_control" \
    --hoodie-conf "hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider" \
    --parallelism 2 \
    --spark-memory 2g

soumilshah1995 avatar May 28 '24 16:05 soumilshah1995

Why do we need to set hoodie.upsert.shuffle.parallelism?

From 0.13.0 onwards, Hudi by default automatically uses the parallelism deduced by Spark based on the source data. If the shuffle parallelism is explicitly configured, the user-configured value is used instead. Can you please rerun after removing hoodie.upsert.shuffle.parallelism?
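In other words, the fix is simply to drop that key from the write options so Hudi falls back to Spark's deduced parallelism. A minimal sketch, assuming an options dict like the one posted above:

```python
# Hypothetical options dict; the key names match the issue's config.
options_explicit = {
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.upsert.shuffle.parallelism": 152,  # overrides auto-deduction in 0.13.0+
}

# Omit the key entirely and let Hudi/Spark deduce parallelism from the data.
options_auto = {
    k: v for k, v in options_explicit.items()
    if k != "hoodie.upsert.shuffle.parallelism"
}
```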

bibhu107 avatar Jun 19 '24 18:06 bibhu107