[SUPPORT] RLI index slowing down
Tips before filing an issue
- Have you gone through our FAQs?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced

A MongoDB table is synced to S3 by an AWS DMS CDC pipeline using a Glue job. The job execution time increases by 50% after a few runs.

Table stats:
- Number of records at initial run: 530 M
- Avg number of records per incremental run: 5 M inserts, 20 K updates, 0 deletes
- Hudi jars used: hudi-spark3.3-bundle_2.12-0.14.0.jar, hudi-aws-0.14.0.jar, httpclient-4.5.14.jar, spark-avro_2.12-3.5.0.jar
To Reproduce
Steps to reproduce the behavior:
HUDI table configuration:

'hoodie.table.name': 'appsflyerevents',
'hoodie.datasource.write.precombine.field': 'upsert_ts',
'hoodie.datasource.write.recordkey.field': 'oid__id',
'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
'hoodie.datasource.hive_sync.enable': 'true',
'hoodie.datasource.hive_sync.table': 'appsflyerevents',
'hoodie.datasource.hive_sync.database': 'origin',
'hoodie.datasource.hive_sync.mode': 'hms',
'hoodie.datasource.write.hive_style_partitioning': 'true',
'hoodie.datasource.hive_sync.partition_fields': 'creation_month',
'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
'hoodie.datasource.write.partitionpath.field': 'creation_month',
'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
'hoodie.datasource.write.operation': 'upsert',
'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
'hoodie.cleaner.fileversions.retained': 1,
'hoodie.upsert.shuffle.parallelism': 152,
'hoodie.index.type': 'RECORD_INDEX',
'hoodie.metadata.record.index.enable': 'true',
'hoodie.metadata.record.index.growth.factor': 10,
'hoodie.metadata.record.index.max.filegroup.count': 20000,
'hoodie.metadata.record.index.min.filegroup.count': 1000,
'hoodie.metadata.record.index.max.filegroup.size': 536870912,
'hoodie.metadata.enable': 'true',
'hoodie.parquet.small.file.limit': -1,
'hoodie.metadata.clean.async': 'true',
'hoodie.metadata.keep.min.commits': '4',
'hoodie.metadata.keep.max.commits': '5',
'hoodie.datasource.meta.sync.glue.metadata_file_listing': 'true'
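For context, a minimal sketch of how these options are typically passed to the DataFrame writer in a Glue/PySpark job. The `cdc_df` DataFrame and the S3 target path are hypothetical placeholders, not taken from the issue:

```python
# Assumption: cdc_df is the DataFrame holding the DMS CDC batch for this run.
# Only the key options from the reported configuration are repeated here.
hudi_options = {
    'hoodie.table.name': 'appsflyerevents',
    'hoodie.datasource.write.recordkey.field': 'oid__id',
    'hoodie.datasource.write.precombine.field': 'upsert_ts',
    'hoodie.datasource.write.partitionpath.field': 'creation_month',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.index.type': 'RECORD_INDEX',
    'hoodie.metadata.enable': 'true',
    'hoodie.metadata.record.index.enable': 'true',
}

# The actual write (commented out because it needs a live Spark session;
# the s3 path below is a hypothetical example):
# cdc_df.write.format('hudi') \
#     .options(**hudi_options) \
#     .mode('append') \
#     .save('s3://my-bucket/origin/appsflyerevents')
```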
Expected behavior
The execution time should remain consistent and is not expected to increase significantly.
Environment Description
- Hudi version : 0.14
- Spark version : 3.3
- Hive version : NA
- Hadoop version : NA
- Storage (HDFS/S3/GCS..) : S3
- Running on Docker? (yes/no) : No
Additional context
Please find the spark UI attached
Stacktrace
Spark UI files: DOC-20240516-WA0005.zip
@manishgaurav84 Not sure why I couldn't download the event logs. Can you ping me on Slack and share them there as well?
@ad1happy2go I have provided the logs in a Slack message.
Have you tried building the index asynchronously with HoodieIndexer?
spark-submit \
--class org.apache.hudi.utilities.HoodieIndexer \
--properties-file spark-config.properties \
--packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0,org.apache.hadoop:hadoop-aws:3.3.2' \
--master 'local[*]' \
--executor-memory 1g \
/Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
--mode scheduleAndExecute \
--base-path 's3a://huditest/hudidb/table_name=bronze_orders' \
--table-name bronze_orders \
--index-types RECORD_INDEX \
--hoodie-conf "hoodie.metadata.enable=true" \
--hoodie-conf "hoodie.metadata.record.index.enable=true" \
--hoodie-conf "hoodie.metadata.index.async=true" \
--hoodie-conf "hoodie.write.concurrency.mode=optimistic_concurrency_control" \
--hoodie-conf "hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider" \
--parallelism 2 \
--spark-memory 2g
Why do we need to set hoodie.upsert.shuffle.parallelism?

From 0.13.0 onwards, Hudi by default automatically uses the parallelism deduced by Spark based on the source data. If the shuffle parallelism is explicitly configured by the user, the user-configured value is used to define the actual parallelism. Can you please try a run after removing hoodie.upsert.shuffle.parallelism?
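To illustrate the suggestion, a small sketch (the helper name is mine, not a Hudi API) that strips the explicit shuffle parallelism from the write options so Hudi falls back to Spark-deduced parallelism:

```python
def without_shuffle_parallelism(options):
    """Return a copy of the Hudi write options with the explicit upsert
    shuffle parallelism removed, letting Hudi (0.13.0+) fall back to the
    parallelism Spark deduces from the source data."""
    return {k: v for k, v in options.items()
            if k != 'hoodie.upsert.shuffle.parallelism'}

# Usage with a subset of the options reported in this issue:
opts = {
    'hoodie.upsert.shuffle.parallelism': 152,
    'hoodie.index.type': 'RECORD_INDEX',
    'hoodie.datasource.write.operation': 'upsert',
}
cleaned = without_shuffle_parallelism(opts)  # parallelism key is dropped
```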