
[SUPPORT] RLI index slowing down

Open manishgaurav84 opened this issue 1 year ago • 5 comments


Describe the problem you faced

A MongoDB table is synced to S3 by an AWS DMS CDC pipeline running as a Glue job. After a few runs, the job execution time increases by about 50%.

Table stats:

  1. Number of records at Initial Run --> 530 M
  2. Avg Number of records at Incremental Runs --> 5M inserts, 20K updates, 0 deletes
  3. Hudi jars used: hudi-spark3.3-bundle_2.12-0.14.0.jar, hudi-aws-0.14.0.jar, httpclient-4.5.14.jar, spark-avro_2.12-3.5.0.jar

To Reproduce

Steps to reproduce the behavior:

Hudi table configuration:

    'hoodie.table.name': 'appsflyerevents',
    'hoodie.datasource.write.precombine.field': 'upsert_ts',
    'hoodie.datasource.write.recordkey.field': 'oid__id',
    'hoodie.datasource.write.table.type': 'COPY_ON_WRITE',
    'hoodie.datasource.hive_sync.enable': 'true',
    'hoodie.datasource.hive_sync.table': 'appsflyerevents',
    'hoodie.datasource.hive_sync.database': 'origin',
    'hoodie.datasource.hive_sync.mode': 'hms',
    'hoodie.datasource.write.hive_style_partitioning': 'true',
    'hoodie.datasource.hive_sync.partition_fields': 'creation_month',
    'hoodie.datasource.hive_sync.partition_extractor_class': 'org.apache.hudi.hive.MultiPartKeysValueExtractor',
    'hoodie.datasource.write.partitionpath.field': 'creation_month',
    'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.SimpleKeyGenerator',
    'hoodie.datasource.write.operation': 'upsert',
    'hoodie.cleaner.policy': 'KEEP_LATEST_FILE_VERSIONS',
    'hoodie.cleaner.fileversions.retained': 1,
    'hoodie.upsert.shuffle.parallelism': 152,
    'hoodie.index.type': 'RECORD_INDEX',
    'hoodie.metadata.record.index.enable': 'true',
    'hoodie.metadata.record.index.growth.factor': 10,
    'hoodie.metadata.record.index.max.filegroup.count': 20000,
    'hoodie.metadata.record.index.min.filegroup.count': 1000,
    'hoodie.metadata.record.index.max.filegroup.size': 536870912,
    'hoodie.metadata.enable': 'true',
    'hoodie.parquet.small.file.limit': -1,
    'hoodie.metadata.clean.async': 'true',
    'hoodie.metadata.keep.min.commits': '4',
    'hoodie.metadata.keep.max.commits': '5',
    'hoodie.datasource.meta.sync.glue.metadata_file_listing': 'true'
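For reference, a minimal sketch of how options like these are typically passed to a Hudi write from a PySpark/Glue job. The DataFrame `df`, the SparkSession, and the S3 target path are assumptions not shown in the issue, and only a subset of the options above is repeated here:

```python
# Hypothetical sketch (subset of the reported options); not the poster's actual job.
hudi_options = {
    "hoodie.table.name": "appsflyerevents",
    "hoodie.datasource.write.recordkey.field": "oid__id",
    "hoodie.datasource.write.precombine.field": "upsert_ts",
    "hoodie.datasource.write.partitionpath.field": "creation_month",
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.index.type": "RECORD_INDEX",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.record.index.enable": "true",
}

# The write itself would look roughly like this (commented out since it
# needs a running SparkSession and a real S3 path, both assumptions):
# (df.write.format("hudi")
#    .options(**hudi_options)
#    .mode("append")
#    .save("s3://your-bucket/origin/appsflyerevents"))  # path is hypothetical
```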

Expected behavior

The execution time should remain consistent and is not expected to increase significantly.

Environment Description

  • Hudi version : 0.14

  • Spark version : 3.3

  • Hive version : NA

  • Hadoop version : NA

  • Storage (HDFS/S3/GCS..) : S3

  • Running on Docker? (yes/no) : No

Additional context

Please find the Spark UI files attached.


manishgaurav84 avatar May 16 '24 05:05 manishgaurav84

Spark UI files : Uploading DOC-20240516-WA0005.zip…

manishgaurav84 avatar May 16 '24 05:05 manishgaurav84

@manishgaurav84 Not sure why I couldn't download the event logs. Can you ping me on Slack and share them there as well?

ad1happy2go avatar May 17 '24 10:05 ad1happy2go

@ad1happy2go I have shared the logs via Slack message.

manishgaurav84 avatar May 17 '24 12:05 manishgaurav84

Have you tried the async way, i.e. building the index with HoodieIndexer?


spark-submit \
    --class org.apache.hudi.utilities.HoodieIndexer \
    --properties-file spark-config.properties \
    --packages 'org.apache.hudi:hudi-spark3.4-bundle_2.12:0.14.0,org.apache.hadoop:hadoop-aws:3.3.2' \
    --master 'local[*]' \
    --executor-memory 1g \
    /Users/soumilshah/IdeaProjects/SparkProject/DeltaStreamer/jar/hudi-utilities-slim-bundle_2.12-0.14.0.jar \
    --mode scheduleAndExecute \
    --base-path 's3a://huditest/hudidb/table_name=bronze_orders' \
    --table-name bronze_orders \
    --index-types RECORD_INDEX \
    --hoodie-conf "hoodie.metadata.enable=true" \
    --hoodie-conf "hoodie.metadata.record.index.enable=true" \
    --hoodie-conf "hoodie.metadata.index.async=true" \
    --hoodie-conf "hoodie.write.concurrency.mode=optimistic_concurrency_control" \
    --hoodie-conf "hoodie.write.lock.provider=org.apache.hudi.client.transaction.lock.InProcessLockProvider" \
    --parallelism 2 \
    --spark-memory 2g

soumilshah1995 avatar May 28 '24 16:05 soumilshah1995

Why do we need to set hoodie.upsert.shuffle.parallelism?

From 0.13.0 onwards, Hudi by default automatically uses the parallelism deduced by Spark based on the source data. If the shuffle parallelism is explicitly configured, the user-configured value is used instead. Can you please rerun after removing hoodie.upsert.shuffle.parallelism?
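In other words, the fix is simply to drop that key from the write options so Hudi falls back to Spark's deduced parallelism. A minimal sketch, assuming an options dict like the one posted above:

```python
# Hypothetical options dict; the key names match the issue's config.
options_explicit = {
    "hoodie.datasource.write.operation": "upsert",
    "hoodie.upsert.shuffle.parallelism": 152,  # overrides auto-deduction in 0.13.0+
}

# Omit the key entirely and let Hudi/Spark deduce parallelism from the data.
options_auto = {
    k: v for k, v in options_explicit.items()
    if k != "hoodie.upsert.shuffle.parallelism"
}
```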

bibhu107 avatar Jun 19 '24 18:06 bibhu107