[SUPPORT] - Hudi read on a MOR table is failing with ArrayIndexOutOfBounds exception
Tips before filing an issue
- Have you gone through our FAQs?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
To Reproduce
Expected behavior
Environment Description
- Hudi version :
- Spark version :
- Hive version :
- Hadoop version :
- Storage (HDFS/S3/GCS..) :
- Running on Docker? (yes/no) :
Additional context
Stacktrace
Detailed Notes -
We have incoming delta transactions from an Oracle-based application that are pushed to an S3 endpoint using AWS DMS. These CDC records are applied as upserts onto an already existing Hudi table in a different S3 bucket (initial load data). The upserts are run with the spark-submit below -
spark-submit \
  --deploy-mode client \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.default.parallelism=500 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=3 \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.app.name=<table_1> \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hive/lib/hbase-client.jar /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type MERGE_ON_READ \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
  --source-ordering-field dms_seq_no \
  --props s3://bucket/cdc.properties \
  --hoodie-conf hoodie.datasource.hive_sync.database=glue_db \
  --target-base-path s3://bucket/table_1 \
  --target-table table_1 \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/ \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --enable-sync
This table is then consumed by a downstream job. Below are the Hudi props and the spark-submit we execute to read the table and populate the downstream.
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.assume_date_partitioning=false
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
hoodie.parquet.small.file.limit=134217728
hoodie.parquet.max.file.size=1048576000
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=1
hoodie.deltastreamer.transformer.sql=select CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted,* from <SRC>
hoodie.datasource.hive_sync.support_timestamp=true
hoodie.datasource.compaction.async.enable=true
hoodie.index.type=BLOOM
hoodie.compact.inline=true
hoodiecompactionconfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP=5
hoodie.metadata.compact.max.delta.commits=5
hoodie.clean.automatic=true
hoodie.clean.async=true
hoodie.datasource.hive_sync.table=table_1
hoodie.datasource.write.recordkey.field=table_1_ID
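For clarity, here is a minimal, hedged sketch (not part of the original report) of what the SqlQueryBasedTransformer query above does, applied to a toy DataFrame; the Op column and the _hoodie_is_deleted output come from the props, while the other column names and the Spark session setup are illustrative assumptions.

```python
# Hedged sketch of the hoodie.deltastreamer.transformer.sql query above,
# run against a toy frame. Only Op and _hoodie_is_deleted are from the thread;
# table_1_ID and name are made-up columns for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformer_sketch").getOrCreate()

src = spark.createDataFrame(
    [("I", 1, "alice"), ("U", 2, "bob"), ("D", 3, "carol")],
    ["Op", "table_1_ID", "name"],
)
src.createOrReplaceTempView("src")

# Same query as in the props, with <SRC> bound to the "src" temp view:
# rows whose DMS Op is 'D' get _hoodie_is_deleted = true, so Hudi drops them on upsert.
transformed = spark.sql(
    "SELECT CASE WHEN Op = 'D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted, * FROM src"
)
transformed.show(truncate=False)
```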
spark-submit \
  --deploy-mode client \
  --conf spark.yarn.appMasterEnv.SPARK_HOME=/prod/null \
  --conf spark.executorEnv.SPARK_HOME=/prod/null \
  --conf spark.shuffle.service.enabled=true \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  s3://pythonscripts/hudi_read.py
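The contents of s3://pythonscripts/hudi_read.py were not shared in this thread; a minimal sketch of how such a snapshot read of the MOR table is typically written in PySpark (the query type, path, and session options here are assumptions, not the actual script) would look like:

```python
# Hedged sketch only -- not the actual s3://pythonscripts/hudi_read.py,
# which was never posted in this thread.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi_read")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Snapshot query on the MOR table written by the DeltaStreamer job above;
# this merges base parquet files with any pending log files at read time.
df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load("s3://bucket/table_1")  # --target-base-path of the writer job
)

df.createOrReplaceTempView("table_1")
spark.sql("SELECT COUNT(*) FROM table_1").show()
```

The read job then fails with tasks repeatedly lost to ArrayIndexOutOfBoundsException: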
TaskSetManager: Lost task 32.2 in stage 6.0 (TID 253) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1]
22/07/21 15:50:26 INFO TaskSetManager: Starting task 32.3 in stage 6.0 (TID 296, ip-172-31-16-236.ec2.internal, executor 1, partition 32, PROCESS_LOCAL, 8887 bytes)
22/07/21 15:50:26 INFO TaskSetManager: Lost task 33.2 in stage 6.0 (TID 256) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2]
@soma1712 Could you share how you read the Hudi table in s3://pythonscripts/hudi_read.py, and the full stacktrace as well? Which Hudi release are you using?
hudi_read.txt is actually a .py file. As the system does not support uploading a .py file, I had to rename it to .txt.
Hudi Version - 0.11.1
Please let me know if you need more details.
Can you share the contents of .hoodie, if feasible?
@soma1712 If you can please fill out the issue following the template, that would greatly help us in the investigation (for example: what version of Hudi are you using? Which Spark version?)
Can you please paste the full stacktrace of the exception?
@soma1712 Are you sure you're running 0.11.1?
It seems like you're running an earlier version: looking at your stacktrace, I see a method being used (createRowWithRequiredSchema) that was removed in 0.11.1:
22/07/30 04:09:14 WARN TaskSetManager: Lost task 2.0 in stage 2.0 (TID 7, ip-172-31-18-101.ec2.internal, executor 2): java.lang.ArrayIndexOutOfBoundsException: 9157
at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong(PlainValuesDictionary.java:165)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getLong(MutableColumnarRow.java:120)
at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.get(MutableColumnarRow.java:189)
at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3$$anonfun$createRowWithRequiredSchema$1.apply(HoodieMergeOnReadRDD.scala:278)
at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3$$anonfun$createRowWithRequiredSchema$1.apply(HoodieMergeOnReadRDD.scala:276)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.createRowWithRequiredSchema(HoodieMergeOnReadRDD.scala:275)
at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.hasNext(HoodieMergeOnReadRDD.scala:236)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:585)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:297)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:289)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
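One hedged way to confirm which Hudi bundle the jobs actually pick up (a diagnostic sketch, not something posted in this thread; it assumes an EMR-style layout where /usr/lib/hudi/hudi-spark-bundle.jar, as referenced in the spark-submit commands above, may be a symlink to a versioned jar):

```python
# Diagnostic sketch (assumption: EMR-style /usr/lib/hudi layout as used in the
# spark-submit commands above). List the Hudi jars on the node and resolve
# symlinks, since the versioned file names reveal the real bundle version.
import glob
import os

for path in sorted(glob.glob("/usr/lib/hudi/*.jar")):
    target = os.path.realpath(path)  # e.g. hudi-spark-bundle.jar -> hudi-spark-bundle-<version>.jar
    print(f"{path} -> {target}")
```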
@soma1712 I have been running a DeltaStreamer MOR table job for 30+ commits with the SQL transformer (though my setup is a bit different; I'm not using DMS). So far it's going well, with no issues. We have had long-running DeltaStreamer tests in our test infra. Could you try with the latest Hudi?
Closing due to inactivity and non-reproducibility.