[SUPPORT] - Hudi read on a MOR table is failing with ArrayIndexOutOfBounds exception
Tips before filing an issue
- Have you gone through our FAQs?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an issue directly.
Describe the problem you faced
To Reproduce
Expected behavior
Environment Description
- Hudi version :
- Spark version :
- Hive version :
- Hadoop version :
- Storage (HDFS/S3/GCS..) :
- Running on Docker? (yes/no) :
Additional context
Stacktrace
Detailed Notes -
We have incoming delta transactions from an Oracle-based application that are pushed to an S3 endpoint using AWS DMS. These CDC records are applied as upserts onto an already existing Hudi table in a different S3 bucket (initial load data). The upserts are run with the spark-submit below -
spark-submit \
  --deploy-mode client \
  --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.default.parallelism=500 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=3 \
  --conf spark.dynamicAllocation.cachedExecutorIdleTimeout=90s \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.app.name=<table_1> \
  --jars /usr/lib/spark/external/lib/spark-avro.jar,/usr/lib/hive/lib/hbase-client.jar /usr/lib/hudi/hudi-utilities-bundle.jar \
  --table-type MERGE_ON_READ \
  --op UPSERT \
  --hoodie-conf hoodie.datasource.hive_sync.jdbcurl=jdbc:hive2://localhost:10000 \
  --source-ordering-field dms_seq_no \
  --props s3://bucket/cdc.properties \
  --hoodie-conf hoodie.datasource.hive_sync.database=glue_db \
  --target-base-path s3://bucket/table_1 \
  --target-table table_1 \
  --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
  --hoodie-conf hoodie.deltastreamer.source.dfs.root=s3://bucket/ \
  --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
  --enable-sync
This table is then consumed by a downstream job. Below are the Hudi props and the spark-submit we execute to read the table and populate the downstream.
hoodie.datasource.write.partitionpath.field=
hoodie.datasource.write.keygenerator.class=org.apache.hudi.keygen.NonpartitionedKeyGenerator
hoodie.datasource.hive_sync.enable=true
hoodie.datasource.hive_sync.assume_date_partitioning=false
hoodie.datasource.hive_sync.partition_extractor_class=org.apache.hudi.hive.NonPartitionedExtractor
hoodie.parquet.small.file.limit=134217728
hoodie.parquet.max.file.size=1048576000
hoodie.cleaner.policy=KEEP_LATEST_COMMITS
hoodie.cleaner.commits.retained=1
hoodie.deltastreamer.transformer.sql=select CASE WHEN Op='D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted,* from <SRC>
hoodie.datasource.hive_sync.support_timestamp=true
hoodie.datasource.compaction.async.enable=true
hoodie.index.type=BLOOM
hoodie.compact.inline=true
hoodiecompactionconfig.INLINE_COMPACT_NUM_DELTA_COMMITS_PROP=5
hoodie.metadata.compact.max.delta.commits=5
hoodie.clean.automatic=true
hoodie.clean.async=true
hoodie.datasource.hive_sync.table=table_1
hoodie.datasource.write.recordkey.field=table_1_ID
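For clarity, here is a minimal, hedged sketch (not part of the original report) of what the SqlQueryBasedTransformer query above does, applied to a toy DataFrame; the Op column and the _hoodie_is_deleted output come from the props, while the other column names and the Spark session setup are illustrative assumptions.

```python
# Hedged sketch of the hoodie.deltastreamer.transformer.sql query above,
# run against a toy frame. Only Op and _hoodie_is_deleted are from the thread;
# table_1_ID and name are made-up columns for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformer_sketch").getOrCreate()

src = spark.createDataFrame(
    [("I", 1, "alice"), ("U", 2, "bob"), ("D", 3, "carol")],
    ["Op", "table_1_ID", "name"],
)
src.createOrReplaceTempView("src")

# Same query as in the props, with <SRC> bound to the "src" temp view:
# rows whose DMS Op is 'D' get _hoodie_is_deleted = true, so Hudi drops them on upsert.
transformed = spark.sql(
    "SELECT CASE WHEN Op = 'D' THEN TRUE ELSE FALSE END AS _hoodie_is_deleted, * FROM src"
)
transformed.show(truncate=False)
```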
spark-submit \
  --deploy-mode client \
  --conf spark.yarn.appMasterEnv.SPARK_HOME=/prod/null \
  --conf spark.executorEnv.SPARK_HOME=/prod/null \
  --conf spark.shuffle.service.enabled=true \
  --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar \
  s3://pythonscripts/hudi_read.py
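The contents of s3://pythonscripts/hudi_read.py were not shared in this thread; a minimal sketch of how such a snapshot read of the MOR table is typically written in PySpark (the query type, path, and session options here are assumptions, not the actual script) would look like:

```python
# Hedged sketch only -- not the actual s3://pythonscripts/hudi_read.py,
# which was never posted in this thread.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi_read")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# Snapshot query on the MOR table written by the DeltaStreamer job above;
# this merges base parquet files with any pending log files at read time.
df = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "snapshot")
    .load("s3://bucket/table_1")  # --target-base-path of the writer job
)

df.createOrReplaceTempView("table_1")
spark.sql("SELECT COUNT(*) FROM table_1").show()
```

The read job then fails with tasks repeatedly lost to ArrayIndexOutOfBoundsException: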
TaskSetManager: Lost task 32.2 in stage 6.0 (TID 253) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 1]
22/07/21 15:50:26 INFO TaskSetManager: Starting task 32.3 in stage 6.0 (TID 296, ip-172-31-16-236.ec2.internal, executor 1, partition 32, PROCESS_LOCAL, 8887 bytes)
22/07/21 15:50:26 INFO TaskSetManager: Lost task 33.2 in stage 6.0 (TID 256) on ip-172-31-16-236.ec2.internal, executor 1: java.lang.ArrayIndexOutOfBoundsException (null) [duplicate 2]
@soma1712 Could you share how you read the Hudi table in s3://pythonscripts/hudi_read.py, and the full stacktrace as well? Which Hudi release are you using?
hudi_read.txt is actually a .py file. As the system does not support uploading a .py file, I had to rename it to .txt.
Hudi Version - 0.11.1
Please let me know if you need more details.
Can you share the contents of .hoodie, if feasible?
@soma1712 If you can please fill out the issue following the template, that would greatly help us in the investigation (for example: what version of Hudi are you using? Which Spark version?)
Can you please paste the full stacktrace of the exception?
@soma1712 Are you sure you're running 0.11.1?
It seems like you're running an earlier version: looking at your stacktrace, I see a method being used (createRowWithRequiredSchema) that was removed in 0.11.1:
22/07/30 04:09:14 WARN TaskSetManager: Lost task 2.0 in stage 2.0 (TID 7, ip-172-31-18-101.ec2.internal, executor 2): java.lang.ArrayIndexOutOfBoundsException: 9157
at org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainLongDictionary.decodeToLong(PlainValuesDictionary.java:165)
at org.apache.spark.sql.execution.datasources.parquet.ParquetDictionary.decodeToLong(ParquetDictionary.java:36)
at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:364)
at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.getLong(MutableColumnarRow.java:120)
at org.apache.spark.sql.execution.vectorized.MutableColumnarRow.get(MutableColumnarRow.java:189)
at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3$$anonfun$createRowWithRequiredSchema$1.apply(HoodieMergeOnReadRDD.scala:278)
at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3$$anonfun$createRowWithRequiredSchema$1.apply(HoodieMergeOnReadRDD.scala:276)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.createRowWithRequiredSchema(HoodieMergeOnReadRDD.scala:275)
at org.apache.hudi.HoodieMergeOnReadRDD$$anon$3.hasNext(HoodieMergeOnReadRDD.scala:236)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:585)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:297)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:289)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:858)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:346)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:310)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:123)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
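One hedged way to confirm which Hudi bundle the jobs actually pick up (a diagnostic sketch, not something posted in this thread; it assumes an EMR-style layout where /usr/lib/hudi/hudi-spark-bundle.jar, as referenced in the spark-submit commands above, may be a symlink to a versioned jar):

```python
# Diagnostic sketch (assumption: EMR-style /usr/lib/hudi layout as used in the
# spark-submit commands above). List the Hudi jars on the node and resolve
# symlinks, since the versioned file names reveal the real bundle version.
import glob
import os

for path in sorted(glob.glob("/usr/lib/hudi/*.jar")):
    target = os.path.realpath(path)  # e.g. hudi-spark-bundle.jar -> hudi-spark-bundle-<version>.jar
    print(f"{path} -> {target}")
```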
@soma1712 I have been running a DeltaStreamer MOR table job for 30+ commits with the SQL transformer (though my setup is a bit different; I'm not using DMS). So far it's going well, with no issues. We have had long-running DeltaStreamer tests in our test infra. Could you try with the latest Hudi?
Closing due to inactivity and non-reproducibility.