
[SUPPORT] HoodieException: Failed to apply clean commit to metadata - HUDI 0.15

sushant-searce opened this issue 1 year ago • 5 comments

Hello Community,

We are facing an issue with a streaming pipeline on Hudi 0.15 (hoodie.metadata.enable=true) while migrating from 0.12 to 0.15. The pipeline runs successfully with hoodie.metadata.enable=false but fails with hoodie.metadata.enable=true.

We need your help to debug this issue. If you have faced similar issues in the past, please let me know.

ISSUE : 1

[2024-08-11 21:22:35,835] ERROR org.apache.hudi.exception.HoodieException: Error waiting for async clean service to finish (com.pe.skull.titan.utils.SparkUtils)
[2024-08-11 21:22:35,836] ERROR writing data for pe_oms_product_oms.product_order_eligible_payment_modes_snapshot_nrt attempt: 3 (com.pe.skull.titan.utils.SparkUtils)
[2024-08-11 21:22:40,290] ERROR Service shutdown with error (org.apache.hudi.async.HoodieAsyncService)
java.util.concurrent.ExecutionException: org.apache.hudi.exception.HoodieException: Failed to apply clean commit to metadata
    at java.base/java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:395) ~[?:?]
    at java.base/java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1999) ~[?:?]

Caused by: org.apache.hudi.exception.HoodieException: Failed to apply clean commit to metadata
    at org.apache.hudi.table.action.BaseActionExecutor.writeTableMetadata(BaseActionExecutor.java:93) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.clean.CleanActionExecutor.runClean(CleanActionExecutor.java:235) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.clean.CleanActionExecutor.runPendingClean(CleanActionExecutor.java:200) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]

Caused by: java.lang.IllegalArgumentException: File gs://pe-skull-data/pe/processed/nrt_snapshot/data/pe_oms_product_oms/product_order_eligible_payment_modes_snapshot_nrt/.hoodie/metadata/.hoodie/20240811212153175.deltacommit.requested does not exist!
    at org.apache.hudi.common.util.ValidationUtils.checkArgument(ValidationUtils.java:42) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionState(HoodieActiveTimeline.java:617) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.common.table.timeline.HoodieActiveTimeline.transitionRequestedToInflight(HoodieActiveTimeline.java:683) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]


ISSUE : 2

[2024-08-10 01:21:09,735] ERROR Failed to perform previous clean operation, instant: [==>20240810012008022__clean__INFLIGHT__20240810012011211] (org.apache.hudi.table.action.clean.CleanActionExecutor)
org.apache.hudi.exception.HoodieCommitException: Failed to complete commit 20240810012008022 due to finalize errors.
    at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.finalizeWrite(BaseCommitActionExecutor.java:223) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.commit.BaseSparkCommitActionExecutor.commit(BaseSparkCommitActionExecutor.java:303) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.autoCommit(BaseCommitActionExecutor.java:202) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.commitOnAutoCommit(BaseCommitActionExecutor.java:184) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]

Caused by: org.apache.hudi.exception.HoodieIOException: listStatus(hadoopPath: gs://pe-skull-data/pe/processed/nrt_snapshot/data/pe_mercury_vault/bk_vendor_snapshot_nrt/.hoodie/metadata/.hoodie/.temp/20240810012008022): 'gs://pe-skull-data/pe/processed/nrt_snapshot/data/pe_mercury_vault/bk_vendor_snapshot_nrt/.hoodie/metadata/.hoodie/.temp/20240810012008022' does not exist.
    at org.apache.hudi.table.HoodieTable.reconcileAgainstMarkers(HoodieTable.java:784) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.HoodieTable.finalizeWrite(HoodieTable.java:696) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.HoodieSparkMergeOnReadTable.finalizeWrite(HoodieSparkMergeOnReadTable.java:230) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    at org.apache.hudi.table.action.commit.BaseCommitActionExecutor.finalizeWrite(BaseCommitActionExecutor.java:220) ~[hudi-spark3.5-bundle_2.12-0.15.0.jar:0.15.0]
    ... 27 more
Caused by: java.io.FileNotFoundException: listStatus(hadoopPath: gs://pe-skull-data/pe/processed/nrt_snapshot/data/pe_mercury_vault/bk_vendor_snapshot_nrt/.hoodie/metadata/.hoodie/.temp/20240810012008022): 'gs://pe-skull-data/pe/processed/nrt_snapshot/data/pe_mercury_vault/bk_vendor_snapshot_nrt/.hoodie/metadata/.hoodie/.temp/20240810012008022' does not exist.
    at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem.lambda$listStatus$11(GoogleHadoopFileSystem.java:801) ~[gcs-connector-3.0.1.jar:?]
    at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547) ~[hadoop-client-api-3.3.6.jar:?]

Hoodie options used:

hoodie.cleaner.commits.retained: 10
hoodie.metadata.keep.max.commits: 30
hoodie.metadata.clean.async: false
hoodie.keep.max.commits: 30
hoodie.metadata.keep.min.commits: 20
hoodie.archive.async: false
hoodie.clean.automatic: true
hoodie.finalize.write.parallelism: 200
hoodie.fail.on.timeline.archiving: false
hoodie.clean.async: true
hoodie.parquet.max.file.size: 128000000
hoodie.datasource.hive_sync.support_timestamp: true
#DISABLING METADATA TO REDUCE FREQUENT CALLS TO GCS
hoodie.metadata.enable: false
hoodie.datasource.write.hive_style_partitioning: true
hoodie.parquet.small.file.limit: 100000000
hoodie.datasource.hive_sync.enable: true
hoodie.bulkinsert.shuffle.parallelism: 200
hoodie.keep.min.commits: 11
hoodie.datasource.meta.sync.enable: true
hoodie.metadata.cleaner.commits.retained: 3
hoodie.cleaner.incremental.mode: true
hoodie.commits.archival.batch: 12
hoodie.upsert.shuffle.parallelism: 200
hive_sync.support_timestamp: true
hoodie.insert.shuffle.parallelism: 200
hoodie.metadata.compact.max.delta.commits: 10
compaction.delta_commits: 5
metadata.compaction.delta_commits: 10
hoodie.compact.inline.max.delta.commits: 5
hoodie.archive.automatic: true
hoodie.cleaner.parallelism: 200

sushant-searce avatar Aug 14 '24 09:08 sushant-searce

cc @nsivabalan for visibility.

danny0405 avatar Aug 14 '24 10:08 danny0405

@sushant-searce It looks like your metadata table got corrupted somehow. Can you provide us the timelines for both the data table and the metadata table? If that's fine with you, you can zip the .hoodie directory and attach it here.

The root cause may be the async clean. You can try inline cleaning and see if you still hit this issue.
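
For illustration, a minimal sketch (not the reporter's actual writer code; the table name, base path, and upsert operation are assumptions) of a Spark write configured for inline cleaning:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

// Minimal sketch: keep automatic cleaning, but run it inline with each commit
// instead of in the async clean service.
public class InlineCleanWrite {
    public static void write(Dataset<Row> batch, String tableName, String basePath) {
        batch.write()
             .format("hudi")
             .option("hoodie.table.name", tableName)
             .option("hoodie.datasource.write.operation", "upsert") // assumed operation type
             .option("hoodie.clean.automatic", "true")              // cleaning still triggered after commits
             .option("hoodie.clean.async", "false")                 // inline clean instead of the async service
             .mode(SaveMode.Append)
             .save(basePath);
    }
}
```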

ad1happy2go avatar Aug 14 '24 10:08 ad1happy2go

@ad1happy2go

I won't be able to share the .hoodie directory due to confidentiality.

Made the below changes in the Hoodie options:

hoodie.clean.async: false
hoodie.cleaner.policy: KEEP_LATEST_FILE_VERSIONS
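
For reference, a hypothetical sketch (not taken from this pipeline) of how these two overrides could be expressed as Hudi writer options and merged into whatever options map the writer already uses:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: the two changed cleaner settings as writer options.
public class CleanerOverrides {
    public static Map<String, String> overrides() {
        Map<String, String> opts = new HashMap<>();
        opts.put("hoodie.clean.async", "false");                        // run the cleaner inline with the commit
        opts.put("hoodie.cleaner.policy", "KEEP_LATEST_FILE_VERSIONS"); // retain files by version count
        // With this policy the retained count comes from hoodie.cleaner.fileversions.retained;
        // the value here is only an example, not from the reporter's configuration.
        opts.put("hoodie.cleaner.fileversions.retained", "3");
        return opts;
    }
}
```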

It's a streaming pipeline, so it's running right now, but I will let you know if the pipeline fails with any issue in a couple of hours.

Please keep this issue open for a couple of days; we will monitor the streaming pipeline in the meantime.

sushant-searce avatar Aug 14 '24 11:08 sushant-searce

Thanks @sushant-searce for the update.

ad1happy2go avatar Aug 14 '24 12:08 ad1happy2go

@sushant-searce Are we good to close this issue?

ad1happy2go avatar Oct 17 '24 12:10 ad1happy2go

Hello @ad1happy2go

The pipeline was running in production for the past 2 months, but all of a sudden the pipelines are failing.

Sharing error traces with you for one of the pipelines. Can anyone help here? It's very urgent and is causing a production outage.

ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used org.apache.hudi.exception.HoodieMetaSyncException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:81) at org.apache.hudi.HoodieSparkSqlWriterInternal.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:1015) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.hudi.HoodieSparkSqlWriterInternal.metaSync(HoodieSparkSqlWriter.scala:1013) at org.apache.hudi.HoodieSparkSqlWriterInternal.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1112) at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:508) at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:473) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:473) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:449) at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98) at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85) at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83) at 
org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:859) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:361) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240) at com.pe.skull.titan.utils.SparkUtils.writeToTable(SparkUtils.java:137) at com.pe.skull.titan.tasks.PipelineRunner.writeData(PipelineRunner.java:154) at com.pe.skull.titan.tasks.PipelineRunner.processBatch(PipelineRunner.java:118) at com.pe.skull.titan.tasks.PipelineRunner.lambda$startPipelines$51830645$1(PipelineRunner.java:77) at org.apache.spark.sql.streaming.DataStreamWriter.$anonfun$foreachBatch$1(DataStreamWriter.scala:505) at org.apache.spark.sql.streaming.DataStreamWriter.$anonfun$foreachBatch$1$adapted(DataStreamWriter.scala:505) at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:34) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:732) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:729) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:729) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:286) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239) at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211) at 
scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211) Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing product_versions_snapshot_nrt at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:170) at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:79) ... 71 more Caused by: java.lang.NullPointerException at org.apache.hudi.common.table.timeline.TimelineUtils.lambda$null$5(TimelineUtils.java:114) at java.base/java.util.HashMap.forEach(HashMap.java:1337) at org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getDroppedPartitions$6(TimelineUtils.java:113) at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) at org.apache.hudi.common.table.timeline.TimelineUtils.getDroppedPartitions(TimelineUtils.java:110) at org.apache.hudi.sync.common.HoodieSyncClient.getDroppedPartitionsSince(HoodieSyncClient.java:97) at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:289) at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:179) at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:167) ... 72 more org.apache.hudi.exception.HoodieMetaSyncException: Could not sync using the meta sync class org.apache.hudi.hive.HiveSyncTool at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:81) at org.apache.hudi.HoodieSparkSqlWriterInternal.$anonfun$metaSync$2(HoodieSparkSqlWriter.scala:1015) at scala.collection.mutable.HashSet.foreach(HashSet.scala:79) at org.apache.hudi.HoodieSparkSqlWriterInternal.metaSync(HoodieSparkSqlWriter.scala:1013) at org.apache.hudi.HoodieSparkSqlWriterInternal.commitAndPerformPostOperations(HoodieSparkSqlWriter.scala:1112) at org.apache.hudi.HoodieSparkSqlWriterInternal.writeInternal(HoodieSparkSqlWriter.scala:508) at org.apache.hudi.HoodieSparkSqlWriterInternal.write(HoodieSparkSqlWriter.scala:187) at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:125) at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:168) at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:48) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:107) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66) at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:107) at 
org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:98) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:473) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:76) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:473) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:32) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:449) at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:98) at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:85) at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:83) at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:142) at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:859) at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:388) at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:361) at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:240) at com.pe.skull.titan.utils.SparkUtils.writeToTable(SparkUtils.java:137) at com.pe.skull.titan.tasks.PipelineRunner.writeData(PipelineRunner.java:154) at com.pe.skull.titan.tasks.PipelineRunner.processBatch(PipelineRunner.java:118) at com.pe.skull.titan.tasks.PipelineRunner.lambda$startPipelines$51830645$1(PipelineRunner.java:77) at org.apache.spark.sql.streaming.DataStreamWriter.$anonfun$foreachBatch$1(DataStreamWriter.scala:505) at org.apache.spark.sql.streaming.DataStreamWriter.$anonfun$foreachBatch$1$adapted(DataStreamWriter.scala:505) at org.apache.spark.sql.execution.streaming.sources.ForeachBatchSink.addBatch(ForeachBatchSink.scala:34) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:732) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:125) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:201) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:108) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$16(MicroBatchExecution.scala:729) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runBatch(MicroBatchExecution.scala:729) at 
org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$2(MicroBatchExecution.scala:286) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken(ProgressReporter.scala:427) at org.apache.spark.sql.execution.streaming.ProgressReporter.reportTimeTaken$(ProgressReporter.scala:425) at org.apache.spark.sql.execution.streaming.StreamExecution.reportTimeTaken(StreamExecution.scala:67) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runActivatedStream$1(MicroBatchExecution.scala:249) at org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:67) at org.apache.spark.sql.execution.streaming.MicroBatchExecution.runActivatedStream(MicroBatchExecution.scala:239) at org.apache.spark.sql.execution.streaming.StreamExecution.$anonfun$runStream$1(StreamExecution.scala:311) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:900) at org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:289) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.$anonfun$run$1(StreamExecution.scala:211) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.JobArtifactSet$.withActiveJobArtifactState(JobArtifactSet.scala:94) at org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:211) Caused by: org.apache.hudi.exception.HoodieException: Got runtime exception when hive syncing i_recommendations_widget_shown_snapshot_nrt at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:170) at org.apache.hudi.sync.common.util.SyncUtilHelpers.runHoodieMetaSync(SyncUtilHelpers.java:79) ... 71 more Caused by: java.lang.NullPointerException at org.apache.hudi.common.table.timeline.TimelineUtils.lambda$null$5(TimelineUtils.java:114) at java.base/java.util.HashMap.forEach(HashMap.java:1337) at org.apache.hudi.common.table.timeline.TimelineUtils.lambda$getDroppedPartitions$6(TimelineUtils.java:113) at java.base/java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1655) at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658) at org.apache.hudi.common.table.timeline.TimelineUtils.getDroppedPartitions(TimelineUtils.java:110) at org.apache.hudi.sync.common.HoodieSyncClient.getDroppedPartitionsSince(HoodieSyncClient.java:97) at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:289) at org.apache.hudi.hive.HiveSyncTool.doSync(HiveSyncTool.java:179) at org.apache.hudi.hive.HiveSyncTool.syncHoodieTable(HiveSyncTool.java:167) ... 72 more

sushant-searce avatar Oct 31 '24 05:10 sushant-searce

Hello Team,

Hudi 0.15: we have performed some troubleshooting steps and tried different hoodie properties. Sharing the test cases with you.

Test Case 1 Pipeline Run Status : Success

hoodie.clean.automatic: false 
hoodie.clean.async: false 
hoodie.datasource.hive_sync.enable: false 

Test Case 2 Pipeline Run Status : Fail [ Table is different from case #1 ]

hoodie.clean.automatic: false 
hoodie.clean.async: false 
hoodie.datasource.hive_sync.enable: true 

Test Case 3 Pipeline Run Status : Success [ Table is different from case #1 ]

hoodie.clean.automatic: true 
hoodie.clean.async: false  
hoodie.datasource.hive_sync.enable: false

Test Case 4 Pipeline Run Status : Success [ Table same as #1 ]

hoodie.clean.automatic: true 
hoodie.clean.async: false  
hoodie.datasource.hive_sync.enable: true

sushant-searce avatar Nov 01 '24 06:11 sushant-searce

As you can see from the test cases I have shared above, the pipeline was working yesterday after disabling and then re-enabling hive_sync.

Yesterday we disabled hive_sync in the pipeline and it ran successfully; we enabled it again for the next run, and that also ran successfully,

but in today's run it failed with the same error.

It is very concerning. Is there anything we are missing here?

Sharing the Hoodie options as well.

hudiOptions:

hoodie.cleaner.commits.retained: 10
hoodie.metadata.keep.max.commits: 30
hoodie.metadata.clean.async: false
hoodie.keep.max.commits: 30
hoodie.metadata.keep.min.commits: 20
hoodie.archive.async: false
hoodie.clean.automatic: true 
hoodie.finalize.write.parallelism: 200 
hoodie.fail.on.timeline.archiving: false 
hoodie.clean.async: false 
hoodie.parquet.max.file.size: 128000000 
hoodie.datasource.hive_sync.support_timestamp: true
#DISABLING METADATA TO REDUCE FREQUENT CALLS TO GCS
hoodie.metadata.enable: false
hoodie.datasource.write.hive_style_partitioning: true
hoodie.parquet.small.file.limit: 100000000 
hoodie.datasource.hive_sync.enable: true 
hoodie.bulkinsert.shuffle.parallelism: 200 
hoodie.keep.min.commits: 11
hoodie.datasource.meta.sync.enable: true 
hoodie.metadata.cleaner.commits.retained: 3 
hoodie.cleaner.incremental.mode: true
hoodie.commits.archival.batch: 12
hoodie.upsert.shuffle.parallelism: 200 
hive_sync.support_timestamp: true 
hoodie.insert.shuffle.parallelism: 200 
hoodie.metadata.compact.max.delta.commits: 10
compaction.delta_commits: 5
metadata.compaction.delta_commits: 10
hoodie.compact.inline.max.delta.commits: 5
hoodie.archive.automatic: true
hoodie.cleaner.parallelism: 200

sushant-searce avatar Nov 01 '24 06:11 sushant-searce

@sushant-searce Is there any reason why you are setting hoodie.fail.on.timeline.archiving to false? Also, disabling the metadata table will ideally increase the number of GCS calls, since the writer then has to rely on listing operations.

It still looks concerning though, as it's a NullPointerException in org.apache.hudi.common.table.timeline.TimelineUtils.getDroppedPartitions.

Are you using only upsert, or other operations as well on this table?

ad1happy2go avatar Nov 07 '24 11:11 ad1happy2go

Issue #2 is a cleaner failure while finalizing the write. The scenario: async cleaning is enabled and the clean went inflight; when we then finalize the write and reconcile against markers, the cleaner fails because the reconciliation looks up the marker directory for the clean instant itself, which does not exist.

codope avatar Mar 27 '25 18:03 codope