
[SUPPORT] Caused by: org.apache.hudi.exception.HoodieException: Executor executes action [commits the instant 20240618064120870] error

Open ankit0811 opened this issue 1 year ago • 6 comments

We are trying to create a COW table with Kafka as our source and S3 as our sink. The source comprises a list of Kafka topics. Checkpoints run every 2 minutes, and when a checkpoint starts (i.e., when the Hudi table files are committed), the Flink job throws the following exception:

2024-06-17 23:43:03
org.apache.flink.util.FlinkException: Global failure triggered by OperatorCoordinator for 'hoodie_append_write: <database_name>.<table_name>' (operator fa7c267a06b83c5e2dc3af13367ebe76).
	at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder$LazyInitializedCoordinatorContext.failJob(OperatorCoordinatorHolder.java:556)
	at org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$start$0(StreamWriteOperatorCoordinator.java:196)
	at org.apache.hudi.sink.utils.NonThrownExecutor.handleException(NonThrownExecutor.java:142)
	at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:133)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.hudi.exception.HoodieException: Executor executes action [commits the instant 20240618064120870] error
	... 6 more
Caused by: org.apache.hudi.exception.HoodieException: Commit instant [20240618064120870] failed and rolled back !
	at org.apache.hudi.sink.StreamWriteOperatorCoordinator.doCommit(StreamWriteOperatorCoordinator.java:587)
	at org.apache.hudi.sink.StreamWriteOperatorCoordinator.commitInstant(StreamWriteOperatorCoordinator.java:542)
	at org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$notifyCheckpointComplete$2(StreamWriteOperatorCoordinator.java:258)
	at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130)
	... 3 more

This pipeline is the only one writing to the table, so we don't have multiple writers.

Below is the configuration used for this job:

options.put(FlinkOptions.PATH.key(), hudiBasePath);
options.put(FlinkOptions.TABLE_NAME.key(), targetTable);
options.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.COPY_ON_WRITE.name());
options.put(FlinkOptions.PRECOMBINE_FIELD.key(), "timestamp"); // need to fix this as last_Update is null
options.put(FlinkOptions.IGNORE_FAILED.key(), "false");
options.put(HoodieIndexConfig.INDEX_TYPE.key(), HoodieIndex.IndexType.GLOBAL_BLOOM.name());
options.put(FlinkOptions.OPERATION.key(), WriteOperationType.INSERT.value());
options.put(FlinkOptions.PARTITION_PATH_FIELD.key(), "ts_date:string");
options.put("hoodie.parquet.small.file.limit", "104857600");
options.put("hoodie.parquet.max.file.size", "536870912");
options.put("clustering.schedule.enabled", "true");
options.put("clustering.async.enabled", "true");
options.put("hoodie.clustering.plan.strategy.max.bytes.per.group", "107374182400");
//options.put("hoodie.clustering.plan.strategy.max.num.groups", "1");
options.put("write.tasks", "1");

Environment Description

  • Hudi version : 0.14.1

  • Flink version : 1.15.2

  • Storage (HDFS/S3/GCS..) : s3

Based on some GitHub issue history, we tried deleting the .aux/ckp_meta/ directory, but still no luck.

Any pointers on how to fix this would be much appreciated.

ankit0811 avatar Jun 18 '24 06:06 ankit0811

I see you set the option options.put(FlinkOptions.IGNORE_FAILED.key(), "false");. It looks like there is an error from the parquet writers that is collected back to the coordinator, so the coordinator reports the error when committing.
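A minimal sketch of this error-collection pattern (a simplified illustration, not Hudi's actual classes; `CommitCoordinatorSketch` is a hypothetical name): write tasks swallow per-record failures and report them back, and the coordinator decides at commit time based on an ignore-failed flag.

```java
import java.util.concurrent.ConcurrentLinkedQueue;

// Simplified sketch of the behavior described above (NOT Hudi's real code):
// write tasks collect errors instead of rethrowing, and the coordinator
// fails the commit if any error was collected and ignoreFailed is false.
public class CommitCoordinatorSketch {
    private final ConcurrentLinkedQueue<Throwable> collectedErrors = new ConcurrentLinkedQueue<>();
    private final boolean ignoreFailed;

    public CommitCoordinatorSketch(boolean ignoreFailed) {
        this.ignoreFailed = ignoreFailed;
    }

    // Called by each write task; errors are collected, not rethrown.
    public void writeRecord(Object record) {
        try {
            if (record == null) {
                throw new IllegalArgumentException("bad record");
            }
            // ... the actual parquet write would happen here ...
        } catch (Throwable t) {
            collectedErrors.add(t); // reported back to the coordinator
        }
    }

    // Called on checkpoint complete: with ignoreFailed=false, any collected
    // error fails the whole commit (the behavior seen in the stack trace).
    public boolean commit() {
        if (!collectedErrors.isEmpty() && !ignoreFailed) {
            return false; // commit fails and the instant is rolled back
        }
        collectedErrors.clear();
        return true;
    }
}
```

With ignoreFailed=false a single bad record fails the whole checkpoint commit; with true, the commit succeeds but the bad records are silently dropped, which matches the empty-parquet symptom reported later in this thread.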

danny0405 avatar Jun 18 '24 09:06 danny0405

Hmm, I didn't find any relevant errors in the TaskManager logs.

I changed IGNORE_FAILED to true and the job seems to be working, but I don't see any data in the parquet files; they are all empty. Any idea how I should debug this further?

ankit0811 avatar Jun 18 '24 15:06 ankit0811

I also get this exception in the same job:

22:51:20.093 [pool-286-thread-1] ERROR org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: Connect to 10.3.175.136:45105 [/10.3.175.136] failed: Connection refused (Connection refused)
	at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getPendingCompactionOperations(RemoteHoodieTableFileSystemView.java:547) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:69) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getPendingCompactionOperations(PriorityBasedFileSystemView.java:257) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.table.action.clean.CleanPlanner.<init>(CleanPlanner.java:98) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:107) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:159) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.execute(CleanPlanActionExecutor.java:185) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.table.HoodieFlinkCopyOnWriteTable.scheduleCleaning(HoodieFlinkCopyOnWriteTable.java:359) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.client.BaseHoodieTableServiceClient.scheduleTableServiceInternal(BaseHoodieTableServiceClient.java:629) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:752) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:862) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:835) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.sink.CleanFunction.lambda$open$0(CleanFunction.java:71) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
	at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.apache.hudi.org.apache.http.conn.HttpHostConnectException: Connect to 10.3.175.136:45105 [/10.3.175.136] failed: Connection refused (Connection refused)
	at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.get(RemoteHoodieTableFileSystemView.java:629) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:207) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getPendingCompactionOperations(RemoteHoodieTableFileSystemView.java:543) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	... 16 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
	at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source) ~[?:?]
	at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source) ~[?:?]
	at java.net.AbstractPlainSocketImpl.connect(Unknown Source) ~[?:?]
	at java.net.SocksSocketImpl.connect(Unknown Source) ~[?:?]
	at java.net.Socket.connect(Unknown Source) ~[?:?]
	at org.apache.hudi.org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:74) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:134) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.get(RemoteHoodieTableFileSystemView.java:629) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:207) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getPendingCompactionOperations(RemoteHoodieTableFileSystemView.java:543) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
	... 16 more

ankit0811 avatar Jun 18 '24 22:06 ankit0811

I changed IGNORE_FAILED to true and the job seems to be working, but I don't see any data in the parquet files; they are all empty. Any idea how I should debug this further?

The error is caught by the writer, and when an error occurs, it logs an error message:

LOG.error("Error writing record " + record, t);

danny0405 avatar Jun 19 '24 01:06 danny0405

The error is caught by the writer, and when an error occurs, it logs an error message:

LOG.error("Error writing record " + record, t);

I don't think it was logging this anywhere. I tried running locally as well as on our k8s Flink cluster.

I was finally able to resolve this: the culprit was a data type mismatch in one of the columns. Not sure why this was not explicitly logged anywhere, though.
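Since the writer didn't surface the offending column, one way to catch this class of bug earlier is a pre-write sanity check on the source records. The sketch below is hypothetical (the `SchemaSanityCheck` helper and the map-based record shape are assumptions for illustration, not part of Hudi or this job); it compares each field's runtime type against the expected column type and names the mismatched column:

```java
import java.util.Map;

// Hypothetical pre-write sanity check (not part of Hudi): compare each
// field's runtime type against the expected column type and report the
// offending column by name, instead of letting the writer fail silently.
public class SchemaSanityCheck {
    // Returns a description of the first mismatched column, or null if
    // every non-null field matches its expected type.
    public static String findMismatch(Map<String, Class<?>> expected, Map<String, Object> record) {
        for (Map.Entry<String, Class<?>> col : expected.entrySet()) {
            Object value = record.get(col.getKey());
            if (value != null && !col.getValue().isInstance(value)) {
                return String.format("column '%s': expected %s but got %s",
                        col.getKey(), col.getValue().getSimpleName(),
                        value.getClass().getSimpleName());
            }
        }
        return null; // no mismatch found
    }
}
```

Running a check like this on a sample of records before the sink (or in a map function that logs and drops bad records) would have pointed directly at the mismatched column instead of leaving empty parquet files.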

ankit0811 avatar Jun 19 '24 20:06 ankit0811

Maybe you should check your logger configuration.

danny0405 avatar Jun 20 '24 00:06 danny0405