[SUPPORT] Caused by: org.apache.hudi.exception.HoodieException: Executor executes action [commits the instant 20240618064120870] error
We are trying to create a COW table using Kafka as our source and S3 as our sink. The source comprises a list of Kafka topics. Checkpoints currently run every 2 minutes, and when a checkpoint starts (i.e., when the Hudi table files are committed), the Flink job throws the following exception:
2024-06-17 23:43:03
org.apache.flink.util.FlinkException: Global failure triggered by OperatorCoordinator for 'hoodie_append_write: <database_name>.<table_name>' (operator fa7c267a06b83c5e2dc3af13367ebe76).
at org.apache.flink.runtime.operators.coordination.OperatorCoordinatorHolder$LazyInitializedCoordinatorContext.failJob(OperatorCoordinatorHolder.java:556)
at org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$start$0(StreamWriteOperatorCoordinator.java:196)
at org.apache.hudi.sink.utils.NonThrownExecutor.handleException(NonThrownExecutor.java:142)
at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:133)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Caused by: org.apache.hudi.exception.HoodieException: Executor executes action [commits the instant 20240618064120870] error
... 6 more
Caused by: org.apache.hudi.exception.HoodieException: Commit instant [20240618064120870] failed and rolled back !
at org.apache.hudi.sink.StreamWriteOperatorCoordinator.doCommit(StreamWriteOperatorCoordinator.java:587)
at org.apache.hudi.sink.StreamWriteOperatorCoordinator.commitInstant(StreamWriteOperatorCoordinator.java:542)
at org.apache.hudi.sink.StreamWriteOperatorCoordinator.lambda$notifyCheckpointComplete$2(StreamWriteOperatorCoordinator.java:258)
at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130)
... 3 more
This pipeline is the only one writing to the table, so we don't have multiple writers.
Below is the config used for this:
options.put(FlinkOptions.PATH.key(), hudiBasePath);
options.put(FlinkOptions.TABLE_NAME.key(), targetTable);
options.put(FlinkOptions.TABLE_TYPE.key(), HoodieTableType.COPY_ON_WRITE.name());
options.put(FlinkOptions.PRECOMBINE_FIELD.key(), "timestamp"); // need to fix this as last_Update is null
options.put(FlinkOptions.IGNORE_FAILED.key(), "false");
options.put(HoodieIndexConfig.INDEX_TYPE.key(), HoodieIndex.IndexType.GLOBAL_BLOOM.name());
options.put(FlinkOptions.OPERATION.key(), WriteOperationType.INSERT.value());
options.put(FlinkOptions.PARTITION_PATH_FIELD.key(), "ts_date:string");
options.put("hoodie.parquet.small.file.limit", "104857600");
options.put("hoodie.parquet.max.file.size", "536870912");
options.put("clustering.schedule.enabled", "true");
options.put("clustering.async.enabled", "true");
options.put("hoodie.clustering.plan.strategy.max.bytes.per.group", "107374182400");
//options.put("hoodie.clustering.plan.strategy.max.num.groups", "1");
options.put("write.tasks", "1");
Environment Description
- Hudi version: 0.14.1
- Flink version: 1.15.2
- Storage (HDFS/S3/GCS..): S3
Based on some GitHub issue history, we did try deleting the .aux/ckp_meta/ dir, but still no luck.
Any pointers on how to go about fixing this would be much appreciated.
I see you set the option options.put(FlinkOptions.IGNORE_FAILED.key(), "false");. It looks like there is an error from the Parquet writers that is collected back to the coordinator, which is why the error is reported when committing.
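For clarity, a minimal sketch of what that collection conceptually looks like, assuming a list of WriteStatus objects gathered from the writers; the helper method is hypothetical, not Hudi's internal code:

import java.util.List;
import org.apache.hudi.client.WriteStatus;

// Hypothetical helper: with IGNORE_FAILED=false, any WriteStatus carrying
// per-record errors causes the coordinator to fail (and roll back) the commit.
static void reportWriteErrors(List<WriteStatus> statuses) {
  for (WriteStatus status : statuses) {
    if (status.hasErrors()) {
      // getErrors() maps each failed HoodieKey to the throwable that broke its write
      status.getErrors().forEach((key, t) ->
          System.err.println("Record " + key + " failed to write: " + t));
    }
  }
}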
Hmm. I didn't find any relevant errors in the TaskManager logs.
Changed IGNORE_FAILED to true and it seems to be working, but I don't see any data in the parquet files. They are all empty. Any idea how I should debug this further?
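One hedged way to confirm whether the files are truly empty is to count their rows with parquet-avro; the fileUri argument below is a placeholder (e.g. an s3a:// path):

import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;

// Count the records in a single parquet file to verify it actually holds data.
static long countRows(String fileUri) throws Exception {
  try (ParquetReader<GenericRecord> reader = AvroParquetReader.<GenericRecord>builder(
      HadoopInputFile.fromPath(new Path(fileUri), new Configuration())).build()) {
    long rows = 0;
    while (reader.read() != null) {
      rows++;
    }
    return rows;
  }
}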
I also get this exception in the same job:
22:51:20.093 [pool-286-thread-1] ERROR org.apache.hudi.common.table.view.PriorityBasedFileSystemView - Got error running preferred function. Trying secondary
org.apache.hudi.exception.HoodieRemoteException: Connect to 10.3.175.136:45105 [/10.3.175.136] failed: Connection refused (Connection refused)
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getPendingCompactionOperations(RemoteHoodieTableFileSystemView.java:547) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.execute(PriorityBasedFileSystemView.java:69) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.common.table.view.PriorityBasedFileSystemView.getPendingCompactionOperations(PriorityBasedFileSystemView.java:257) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.table.action.clean.CleanPlanner.<init>(CleanPlanner.java:98) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:107) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.requestClean(CleanPlanActionExecutor.java:159) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.table.action.clean.CleanPlanActionExecutor.execute(CleanPlanActionExecutor.java:185) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.table.HoodieFlinkCopyOnWriteTable.scheduleCleaning(HoodieFlinkCopyOnWriteTable.java:359) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.client.BaseHoodieTableServiceClient.scheduleTableServiceInternal(BaseHoodieTableServiceClient.java:629) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.client.BaseHoodieTableServiceClient.clean(BaseHoodieTableServiceClient.java:752) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:862) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.client.BaseHoodieWriteClient.clean(BaseHoodieWriteClient.java:835) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.sink.CleanFunction.lambda$open$0(CleanFunction.java:71) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.sink.utils.NonThrownExecutor.lambda$wrapAction$0(NonThrownExecutor.java:130) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: org.apache.hudi.org.apache.http.conn.HttpHostConnectException: Connect to 10.3.175.136:45105 [/10.3.175.136] failed: Connection refused (Connection refused)
at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.get(RemoteHoodieTableFileSystemView.java:629) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:207) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getPendingCompactionOperations(RemoteHoodieTableFileSystemView.java:543) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
... 16 more
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method) ~[?:?]
at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source) ~[?:?]
at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source) ~[?:?]
at java.net.AbstractPlainSocketImpl.connect(Unknown Source) ~[?:?]
at java.net.SocksSocketImpl.connect(Unknown Source) ~[?:?]
at java.net.Socket.connect(Unknown Source) ~[?:?]
at org.apache.hudi.org.apache.http.conn.socket.PlainConnectionSocketFactory.connectSocket(PlainConnectionSocketFactory.java:74) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:134) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:353) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:380) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:107) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:55) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.org.apache.http.client.fluent.Request.execute(Request.java:151) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.get(RemoteHoodieTableFileSystemView.java:629) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.executeRequest(RemoteHoodieTableFileSystemView.java:207) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
at org.apache.hudi.common.table.view.RemoteHoodieTableFileSystemView.getPendingCompactionOperations(RemoteHoodieTableFileSystemView.java:543) ~[blob_p-49658f01166115a275763af3064121d9588b4f90-c0a8066fe9067407b6e6321b83226118:0.0.1-SNAPSHOT]
... 16 more
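Side note on this connection-refused error: the RemoteHoodieTableFileSystemView talks to the embedded timeline server running inside the write client's process, and when that server is unreachable Hudi falls back to the secondary, file-system-based view, which is what "Trying secondary" means, so this is often non-fatal. A hedged mitigation sketch, assuming the embedded server is the problem; verify the key against your Hudi version:

// Hedged mitigation sketch: disable the embedded timeline server so file system
// view lookups go directly against storage instead of the coordinator's HTTP endpoint.
options.put("hoodie.embed.timeline.server", "false");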
The error is caught by the writer, and when an error occurs, it logs an error msg:
LOG.error("Error writing record " + record, t);
I don't think it was logging this anywhere. Tried running locally as well as on our k8s Flink cluster.
Was finally able to resolve this; the culprit was a data type mismatch in one of the columns. Not sure why this was not explicitly logged anywhere, though.
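A minimal sketch of how such a mismatch could be caught up front, assuming Avro schemas are available for both the incoming records and the table; SchemaChecker and both schema arguments are hypothetical:

import org.apache.avro.Schema;

// Hypothetical pre-write check: compare field types of the incoming record schema
// against the table schema and report mismatches before the job starts writing.
// (Coarse comparison: nullable union types would need extra unwrapping.)
public final class SchemaChecker {
  public static void checkCompatible(Schema incoming, Schema table) {
    for (Schema.Field tableField : table.getFields()) {
      Schema.Field inField = incoming.getField(tableField.name());
      if (inField == null) {
        System.err.println("Missing column: " + tableField.name());
      } else if (inField.schema().getType() != tableField.schema().getType()) {
        System.err.println("Type mismatch on '" + tableField.name()
            + "': incoming=" + inField.schema().getType()
            + ", table=" + tableField.schema().getType());
      }
    }
  }
}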
Maybe you should check your logger configuration.
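For example, with Flink's default log4j2 setup, a logger entry like the following in conf/log4j.properties would surface Hudi's per-record write errors; hedged, since your logging stack may differ:

# Ensure org.apache.hudi log output is not filtered out; DEBUG also surfaces
# extra detail around the "Error writing record ..." messages.
logger.hudi.name = org.apache.hudi
logger.hudi.level = DEBUG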