Beam Core IO: WriteFiles stuck
In our pipeline, workers are sometimes stuck at step WriteShardsIntoTempFiles. We believe this happens when GCS fails to respond for more than 10 minutes and Google's VPC cuts the connection. The Java code will just hang forever because it's waiting on FIN or RST from the peer.
Environment: Java 8, Beam 2.34/2.35, Dataflow runner.
See below for relevant information from logs, threads, netstat.
Log snippet:
Operation ongoing in step Write Events To File/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles
for at least 09h15m00s without outputting or completing in state process
at org.conscrypt.NativeCrypto.SSL_do_handshake(Native
Method)
at org.conscrypt.NativeSsl.doHandshake(NativeSsl.java:392)
at org.conscrypt.ConscryptFileDescriptorSocket.startHandshake(ConscryptFileDescriptorSocket.java:225)
at org.conscrypt.ConscryptFileDescriptorSocket.waitForHandshake(ConscryptFileDescriptorSocket.java:474)
at org.conscrypt.ConscryptFileDescriptorSocket.getOutputStream(ConscryptFileDescriptorSocket.java:461)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:465)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162)
at
com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:148)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1955)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1857)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1926)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:374)
at
org.apache.beam.sdk.extensions.gcp.util.GcsUtil.create(GcsUtil.java:476)
at org.apache.beam.sdk.extensions.gcp.util.GcsUtil.create(GcsUtil.java:446)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.create(GcsFileSystem.java:127)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.create(GcsFileSystem.java:60)
at org.apache.beam.sdk.io.FileSystems.create(FileSystems.java:239)
at org.apache.beam.sdk.io.FileSystems.create(FileSystems.java:226)
at org.apache.beam.sdk.io.FileBasedSink$Writer.open(FileBasedSink.java:935)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:768)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown
Source)
Operation ongoing in step Write Events To File/WriteFiles/WriteShardedBundlesToTempFiles/WriteShardsIntoTempFiles
for at least 04h55m00s without outputting or completing in state process
at sun.misc.Unsafe.park(Native
Method)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
at java.util.concurrent.FutureTask.awaitDone(FutureTask.java:429)
at java.util.concurrent.FutureTask.get(FutureTask.java:191)
at com.google.cloud.hadoop.util.BaseAbstractGoogleAsyncWriteChannel.waitForCompletionAndThrowIfUploadFailed(BaseAbstractGoogleAsyncWriteChannel.java:247)
at com.google.cloud.hadoop.util.BaseAbstractGoogleAsyncWriteChannel.close(BaseAbstractGoogleAsyncWriteChannel.java:168)
at org.apache.beam.sdk.io.FileBasedSink$Writer.close(FileBasedSink.java:1029)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:932)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown
Source)
Threads snippet:
"Thread-140" - Thread t@204
java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native
Method)
at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
at java.net.SocketInputStream.read(SocketInputStream.java:171)
at java.net.SocketInputStream.read(SocketInputStream.java:141)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.readFromSocket(ConscryptEngineSocket.java:920)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.processDataFromSocket(ConscryptEngineSocket.java:884)
at org.conscrypt.ConscryptEngineSocket$SSLInputStream.access$100(ConscryptEngineSocket.java:706)
at org.conscrypt.ConscryptEngineSocket.doHandshake(ConscryptEngineSocket.java:230)
at
org.conscrypt.ConscryptEngineSocket.startHandshake(ConscryptEngineSocket.java:209)
- locked
<27a37219> (a java.lang.Object)
at org.conscrypt.ConscryptEngineSocket.waitForHandshake(ConscryptEngineSocket.java:547)
at org.conscrypt.ConscryptEngineSocket.getOutputStream(ConscryptEngineSocket.java:290)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:465)
at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
- locked <1edf4d14> (a sun.net.www.protocol.https.HttpsClient)
at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:264)
at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:367)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1156)
at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1050)
at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:162)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:148)
at com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:84)
at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1012)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:514)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:455)
at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:565)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getObject(GoogleCloudStorageImpl.java:1970)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getItemInfo(GoogleCloudStorageImpl.java:1864)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.getWriteGeneration(GoogleCloudStorageImpl.java:1938)
at com.google.cloud.hadoop.gcsio.GoogleCloudStorageImpl.create(GoogleCloudStorageImpl.java:411)
at org.apache.beam.sdk.extensions.gcp.util.GcsUtil.create(GcsUtil.java:490)
at org.apache.beam.sdk.extensions.gcp.util.GcsUtil.create(GcsUtil.java:471)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.create(GcsFileSystem.java:144)
at org.apache.beam.sdk.extensions.gcp.storage.GcsFileSystem.create(GcsFileSystem.java:62)
at org.apache.beam.sdk.io.FileSystems.create(FileSystems.java:243)
at org.apache.beam.sdk.io.FileSystems.create(FileSystems.java:230)
at org.apache.beam.sdk.io.FileBasedSink$Writer.open(FileBasedSink.java:951)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn.processElement(WriteFiles.java:921)
at org.apache.beam.sdk.io.WriteFiles$WriteShardsIntoTempFilesFn$DoFnInvoker.invokeProcessElement(Unknown
Source)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:232)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SimpleDoFnRunner.processElement(SimpleDoFnRunner.java:188)
at org.apache.beam.runners.dataflow.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:339)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn$1.output(GroupAlsoByWindowsParDoFn.java:185)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner$1.outputWindowedValue(GroupAlsoByWindowFnRunner.java:108)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.lambda$onTrigger$1(ReduceFnRunner.java:1058)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner$$Lambda$1284/1132939800.output(Unknown
Source)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnContextFactory$OnTriggerContextImpl.output(ReduceFnContextFactory.java:445)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.SystemReduceFn.onTrigger(SystemReduceFn.java:130)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTrigger(ReduceFnRunner.java:1061)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.ReduceFnRunner.onTimers(ReduceFnRunner.java:771)
at org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:97)
at org.apache.beam.runners.dataflow.worker.StreamingGroupAlsoByWindowViaWindowSetFn.processElement(StreamingGroupAlsoByWindowViaWindowSetFn.java:43)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.invokeProcessElement(GroupAlsoByWindowFnRunner.java:121)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowFnRunner.processElement(GroupAlsoByWindowFnRunner.java:73)
at org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:80)
at org.apache.beam.runners.dataflow.worker.GroupAlsoByWindowsParDoFn.processElement(GroupAlsoByWindowsParDoFn.java:137)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ParDoOperation.process(ParDoOperation.java:44)
at org.apache.beam.runners.dataflow.worker.util.common.worker.OutputReceiver.process(OutputReceiver.java:49)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:212)
at org.apache.beam.runners.dataflow.worker.util.common.worker.ReadOperation.start(ReadOperation.java:163)
at org.apache.beam.runners.dataflow.worker.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:92)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:1437)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker.access$1100(StreamingDataflowWorker.java:165)
at org.apache.beam.runners.dataflow.worker.StreamingDataflowWorker$7.run(StreamingDataflowWorker.java:1113)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Locked ownable synchronizers:
- locked <4b11862a> (a java.util.concurrent.ThreadPoolExecutor$Worker
Netstat snippet:
tcp 1 0 our-pipeline:34942 internal-ip:443 CLOSE_WAIT
Imported from Jira BEAM-14067. Original Jira may contain additional context. Reported by: yianni.
I believe this is https://github.com/google/conscrypt/issues/864 and will be addressed once https://github.com/GoogleCloudDataproc/hadoop-connectors/pull/865 is in a 2.2.x release and beam depedencies are so google_cloud_bigdataoss_version includes this version.
In the meantime this can be avoided by disabling conscrypt
The fix should be in the 2.2.8 hadoop-connectors release, no ETA yet
.take-issue
Hello @scwhittle . We have released GCS connector 2.2.8 containing the fix for Conscrypt socket hangs.
https://mvnrepository.com/artifact/com.google.cloud.bigdataoss/gcs-connector/hadoop3-2.2.8
Thanks Chris! https://github.com/apache/beam/pull/23300 to update Beam
.reopen-issue
Reopening as #23300 is being rolled back due to #23588
Workarounds in the meantime are to:
- not enable the conscrypt security provider via enable_conscrypt_security_provider experiment
- update the bigdataoss:gcsio library used manually to 2.2.8 (and not use use_grpc_for_gcs experiment)
The reason for rollbacks here (use_grpc_for_gcs experiment) broke for other reasons, and I'm attempting an upgrade now to 2.2.16: https://github.com/apache/beam/pull/27679
This should be fixed by upgrade to 2.2.16 or higher above