ECONNRESET [{"message":"14 UNAVAILABLE: read ECONNRESET"}]
Problem description
We are experiencing intermittent connection reset errors after migrating to grpc-js version "1.3.2". We create a new gRPC client instance every time a message is processed.
On the client side we see the stack trace ECONNRESET [{"message":"14 UNAVAILABLE: read ECONNRESET"}], and no log messages appear on the server side even though GRPC_TRACE=all and GRPC_VERBOSITY=DEBUG are set on the server side as well.
We have attached the debug logs seen on the client side. This issue is seen in our production k8s environment and it is intermittent.
Environment
- Node version 16.9.0
- Package name and version
@grpc/grpc-js: 1.3.2 and @grpc/proto-loader: 0.6.2
Additional context
Client Logging:
2021-09-19T16:53:33.253Z | call_stream | [79] write() called with message of length 34
2021-09-19T16:53:33.255Z | call_stream | [79] end() called
2021-09-19T16:53:33.258Z | call_stream | [79] deferring writing data chunk of length 39
2021-09-19T16:53:33.263Z | dns_resolver | Resolved addresses for target dns:x-des:50051: [x.x.x.x:50051]
2021-09-19T16:53:33.264Z | pick_first | Connect to address list x.x.x.x:50051
2021-09-19T16:53:33.264Z | subchannel_refcount | x.x.x.x:50051 refcount 4 -> 5
2021-09-19T16:53:33.264Z | pick_first | Pick subchannel with address x.x.x.x:50051
2021-09-19T16:53:33.264Z | pick_first | IDLE -> READY
2021-09-19T16:53:33.264Z | resolving_load_balancer | dns:x-des:50051 CONNECTING -> READY
2021-09-19T16:53:33.264Z | channel | callRefTimer.unref | configSelectionQueue.length=1 pickQueue.length=0
2021-09-19T16:53:33.265Z | connectivity_state | dns:x-des:50051 CONNECTING -> READY
2021-09-19T16:53:33.265Z | subchannel_refcount | x.x.x.x:50051 refcount 5 -> 6
2021-09-19T16:53:33.265Z | subchannel_refcount | x.x.x.x:50051 refcount 6 -> 5
2021-09-19T16:53:33.265Z | channel | Pick result: COMPLETE subchannel: x.x.x.x:50051 status: undefined undefined
2021-09-19T16:53:33.266Z | call_stream | Starting stream on subchannel x.x.x.x:50051 with headers
grpc-timeout: 99946m
grpc-accept-encoding: identity,deflate,gzip
accept-encoding: identity
:authority: x-des:50051
user-agent: grpc-node-js/1.3.7
content-type: application/grpc
:method: POST
:path: /io.xingular.device.Service/Read
te: trailers
2021-09-19T16:53:33.266Z | call_stream | [79] attachHttp2Stream from subchannel x.x.x.x:50051
2021-09-19T16:53:33.266Z | subchannel_refcount | x.x.x.x:50051 callRefcount 0 -> 1
2021-09-19T16:53:33.266Z | call_stream | [79] sending data chunk of length 39 (deferred)
2021-09-19T16:53:33.266Z | call_stream | [79] calling end() on HTTP/2 stream
2021-09-19T16:53:33.273Z | call_stream | [79] Node error event: message=read ECONNRESET code=ECONNRESET errno=Unknown system error -104 syscall=read
2021-09-19T16:53:33.273Z | subchannel | x.x.x.x:50051 connection closed with error read ECONNRESET
2021-09-19T16:53:33.274Z | subchannel | x.x.x.x:50051 connection closed
2021-09-19T16:53:33.274Z | subchannel | x.x.x.x:50051 READY -> IDLE
2021-09-19T16:53:33.274Z | subchannel_refcount | x.x.x.x:50051 refcount 5 -> 4
2021-09-19T16:53:33.274Z | pick_first | READY -> IDLE
2021-09-19T16:53:33.275Z | resolving_load_balancer | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.275Z | connectivity_state | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.275Z | subchannel_refcount | x.x.x.x:50051 refcount 4 -> 3
2021-09-19T16:53:33.275Z | pick_first | READY -> IDLE
2021-09-19T16:53:33.275Z | resolving_load_balancer | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.275Z | connectivity_state | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.275Z | subchannel_refcount | x.x.x.x:50051 refcount 3 -> 2
2021-09-19T16:53:33.276Z | pick_first | READY -> IDLE
2021-09-19T16:53:33.276Z | resolving_load_balancer | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.276Z | connectivity_state | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.276Z | subchannel_refcount | x.x.x.x:50051 refcount 2 -> 1
2021-09-19T16:53:33.276Z | pick_first | READY -> IDLE
2021-09-19T16:53:33.276Z | resolving_load_balancer | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.276Z | connectivity_state | dns:x-des:50051 READY -> IDLE
2021-09-19T16:53:33.277Z | call_stream | [79] HTTP/2 stream closed with code 2
2021-09-19T16:53:33.277Z | call_stream | [79] ended with status: code=14 details="read ECONNRESET"
2021-09-19T16:53:33.278Z | subchannel_refcount | x.x.x.x:50051 callRefcount 1 -> 0
error: 2021-09-19T16:53:33.278Z: Error serving unary request 14 UNAVAILABLE: read ECONNRESET [{"message":"14 UNAVAILABLE: read ECONNRESET"}]
Sometimes connections get dropped for reasons outside of the library's control. Just retry the call when that happens. The client will reconnect when you do.
In addition, you say "We are creating the Grpc client instance every time when message is processed." We recommend not doing that. You can just use one client for all of your requests. Using multiple clients isn't even accomplishing anything for you right now. They're all using the same underlying connection anyway.
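As a rough sketch of that advice (the .proto path is a placeholder, and the package path and method name are only inferred from the logs above; createInsecure() assumes a plaintext server), something like the following creates one client up front and retries a unary call on UNAVAILABLE:
import * as grpc from '@grpc/grpc-js';
import * as protoLoader from '@grpc/proto-loader';

// Load the service definition once; the .proto path is a placeholder.
const packageDefinition = protoLoader.loadSync('device_service.proto');
const proto = grpc.loadPackageDefinition(packageDefinition) as any;

// Create ONE client and reuse it for every request. It manages the
// underlying connection and reconnects after a dropped connection.
const client = new proto.io.xingular.device.Service(
  'x-des:50051',
  grpc.credentials.createInsecure() // or createSsl() for a TLS server
);

// Wrap the unary call with a simple retry on UNAVAILABLE (e.g. ECONNRESET).
function readWithRetry(request: object, retriesLeft = 3): Promise<unknown> {
  return new Promise((resolve, reject) => {
    client.read(request, (err: grpc.ServiceError | null, response: unknown) => {
      if (err && err.code === grpc.status.UNAVAILABLE && retriesLeft > 0) {
        resolve(readWithRetry(request, retriesLeft - 1));
      } else if (err) {
        reject(err);
      } else {
        resolve(response);
      }
    });
  });
}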
Try downgrading to Node version 16.8.0 and see if that helps. We tested Node versions 16.9.0 and 16.9.1 with no luck; downgrading to Node version 16.8.0 works without any problems.
@gillsoftab what exact issues did you have with recent node versions?
With Node versions 16.9.0 and 16.9.1 we saw the same issues as described in https://github.com/nodejs/node/issues/39683; we also detected issues with an app that uses a template (a Node add-on in C++).
Sometimes connections get dropped for reasons outside of the library's control. Just retry the call when that happens. The client will reconnect when you do.
In addition, you say "We are creating the Grpc client instance every time when message is processed." We recommend not doing that. You can just use one client for all of your requests. Using multiple clients isn't even accomplishing anything for you right now. They're all using the same underlying connection anyway.
I am having the same issue as this. I am in the process of downgrading my docker containers to use Node 16.8.0 as suggested. However, it is probably a good idea to have retries.
Could anyone let me know if there is a way to do this via the @grpc/grpc-js config options? Looking at the gRPC .NET library I can see there is a MaxAttempts = 5 param. Does anyone know if there is something similar in the @grpc/grpc-js library?
All I can see in this library is "grpc.initial_reconnect_backoff_ms" under channelOptions, so my assumption would be that it does retries by default; however, it does not seem like this is the case.
All I can see in this library is "grpc.initial_reconnect_backoff_ms" under channelOptions, so my assumption would be that it does retries by default; however, it does not seem like this is the case.
Connections are separate things from requests. This is one of the options that controls how the channel re-establishes connections after they are dropped. The @grpc/grpc-js library does not currently have any built-in support for retrying requests automatically, so you will need to do that yourself. For example, you could use an interceptor, as described in the interceptor spec's examples section.
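For reference, a minimal sketch of the interceptor wiring (not the full retry logic) using the builders exported by @grpc/grpc-js; it only logs non-OK statuses, and the retry behaviour from the spec example would go in the onReceiveStatus handler:
import * as grpc from '@grpc/grpc-js';

// Pass-through interceptor that logs non-OK statuses. A real retry
// interceptor would also save the outgoing message and metadata and start
// a fresh call from onReceiveStatus, as in the spec's examples section.
function statusLoggingInterceptor(options: any, nextCall: any) {
  const requester = new grpc.RequesterBuilder()
    .withStart((metadata, listener, next) => {
      const newListener = new grpc.ListenerBuilder()
        .withOnReceiveStatus((status, next) => {
          if (status.code !== grpc.status.OK) {
            console.warn(`call failed: code=${status.code} details=${status.details}`);
          }
          next(status);
        })
        .build();
      next(metadata, newListener);
    })
    .build();
  return new grpc.InterceptingCall(nextCall(options), requester);
}

// Interceptors can be supplied per call, e.g.
// client.read(request, { interceptors: [statusLoggingInterceptor] }, callback);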
Thanks for the info and the link. Just an FYI to the rest of the thread: downgrading Node to 16.8.0 did not resolve the ECONNRESET [{"message":"14 UNAVAILABLE: read ECONNRESET"}] issue for me. It is still happening sporadically. What I am finding odd is that when developing locally, I never get this error (I develop locally with Node 16.8.0). When I deploy my services with docker and docker swarm, that is when I get the issue. I have 9 services and I only get this issue with 2 of them; I am yet to see it happen with any of the others.
I am using @grpc/grpc-js: 1.3.7.
I am using Node 16.8.0 and I have also downgraded to
"@grpc/grpc-js": "1.2.1",
"@grpc/proto-loader": "0.5.1",
and now the ECONNRESET [{"message":"14 UNAVAILABLE: read ECONNRESET"}] seems to have disappeared for me.
One of the 1.3.x releases changed how ECONNRESET errors were reported. In 1.2.x, they had the error code INTERNAL instead of UNAVAILABLE. Are you getting any INTERNAL errors with your new setup?
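If you do retry manually, one hedged approach is to treat both codes as transient, since the same underlying ECONNRESET can surface as either depending on the grpc-js release; a small sketch:
import { status, ServiceError } from '@grpc/grpc-js';

// ECONNRESET was reported as INTERNAL on 1.2.x and as UNAVAILABLE on later
// releases, so a manual retry helper may want to accept both.
const RETRYABLE_CODES = new Set<status>([status.UNAVAILABLE, status.INTERNAL]);

function isRetryableError(err: ServiceError): boolean {
  return RETRYABLE_CODES.has(err.code);
}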
Ah damn, yes I am; it has just happened sporadically again. Do you think it could be an issue with NestJS (the framework I use)? It uses grpc-js under the hood. https://docs.nestjs.com/microservices/grpc
I managed to implement a retry interceptor as shown HERE. However, it hangs. The if statement condition will resolve to true and retry will be called, but then it will just hang; no response from onReceiveMessage or onReceiveStatus.
if (response.code !== status.OK) {
  console.log('CONNECTION FAILED: retry... ' + retries);
  retry(savedSendMessage, savedMetadata);
} else {
  console.log('ALL GOOD');
  savedMessageNext(savedReceiveMessage);
  next(response);
}
At this point, there seems to be no quick-fix solution. Not sure if it is a library issue or a much deeper issue.
ECONNRESET is a relatively low-level network error. It's unlikely that it's caused by the framework.
You might want to try this version of the retry interceptor, which was actually tested. Just remove the code that references the registry, because that's there for the test.
That makes sense, thanks, I'll give it a try. Thanks for all the help, appreciate it; I've been pulling my hair out.
Hey, so I have been poking around a lot today, and whilst I do not have an answer, I have some findings.
I use docker-compose to bring up all the microservices inside docker containers. When I run these containers on my local machine, all the microservices work fine (I never get [{"message":"14 UNAVAILABLE: read ECONNRESET"}]).
However, when running them on my development network (an AWS EC2 instance with all the containers on the same VM), I do sporadically get the [{"message":"14 UNAVAILABLE: read ECONNRESET"}]. It seems that I get the error randomly; then when I make a call again it will work multiple times. Then if I wait some time I will get the error again.
I enabled grpc-js debugging on the development network and I noticed this SSL error: [150] Node error event: message=read ECONNRESET code=ECONNRESET errno=Unknown system error -104 syscall=read
This led me to ssh into the docker containers on my development network and use this command-line tool to try to connect to one microservice from another: https://github.com/fullstorydev/grpcurl
I found that if I send a message using -plaintext, it successfully connects to the microservice every single time. If I use TLS, I get "first record does not look like a TLS handshake" (which is an error from the command-line tool).
Just out of curiosity, do we know if grpc-js has some sort of fallback for if TLS fails?
Just out of curiosity, do we know if grpc-js has some sort of fallback for if TLS fails?
That's not a thing, in general. If you instruct the client to connect over TLS, it will connect over TLS, or not at all.
If the problem is that the response you're getting is not a TLS handshake, are you sure that you are connecting to a TLS server? The port that is serving the plaintext service can't also be serving TLS; are you sure you are connecting to the correct TLS server port when trying to make a request using TLS?
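For what it's worth, on the grpc-js side whether the client speaks TLS is determined entirely by the credentials object passed when constructing it; a minimal sketch (the client class name, targets, and ports below are placeholders):
import * as grpc from '@grpc/grpc-js';

// MyServiceClient is a hypothetical generated client class (see the earlier sketch).
// Plaintext: must point at the port serving plaintext.
const plaintextClient = new MyServiceClient(
  'my-service:50051',
  grpc.credentials.createInsecure()
);

// TLS: must point at the port serving TLS; there is no fallback between the two.
const tlsClient = new MyServiceClient(
  'my-service:50052',
  grpc.credentials.createSsl() // optionally pass a root CA certificate Buffer
);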
I am unsure; the instantiation of the grpc-js client happens under the hood in the NestJS framework. I'll dig into the code and take a look.
On another note, I think this issue may possibly be caused by docker swarm; at this point I am just going through a process of elimination. I have booted a new EC2 instance and installed all of my containers with just docker-compose, and the error has disappeared. I am just in the process of booting up another instance, installing with docker-compose, and then putting a load balancer in front of the 2 instances (this is my current setup but without swarm) to see if the issue comes back.
The questions I asked are not even necessarily about grpc-js. You said
This led me to ssh into the docker containers on my development network and use this command-line tool to try to connect to one microservice from another: https://github.com/fullstorydev/grpcurl
I found that if I send a message using -plaintext, it successfully connects to the microservice every single time. If I use TLS, I get "first record does not look like a TLS handshake" (which is an error from the command-line tool).
What ports did you connect to for these two tests? On the server you connected to, what port is serving in plaintext and what port is serving in TLS?
Hey @Jaypov, I just want to know if you were able to fix it with that change in docker.
Thanks!
We also face the same issue, sporadically getting the "message":"14 UNAVAILABLE: read ECONNRESET".
Some observations from me. Client env: Node.js version 14.16.1, @grpc/grpc-js: 1.3.8 and @grpc/proto-loader: 0.5.6.
In some clients: Node.js version 14.16.1, @grpc/grpc-js: 1.4.4 and @grpc/proto-loader: 0.5.6.
Server env: Node.js version 14.16.1, @grpc/grpc-js: 1.4.4 and @grpc/proto-loader: 0.5.6.
These experience the "message":"14 UNAVAILABLE: read ECONNRESET" issue sporadically.
But in one microservice client, where the client uses NestJS: Node.js version 14.16.1, +-- @grpc/[email protected] +-- [email protected]
Server: Node.js version 14.16.1, @grpc/grpc-js: 1.4.4 and @grpc/proto-loader: 0.5.6.
We have never seen read ECONNRESET once in those logs.
We migrated all Node.js services to @grpc/grpc-js: 1.2.4 and @grpc/proto-loader: 0.5.6 but still face this issue.
To new commenters, see my first response on this issue:
Sometimes connections get dropped for reasons outside of the library's control. Just retry the call when that happens. The client will reconnect when you do.
We have the same problem.
@murgatroid99 Do you have a plan to implement an interceptor (or something different) in upcoming grpc-js releases?
The client interceptor API has been available in grpc-js since before the 1.0 release.
Hello, I opened a new issue for read ECONNRESET: #1994
#2115 We are using the retry interceptor and it works for unary requests, but for client streaming, after the first chunk is sent the connection is closed and we receive back a gRPC response. We followed this version of the interceptor.
#2115 This is a pressing issue for us. Any updates on how we can use the retry interceptor for streaming use cases?
Still no answer?
Still no answer?
Still no answer? We are also having this issue; it's quite frustrating because we can't replicate it.
I'm running into the same issue with streaming requests, but I have additional info that might help. I'm specifically using the retry interceptor whose implementation closely matches the one from the tests.
I see the following traceback for streaming requests:
Trace: Request path: /object_attributes.object_attributes_grpc/ListAssortmentObjectAttributes; response status code: 0; not eligible for retry.
at Object.onReceiveStatus (/workspace/atlas/common/lib/client/interceptors.ts:135:41)
at InterceptingListenerImpl.onReceiveStatus (/workspace/atlas/node_modules/@grpc/grpc-js/src/call-stream.ts:206:19)
at Object.onReceiveStatus (/workspace/atlas/node_modules/@grpc/grpc-js/src/client-interceptors.ts:424:48)
at /workspace/atlas/node_modules/@grpc/grpc-js/src/call-stream.ts:334:24
at processTicksAndRejections (node:internal/process/task_queues:78:11)
savedMessageNext: undefined
TypeError: savedMessageNext is not a function
at Object.onReceiveStatus (/workspace/atlas/common/lib/client/interceptors.ts:141:33)
at InterceptingListenerImpl.onReceiveStatus (/workspace/atlas/node_modules/@grpc/grpc-js/src/call-stream.ts:206:19)
at Object.onReceiveStatus (/workspace/atlas/node_modules/@grpc/grpc-js/src/client-interceptors.ts:424:48)
at /workspace/atlas/node_modules/@grpc/grpc-js/src/call-stream.ts:334:24
at processTicksAndRejections (node:internal/process/task_queues:78:11)
[ERROR] 10:16:28 TypeError: savedMessageNext is not a function
Looking at InterceptingListenerImpl, it seems that a next method is not passed to nextListener, which would result in savedMessageNext being undefined.
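A narrow, hypothetical guard for that failure mode (assuming the savedMessageNext / savedReceiveMessage variables from the spec-style interceptor) can at least avoid the TypeError in onReceiveStatus, though it does not explain why no message was delivered in the first place:
// Inside onReceiveStatus: only forward a buffered message if
// onReceiveMessage actually ran and gave us a next callback.
if (typeof savedMessageNext === 'function') {
  savedMessageNext(savedReceiveMessage);
}
next(status);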
I had to link to the 1.7 branch above. I was surprised to see that call-stream.ts is no longer on master, and then I noticed the work on the retry implementation by @murgatroid99. I would gladly try this out on our stack once it is ready, and thanks for the great work!
This works for us:
const channelOptions: ChannelOptions = {
// Send keepalive pings every 10 seconds, default is 2 hours.
'grpc.keepalive_time_ms': 10 * 1000,
// Keepalive ping timeout after 5 seconds, default is 20 seconds.
'grpc.keepalive_timeout_ms': 5 * 1000,
// Allow keepalive pings when there are no gRPC calls.
'grpc.keepalive_permit_without_calls': 1,
};
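For reference, assuming a generated client class (the class name and target below are placeholders), these channel options are passed as the third constructor argument:
import * as grpc from '@grpc/grpc-js';

// channelOptions is the object defined above; MyServiceClient is a
// hypothetical generated client class.
const client = new MyServiceClient(
  'my-service:50051',
  grpc.credentials.createInsecure(),
  channelOptions
);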
✌️