Node randomly exits even if Google PubSub subscription is opened
Problem description
Node sample app running behind a proxy randomly exits without any close/error event
Reproduction steps
We have a Node app which uses Google PubSub module (2.18.5). Few amount of time per day, we stop receiving messages. We never encounter subscription onClose or onError events. In order to troubleshoot, we wrote a small code to try to reproduce the behavior. We have a similar behavior. Few amount of time per day, the app exits properly without passing through onClose/onError subscription functions. The difference of behavior between our app and the sample code is our app has more stuff to do. That's why it just stopped receiving messages instead of exiting. We already opened a case to Google PubSub team. They never see disconnections on their side. Our issue is probably a network issue due to the proxy but our concern is we never trigger any onClose/onError events. So, we can't properly catch them. We verified that onMessage,onClose and onError subscription listeners are still present during process onExit event. We don't understand why Node exits in our sample code or why we stop receiving message in our app as the subscription streams connections seem to have been re-established. This behavior is completely random.
Environment
- OS name, version and architecture: [e.g. Linux Ubuntu 18.04 amd64] node docker image node:16.14.0-alpine3.15 amd64 running on Kubernetes cluster 1.20
- Node version [e.g. 8.10.0] 16.14.0 LTS
- Node installation method [e.g. nvm] docker image
- If applicable, compiler version [e.g. clang 3.8.0-2ubuntu4]
- Package name and version [e.g. [email protected]] @grpc/[email protected]
Additional context
Some interesting logs
call_stream | [705] cancelWithStatus code: 4 details: "Deadline exceeded"
call_stream | [705] close http2 stream with code 8
call_stream | [705] HTTP/2 stream closed with code 8
call_stream | [708] ended with status: code=14 details="Connection dropped"
full log: elastic_result.log code sample: index.js.txt
FYi, we wrote a similar code in Go. We don't have this behavior at all even if the container runs on the same Kubernetes node and uses the same proxy
I have published version grpc-js version 1.5.6 with some more logging. Can you try updating and running your test again, and share the resulting log, or at least the end of it?
Yep. Sure. Thank you.
Hello @murgatroid99 please find attached logs with 1.5.6. Thank you elastic_result_2022.02.24.log
OK, I published version 1.5.7 with a change that I think will address the problem. Can you try it out?
Sure. Thank you
Hello @murgatroid99 No more "Connection dropped" logs but the behavior is the same: Node.js exits. Subscription onClose/onError callback functions are never called. Please find attached 1.5.7 logs Thank you elastic_result_2022.02.24_02.log elastic_result_2022.02.25_01.log
Can you share the whole log and not just the very end?
What that log shows is much worse than I expected. The test ran for over two hours, and no request received any data from the server after 8 minutes into the test.
I see that you are using a proxy. Does that proxy log traffic, and if so, can you share the traffic log for the duration of the test?
Last time I requested proxy logs, I only get a log with CONNECT log entries. Not very useful :(
As the issue is completely random, it's a bit complicated to get logs during the issue
I hope I have a new occurence during the working day as we don't receive lot of PubSub messages during out of office.
Please find attached the full log of the last round. the one of the 2nd very end sent this morning. I guess it's similar as the previous one.
Something I didn't pay attention before preparing these full logs. The connection has been recovered multiple times without having node.js exiting. But few times per day (only twice during last 24h), node.js considers to have nothing more to do and exits.
That log looks very similar to the previous one. The client does reconnect multiple times, but as I said, it does not receive any stream data from the server after the very beginning of the log. I think it is likely that that has the same root cause as the abrupt exits, but I don't know what that is yet.
Hello @murgatroid99 please find attached a log during office hours (more PubSub messages) Thank you elastic_result_2022.03.01.log.gz
That is useful. In that log the client receives messages throughout the test, but it still experiences the same failure. I think that indicates that what I noticed earlier about the client not receiving messages was not actually related to the main problem here.
I did manage to get the provided test code working, but it ran for several hours without reproducing the problem. I think the problem may be caused by using a proxy, but I have not been able to test that yet.
Yes. It's definitely something with the proxy which creates random strange behavior with the client. As discussed, the problem is we are not able to catch this behavior :(
FWIW, I am also experiencing this issue. The kubernetes container listening on the pubsub subscription randomly restarts with exit code 0 several times a day. The close and error event handlers are not run.
Hello @levidyrek I'm sorry to hear that but I'm "happy" to know I'm not the only one to have the issue 😄
@WeetA34 did #2052 (v1.5.7+) fix the problem for you? I believe we are facing the same issue while running v1.3.8 on an AWS ECS cluster.
It is likely that this issue is obsolete due to significant changes that have been made to relevant parts of the code in more recent versions. Please file a new issue for any occurrences of the same problem with a more recent version of grpc-js.
@WeetA34 did #2052 (v1.5.7+) fix the problem for you? I believe we are facing the tame issue while running v1.3.8 on an AWS ECS cluster.
Sorry, i didn’t see the notification for your message in March. At the end, we bypassed the proxy. It fixed our issue. We never found the root cause. If i have time, i test again with latest version and proxy. Thank you