Logging stops after google.logging.v2.LoggingServiceV2 timeout
Environment details
- OS: Ubuntu 20, App Engine
- Node.js version: 20.9.0
- @google-cloud/logging version: 10.5.0
- @grpc/grpc-js version: 1.8.22
- google-gax version: 3.6.1
Steps to reproduce
- Run server on Google App Engine
- After some number of hours, we see the following error in our logging output:
GoogleError: Total timeout of API google.logging.v2.LoggingServiceV2 exceeded 60000 milliseconds before any response was received.
at repeat (/app/node_modules/google-gax/src/normalCalls/retries.ts:84:23)
at Timeout._onTimeout (/app/node_modules/google-gax/src/normalCalls/retries.ts:125:13)
at listOnTimeout (node:internal/timers:573:17)
at processTimers (node:internal/timers:514:7) {
From this point onwards, no logs from our app are written or visible in the Log Explorer. Notably, we followed the steps from https://cloud.google.com/logging/docs/agent/logging/troubleshooting and fluentd seems to still be alive: we can see traces of later requests in the Log Explorer, but no app logging from them.
This seems like the same issue as 617, though in that case upgrading the package versions helped. I have included all the versions I am using at the top.
Is there some way to catch this error and retry logging overall? It seems odd that the logging client dies completely after one failed request.
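In the meantime, here is a minimal sketch of the kind of wrapper I have in mind: catch `log.write()` failures ourselves and fall back to stdout so entries are not silently dropped. The `writeWithRetry` helper, the log name, and the timeout/backoff values are all my own choices, not library APIs, and I have not verified that raising `gaxOptions.timeout` actually changes the 60000 ms total timeout reported in the error.

```js
// Minimal sketch, not a verified fix: retry failed writes and fall back to
// stdout so log entries are still visible somewhere.
const {Logging} = require('@google-cloud/logging');

const logging = new Logging();
const log = logging.log('app-log'); // hypothetical log name

async function writeWithRetry(entry, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      // gaxOptions are forwarded to the underlying LoggingServiceV2 call.
      await log.write(entry, {gaxOptions: {timeout: 120000}});
      return;
    } catch (err) {
      if (attempt === maxAttempts) {
        // Last resort: write to stdout/stderr so the message is not lost.
        console.error('Cloud Logging write failed:', err.message);
        console.log(JSON.stringify(entry.data));
        return;
      }
      // Simple linear backoff before retrying.
      await new Promise((resolve) => setTimeout(resolve, 1000 * attempt));
    }
  }
}

// Usage
const entry = log.entry({resource: {type: 'gae_app'}}, {message: 'hello'});
writeWithRetry(entry);
```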
We are running into the same issue. Was there any resolution to this?
This is the most annoying issue to deal with. We see this with "@google-cloud/logging-winston": "^6.0.1" and have spent weeks trying to find a reasonable solution or workaround. None of the "recommended" solutions worked, ranging from increasing the timeout using gaxOpts or patching loggingServiceV2, to using defaultCallback and process.onError to catch and handle the error gracefully, to configuring a fallback console transport, etc. The result is the same: logging timeouts occur periodically, causing the Node.js process to crash/restart in Cloud Run, and/or a lack of logging output as described earlier in this thread.
We looked for usage patterns but cannot trace this error to any increased activity within our application; it is exclusively service degradation on the GCP end. I've seen this reported as early as 2020/2021, and it still occurs 4 years later. Frankly, it's ridiculous that a logging API/library can bring down the server and that it has not been addressed by the GCP team for years.
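One workaround that may be worth trying, if it has not been already: configure @google-cloud/logging-winston to write structured logs to stdout and let the Cloud Run / App Engine agent forward them, so a LoggingServiceV2 timeout cannot block or crash the process. A minimal sketch, assuming the documented redirectToStdout option and that the platform agent picks up structured stdout logs:

```js
// Minimal sketch: bypass the LoggingServiceV2 API by emitting structured
// JSON to stdout; on Cloud Run / App Engine the platform's agent forwards
// stdout to Cloud Logging, so a gax timeout cannot take the logger down.
const winston = require('winston');
const {LoggingWinston} = require('@google-cloud/logging-winston');

const cloudLogging = new LoggingWinston({
  // Write structured log lines to process.stdout instead of calling the API.
  redirectToStdout: true,
});

const logger = winston.createLogger({
  level: 'info',
  transports: [cloudLogging],
});

logger.info('log entry written via stdout instead of the Logging API');
```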