Support request - tuning for long timeouts
My actions before raising this issue
- [ ] Followed the troubleshooting guide
- [x] Read/searched the docs
- [x] Searched past issues
Now the shutdownTimeout variable in watchdog is the same as the write_timeout environment variable.
This is unfortunate since if you set a high write_timeout value the watchdog will wait for twice that time when shutting down. This will happen even if there are no fprocesses running.
Expected Behaviour
The watchdog shuts down as quickly as possible when there are no running fprocesses.
Possible Solution
Since http.Server.Shutdown gracefully shuts down the server without interrupting any active connections, there might not be any need for <-time.Tick(shutdownTimeout) in the listenUntilShutdown function (I tested this locally with kind and didn't experience any problems).
However, a fully backward compatible change would be to add a shutdown_timeout environment variable that sets shutdownTimeout if specified and falls back to write_timeout otherwise.
Context
We want to be able to configure a high timeout limit for the fprocess while still having the watchdog shut down as quickly as possible when there are no running fprocesses, so that resources are used more efficiently.
Thanks for your interest.
> Now the shutdownTimeout variable in watchdog is the same as the write_timeout environment variable. This is unfortunate since if you set a high write_timeout value the watchdog will wait for twice that time when shutting down. This will happen even if there are no fprocesses running.
This is the safest way we can ensure there are no in-flight requests and mirrors the of-watchdog codebase.
Out of interest, what specific writeTimeout are you setting?
/set title: Support request - tuning for long timeouts
This is a request coming from us at Cognite btw @alexellis 😊 We target timeouts of 30 min+, and what we have observed is that all running processes seem to finish before the sleep lines are reached.
Just following up on this, is there any more information that is needed to resolve this issue, @alexellis?
One question I have is regarding what you say here
> This is the safest way we can ensure there are no in-flight requests and mirrors the of-watchdog codebase.
Why is this safer than not having the ticks there when the http server is already waiting until the active connections are terminated, as specified in the docs link above? Are you thinking about the requests that are traveling from the gateway/queue-worker to the watchdog?
I've not heard from @andeplane for a while, but I'd be happy to discuss putting time aside specifically to help Cognite with this challenge.
We had already solved this before this issue was written, but I asked our intern to write it up here for discussion. We are happy to provide a PR :)
We fixed this in OpenFaaS Standard in 2021.
https://www.openfaas.com/blog/long-running-jobs/
Closing as stale.
/lock: resolved 3 years ago.