Add non-exiting methods and implement SIGHUP reload
Hey team! I've made a few small tweaks to the server to improve its flexibility and add some reload functionality. Here's what's new:
- Added `pub fn try_bootstrap(&mut self) -> Result<bool>` that doesn't call exit
- Refactored `pub fn bootstrap(&mut self)` to use the new `try_bootstrap` method
- Added `pub fn run_server(mut self, enable_daemon: bool) -> Result<bool>` that doesn't call exit
- Kept `pub fn run_forever(mut self) -> !` the same, but now it calls `run_server` under the hood
- Implemented `SIGHUP` catching for reloading
- Added an example `server_reload` to show how to reload within a tokio runtime (see the sketch below)
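For context, here is a minimal sketch of how these pieces can be driven from the caller's side. This is not the actual `server_reload` example: `build_server()` is a hypothetical stand-in for constructing a `Server` and registering its services, and the bool returned by `run_server()` is read here as "a reload (SIGHUP) was requested".

```rust
use pingora_core::server::Server;

// Hypothetical helper: construct a Server; real code would also register
// services on it (Server::add_service) before returning.
fn build_server() -> Server {
    Server::new(None).expect("failed to create server")
}

fn main() {
    loop {
        let mut server = build_server();
        // try_bootstrap() reports failure through its Result instead of
        // calling exit, so the caller decides how to handle errors.
        server.try_bootstrap().expect("bootstrap failed");
        // run_server() blocks until shutdown; false = don't daemonize.
        // The returned bool is read here as "a reload was requested".
        let reload = server.run_server(false).expect("server error");
        if !reload {
            break; // regular shutdown: leave the loop without exiting the process
        }
        // SIGHUP: loop around and bring up a fresh Server instance
    }
}
```

Because neither method calls exit, the same process can tear the old instance down and bootstrap a new one.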
Related issues:
- #154
- #292
I think it is nice to break down the functions so that they are more flexible.
On the other hand, the reload in the example has a downtime (the old service shuts down before the new service starts). Do you have a plan to address that?
To me, a graceful reload should work as follows when the reload signal arrives:
1. start the new service and wait for the listeners to be passed to it
2. signal the old service to gracefully exit
3. wait for `grace_period_seconds` before fully stopping/killing the old service
The current `run_server()` does 2) + 3) before returning for the example to do 1), so there is a window where the service appears to be offline.
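For illustration, here is a small self-contained model of that ordering. None of these names are pingora APIs: a oneshot channel stands in for "the listeners were passed to the new service" and a watch channel for the old service's shutdown signal.

```rust
// Requires tokio = { version = "1", features = ["full"] }
use std::time::Duration;
use tokio::sync::{oneshot, watch};

async fn reload_sequence(grace_period_seconds: u64) {
    let (fds_passed_tx, fds_passed_rx) = oneshot::channel::<()>();
    let (shutdown_tx, mut shutdown_rx) = watch::channel(false);

    // Old service: keeps serving until it is told to exit gracefully.
    let old = tokio::spawn(async move {
        let _ = shutdown_rx.changed().await;
        // ... stop accepting, drain in-flight requests ...
    });

    // 1) Start the new service; it reports once the listener FDs reached it.
    let new = tokio::spawn(async move {
        // ... receive the listener FDs and start accepting on them ...
        let _ = fds_passed_tx.send(());
    });
    let _ = fds_passed_rx.await;

    // 2) Only now signal the old service to exit gracefully.
    let _ = shutdown_tx.send(true);

    // 3) Allow up to grace_period_seconds before treating it as fully stopped.
    tokio::time::sleep(Duration::from_secs(grace_period_seconds)).await;
    let _ = tokio::join!(old, new);
}

#[tokio::main]
async fn main() {
    reload_sequence(5).await;
}
```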
I agree that the new service should be started before shutting down the old one, and I thought I had attempted to do this in the example. After conducting some simple tests, it seemed to work as expected.
However, more thorough testing might be necessary to ensure everything is functioning correctly. I'll conduct some additional tests to verify this.
After trying it out, I found that the actual behavior in the example is:
- signal the old service to gracefully exit
- start the new service and wait for the listeners to be passed to it
These two steps can be considered to occur simultaneously.
If the old service doesn't start to gracefully exit, the call to `bootstrap()` during the upgrade will fail, preventing the new service from actually starting. So it seems there wouldn't be much difference if we notified the old service to start gracefully exiting before calling `bootstrap()`.
From the analysis and the logs, it appears that, at least when concurrency isn't high, requests continue to be served without downtime from the moment the old service begins its graceful exit until the new service successfully starts.
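Presumably the new instance goes through the upgrade path here, i.e. its bootstrap waits for the transferred sockets instead of binding fresh ones, which is why it can only come up once the old instance has started handing them over. Below is a rough sketch of constructing such an instance; the helper name is made up, and the `Opt` fields shown are the ones documented for current pingora, so they may differ by version.

```rust
use pingora_core::server::configuration::Opt;
use pingora_core::server::Server;

// Hypothetical helper: build the replacement Server in upgrade mode so that
// try_bootstrap() waits for the listener FDs from the running instance
// instead of binding new sockets.
fn build_upgraded_server() -> Server {
    let opt = Opt {
        upgrade: true,   // take over the listener sockets from the old instance
        daemon: false,
        nocapture: false,
        test: false,
        conf: None,      // or Some(path) if the config file should be re-read too
    };
    let mut server = Server::new(Some(opt)).expect("failed to create server");
    // re-register services here, e.g. server.add_service(...)
    // During the upgrade, bootstrap retries the socket transfer ("No incoming
    // socket transfer, sleep 1s and try again" in the logs) until the old
    // instance, having started its graceful exit, sends the sockets over.
    server.try_bootstrap().expect("upgrade bootstrap failed");
    server
}
```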
Logs:
[2024-10-08T03:17:27Z INFO pingora_core::server] SIGHUP received, sending socks and gracefully reloading
[2024-10-08T03:17:27Z INFO pingora_core::server] Trying to send socks
[2024-10-08T03:17:27Z WARN pingora_core::server::transfer_fd] server not ready, will try again in 1s
[2024-10-08T03:17:27Z INFO pingora_core::server] Bootstrap starting
[2024-10-08T03:17:27Z ERROR pingora_core::server::transfer_fd] No incoming socket transfer, sleep 1s and try again
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 109
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 110
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 111
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 112
[2024-10-08T03:17:28Z INFO pingora_core::server] listener sockets sent
[2024-10-08T03:17:28Z INFO pingora_core::server] Bootstrap done
[2024-10-08T03:17:28Z INFO pingora_core::server] Server starting
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 113
... similar debug logs omitted ...
[2024-10-08T03:17:33Z DEBUG server_reload::app::echo] request count: 136
[2024-10-08T03:17:33Z INFO pingora_core::server] Broadcasting graceful shutdown
[2024-10-08T03:17:33Z INFO pingora_core::server] Graceful shutdown started!
[2024-10-08T03:17:33Z INFO pingora_core::server] Broadcast graceful shutdown complete
[2024-10-08T03:17:33Z INFO pingora_core::server] Graceful shutdown: grace period 5s starts
[2024-10-08T03:17:33Z INFO pingora_core::services::listening] Shutting down 0.0.0.0:6145
[2024-10-08T03:17:33Z INFO pingora_core::server] service exited.
[2024-10-08T03:17:33Z DEBUG server_reload::app::echo] request count: 137
... similar debug logs omitted ...
[2024-10-08T03:17:38Z DEBUG server_reload::app::echo] request count: 160
[2024-10-08T03:17:38Z INFO pingora_core::server] Graceful shutdown: grace period ends
[2024-10-08T03:17:38Z INFO pingora_core::server] Waiting for runtimes to exit!
[2024-10-08T03:17:38Z DEBUG server_reload::app::echo] request count: 161
... similar debug logs omitted ...
[2024-10-08T03:17:43Z DEBUG server_reload::app::echo] request count: 184
[2024-10-08T03:17:43Z INFO pingora_core::server] All runtimes exited, exiting now
[2024-10-08T03:17:43Z DEBUG server_reload::app::echo] request count: 185
[2024-10-08T03:17:43Z INFO server_reload] Reload: true
...