pingora icon indicating copy to clipboard operation
pingora copied to clipboard

Add non-exiting methods and implement SIGHUP reload

Open samurai00 opened this issue 1 year ago • 3 comments

Hey team! I've tried to make a few small tweaks to our server to hopefully improve its flexibility and add some reload functionality. Here's what's new:

  • Added pub fn try_bootstrap(&mut self) -> Result<bool> that doesn't call exit
  • Refactored pub fn bootstrap(&mut self) to use the new try_bootstrap method
  • Added pub fn run_server(mut self, enable_daemon: bool) -> Result<bool> that doesn't call exit
  • Kept pub fn run_forever(mut self) -> ! the same, but now it calls run_server under the hood
  • Implemented SIGHUP catching for reloading
  • Added an example server_reload to show how to reload within a tokio runtime

Related issues:

  • #154
  • #292

samurai00 avatar Sep 30 '24 10:09 samurai00

I think it is nice to break down the functions so that it is more flexible.

On the other hand, the reload in the example has a downtime (the old service shuts down before the new service starts). Do you have a plan to address that?

To me, the way of graceful reload is the following when the reload signal arrives:

  1. start the new service and wait for the listeners to be passed to it
  2. signal the old service to gracefully exit
  3. wait for grace_period_seconds before fully stopping/killing the old service

The current run_server() does 2) + 3) before returning for the example to do 1). So there is a window that the service appear to be offline.

eaufavor avatar Sep 30 '24 19:09 eaufavor

I agree that the new service should be started before shutting down the old one, and I thought I had attempted to do this in the example. After conducting some simple tests, it seemed to work as expected.

However, more thorough testing might be necessary to ensure everything is functioning correctly. I'll conduct some additional tests to verify this.

samurai00 avatar Oct 01 '24 05:10 samurai00

I think it is nice to break down the functions so that it is more flexible.

On the other hand, the reload in the example has a downtime (the old service shuts down before the new service starts). Do you have a plan to address that?

To me, the way of graceful reload is the following when the reload signal arrives:

  1. start the new service and wait for the listeners to be passed to it
  2. signal the old service to gracefully exit
  3. wait for grace_period_seconds before fully stopping/killing the old service

The current run_server() does 2) + 3) before returning for the example to do 1). So there is a window that the service appear to be offline.

After trying, I found that in the example, the actual behavior is:

  1. signal the old service to gracefully exit
  2. start the new service and wait for the listeners to be passed to it

These two steps can be considered to occur simultaneously.

If the old service doesn't start to gracefully exit, the call to bootstrap() during the upgrade will fail, preventing the new service from actually starting. It seems there wouldn't be much difference if we notify the old service to start gracefully exiting before calling bootstrap().

From analysis and logs, it appears that in non-high concurrency situations, from the time the old service begins to gracefully exit until the new service successfully starts, requests can continue to work without downtime.

Logs:

[2024-10-08T03:17:27Z INFO  pingora_core::server] SIGHUP received, sending socks and gracefully reloading
[2024-10-08T03:17:27Z INFO  pingora_core::server] Trying to send socks
[2024-10-08T03:17:27Z WARN  pingora_core::server::transfer_fd] server not ready, will try again in 1s
[2024-10-08T03:17:27Z INFO  pingora_core::server] Bootstrap starting
[2024-10-08T03:17:27Z ERROR pingora_core::server::transfer_fd] No incoming socket transfer, sleep 1s and try again
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 109
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 110
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 111
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 112
[2024-10-08T03:17:28Z INFO  pingora_core::server] listener sockets sent
[2024-10-08T03:17:28Z INFO  pingora_core::server] Bootstrap done
[2024-10-08T03:17:28Z INFO  pingora_core::server] Server starting
[2024-10-08T03:17:28Z DEBUG server_reload::app::echo] request count: 113
... similar debug logs omitted ...
[2024-10-08T03:17:33Z DEBUG server_reload::app::echo] request count: 136
[2024-10-08T03:17:33Z INFO  pingora_core::server] Broadcasting graceful shutdown
[2024-10-08T03:17:33Z INFO  pingora_core::server] Graceful shutdown started!
[2024-10-08T03:17:33Z INFO  pingora_core::server] Broadcast graceful shutdown complete
[2024-10-08T03:17:33Z INFO  pingora_core::server] Graceful shutdown: grace period 5s starts
[2024-10-08T03:17:33Z INFO  pingora_core::services::listening] Shutting down 0.0.0.0:6145
[2024-10-08T03:17:33Z INFO  pingora_core::server] service exited.
[2024-10-08T03:17:33Z DEBUG server_reload::app::echo] request count: 137
... similar debug logs omitted ...
[2024-10-08T03:17:38Z DEBUG server_reload::app::echo] request count: 160
[2024-10-08T03:17:38Z INFO  pingora_core::server] Graceful shutdown: grace period ends
[2024-10-08T03:17:38Z INFO  pingora_core::server] Waiting for runtimes to exit!
[2024-10-08T03:17:38Z DEBUG server_reload::app::echo] request count: 161
... similar debug logs omitted ...
[2024-10-08T03:17:43Z DEBUG server_reload::app::echo] request count: 184
[2024-10-08T03:17:43Z INFO  pingora_core::server] All runtimes exited, exiting now
[2024-10-08T03:17:43Z DEBUG server_reload::app::echo] request count: 185
[2024-10-08T03:17:43Z INFO  server_reload] Reload: true
...

samurai00 avatar Oct 08 '24 03:10 samurai00