When receiving SIGTERM, enable maintenance.

Open DimCitus opened this issue 4 years ago • 3 comments

That way when the node shuts down, the group FSM has done the necessary steps already. If the node was a primary, another node has been elected to take over already.

In most cases (systemd, Docker, Kubernetes), the SIGTERM signal is how we learn that a local node has been asked to shut down, after all.

Fixes #655.

DimCitus avatar Apr 26 '21 20:04 DimCitus

There should be a test added that checks for this new behaviour.

JelteF avatar Apr 28 '21 10:04 JelteF

When trying this out locally with pg_autoctl stop --pgdata node1 (the primary), I got this error in the logs:

12:26:10 18285 INFO  pg_autoctl received signal SIGTERM, terminating
12:26:13 18288 INFO  Monitor assigned new state "maintenance"
12:26:13 18288 INFO  FSM transition from "prepare_maintenance" to "maintenance": Setting up Postgres in standby mode for maintenance operations
12:26:13 18288 INFO  Creating the standby signal file at "/home/jelte/work/pg_auto_failover/tmux/node1/standby.signal", and replication setup at "/home/jelte/work/pg_auto_failover/tmux/node1/postgresql-auto-failover-standby.conf"
12:26:13 18288 INFO  Transition complete: current state is now "maintenance"
12:26:13 18288 INFO  Shutdown sequence complete: reached state "maintenance"
12:26:13 18285 INFO  Service leader with pid 18288 has terminated, now stopping other services
12:26:13 18285 ERROR Failed to send signal SIGTERM to service postgres with pid 18287
12:26:13 18285 INFO  Stop pg_autoctl

JelteF avatar Apr 28 '21 10:04 JelteF

> When trying this out locally with pg_autoctl stop --pgdata node1 (the primary), I got this error in the logs:

I can't reproduce this at the moment. I suppose the Postgres service had already shut down by the time the supervisor tried to signal it, which would be a strange race condition. Maybe we should integrate some of this logic into the main waitpid loop, but I'm not sure there is a good solution: a race between waitpid and kill seems impossible to avoid in general. We could run kill -0 <pid> to check that the pid still exists before sending the actual signal, but that is still open to a race condition, of course.
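One way to soften the symptom, sketched here under assumptions and not taken from the pg_auto_failover code, is to treat `ESRCH` from `kill()` as "the process is already gone" instead of as a failure. The function name `signal_tolerating_exit` is hypothetical. It does not remove the race (a pid could in principle be reused between the exit and the call); it only avoids logging an error when the target has already exited and been reaped.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/*
 * Hypothetical helper: send signo to pid.
 *
 * Returns 0 when the signal was delivered (or, for signo 0, when the
 * process exists), 1 when the process no longer exists (ESRCH), and
 * -1 on any other error, with errno left as set by kill().
 */
int
signal_tolerating_exit(pid_t pid, int signo)
{
	if (kill(pid, signo) == 0)
	{
		return 0;
	}

	if (errno == ESRCH)
	{
		/* nothing left to signal: the caller can treat this as success */
		return 1;
	}

	return -1;
}
```

A supervisor could then log the `1` case at a lower level than ERROR. Note that a child that has exited but has not yet been reaped by waitpid still exists as a zombie, so `kill()` succeeds on it and the function returns 0.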

DimCitus avatar Apr 28 '21 10:04 DimCitus