When receiving SIGTERM, enable maintenance.
That way when the node shuts down, the group FSM has done the necessary steps already. If the node was a primary, another node has been elected to take over already.
In most cases (systemd, docker, kubernetes) the way we know that a local node has been asked to shut down is the SIGTERM signal, after all.
Fixes #655.
There should be a test added that checks for this new behaviour.
When trying this out locally with pg_autoctl stop --pgdata node1 (the primary), I got this error in the logs:
12:26:10 18285 INFO pg_autoctl received signal SIGTERM, terminating
12:26:13 18288 INFO Monitor assigned new state "maintenance"
12:26:13 18288 INFO FSM transition from "prepare_maintenance" to "maintenance": Setting up Postgres in standby mode for maintenance operations
12:26:13 18288 INFO Creating the standby signal file at "/home/jelte/work/pg_auto_failover/tmux/node1/standby.signal", and replication setup at "/home/jelte/work/pg_auto_failover/tmux/node1/postgresql-auto-failover-standby.conf"
12:26:13 18288 INFO Transition complete: current state is now "maintenance"
12:26:13 18288 INFO Shutdown sequence complete: reached state "maintenance"
12:26:13 18285 INFO Service leader with pid 18288 has terminated, now stopping other services
12:26:13 18285 ERROR Failed to send signal SIGTERM to service postgres with pid 18287
12:26:13 18285 INFO Stop pg_autoctl
I can't reproduce it at the moment. I suppose that's because the Postgres service was already shut down by the time the supervisor wanted to signal it: a strange race condition. Maybe we should integrate some of that logic into the main waitpid loop, but I'm not sure there is a good solution there. A race between waitpid and kill seems impossible to avoid in general; we could kill -0 <pid> to check that the pid still exists before sending the actual signal, but that's still open to a race condition, of course.