How to recover from a current/assigned state combination without a possible state transition path?

Open jnehlmeier opened this issue 4 years ago • 4 comments

Consider the following state we encountered today during testing:

   Name |  Node |    Host:Port |        LSN |   Connection |       Current State |      Assigned State
--------+-------+--------------+------------+--------------+---------------------+--------------------
db002-2 |     4 | db002-2:5432 | 0/6709DFE8 |   read-write |        wait_primary |        wait_primary
db002-1 |     6 | db002-1:5432 | 0/6709C120 | read-write ! |        wait_primary |             demoted



 2021-07-13 07:47:17.440282+02 |    0/6 |             primary |             primary | New state is reported by node 6 "db002-1" (db002-1:5432): "primary"
 
 2021-07-13 11:18:27.359978+02 |    0/6 |             primary |            draining | Setting goal state of node 6 "db002-1" (db002-1:5432) to draining and node 4 "db002-2" (db002-2:5432) to prepare_promotion after a user-initiated failover.
 2021-07-13 11:18:27.359978+02 |    0/4 |           secondary |   prepare_promotion | Setting goal state of node 6 "db002-1" (db002-1:5432) to draining and node 4 "db002-2" (db002-2:5432) to prepare_promotion after a user-initiated failover.
 2021-07-13 11:18:28.320592+02 |    0/6 |             primary |        wait_primary | Setting goal state of node 6 "db002-1" (db002-1:5432) to wait_primary because none of the standby nodes in the quorum are healthy at the moment.
 2021-07-13 11:18:28.503273+02 |    0/6 |        wait_primary |        wait_primary | New state is reported by node 6 "db002-1" (db002-1:5432): "wait_primary"
 2021-07-13 11:18:28.768262+02 |    0/4 |   prepare_promotion |   prepare_promotion | New state is reported by node 4 "db002-2" (db002-2:5432): "prepare_promotion"
 2021-07-13 11:18:28.768262+02 |    0/4 |   prepare_promotion |    stop_replication | Setting goal state of node 6 "db002-1" (db002-1:5432) to demote_timeout and node 4 "db002-2" (db002-2:5432) to stop_replication after node 4 "db002-2" (db002-2:5432) converged to prepare_promotion.
 2021-07-13 11:18:28.768262+02 |    0/6 |        wait_primary |      demote_timeout | Setting goal state of node 6 "db002-1" (db002-1:5432) to demote_timeout and node 4 "db002-2" (db002-2:5432) to stop_replication after node 4 "db002-2" (db002-2:5432) converged to prepare_promotion.
 2021-07-13 11:18:31.832889+02 |    0/4 |    stop_replication |    stop_replication | New state is reported by node 4 "db002-2" (db002-2:5432): "stop_replication"
 2021-07-13 11:23:34.388164+02 |    0/6 |        wait_primary |      demote_timeout | Node node 6 "db002-1" (db002-1:5432) is marked as unhealthy by the monitor
 2021-07-13 11:31:18.036089+02 |    0/4 |    stop_replication |        wait_primary | Setting goal state of node 4 "db002-2" (db002-2:5432) to wait_primary and node 6 "db002-1" (db002-1:5432) to demoted after the demote timeout expired.
 2021-07-13 11:31:18.036089+02 |    0/6 |        wait_primary |             demoted | Setting goal state of node 4 "db002-2" (db002-2:5432) to wait_primary and node 6 "db002-1" (db002-1:5432) to demoted after the demote timeout expired.
 2021-07-13 11:31:19.049409+02 |    0/4 |        wait_primary |        wait_primary | New state is reported by node 4 "db002-2" (db002-2:5432): "wait_primary"
  2021-07-13 11:33:42.51511+02 |    0/4 |        wait_primary |        wait_primary | Node node 4 "db002-2" (db002-2:5432) is marked as unhealthy by the monitor
 2021-07-13 11:35:48.654584+02 |    0/4 |        wait_primary |        wait_primary | Node node 4 "db002-2" (db002-2:5432) is marked as healthy by the monitor

Before reaching this state node 6 was primary and node 4 was secondary. We then issued a pg_autoctl perform failover which stopped at

2021-07-13 11:18:31.832889+02 |    0/4 |    stop_replication |    stop_replication | New state is reported by node 4 "db002-2" (db002-2:5432): "stop_replication"

for several minutes. Since node 4 stopped replication we figured that node 6 now needs to finally demote but it did not happen. So we restarted node 6 to see if it solves the issue but we ultimately ended up in the above final state. In this state node 6 did not continue anymore because it does not know how to transition from wait_primary to demoted.

We couldn't figure out how to recover from this state. What needs to be done if there is no transition path between current state and assigned state? Do we really have to drop node 6, deleting all data on node 6 and then create a new secondary from it?

We used some pg_autoctl developer/debug commands (standby init, fsm assign/step) in order to convince node 6 to become a secondary but nothing really helped. At the end we fully reset node 6 and its data and registered it again as a new secondary node.

Jul 13 '21 14:07 jnehlmeier

Hi @jnehlmeier, thanks for another bug report. This one is troubling, having two nodes get both to wait_primary should never happen. Would you have kept pg_autoctl logs from the nodes from that test run?

Other than that, when all things break, the way forward is to drop a node and then create it again. In v1.6.1 it is possible at times to do that without having to delete the PGDATA, and the node can re-join as a secondary from the existing copy of the data set. Sometimes this requires an extra call to pg_rewind before hand, and this is not automated in pg_auto_failover yet. Sometimes the needed WAL files might be missing, too, and then you would have to delete the PGDATA and start afresh anyway.

Jul 16 '21 09:07 DimCitus

Hi @jnehlmeier, thanks for another bug report. This one is troubling, having two nodes get both to wait_primary should never happen. Would you have kept pg_autoctl logs from the nodes from that test run?

Below are logs collected during this test. However the three hosts that run the containers did not had synchronized times so logging timestamps are not easily comparable. I have added the diff in seconds to the headline of each log section below.

As you can see in the logs the monitor did not proceed (and during that pause we restarted the nodes multiple times)

"11:18:31 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): stop_replication ➜ stop_replication"
"11:30:53 1 INFO  pg_autoctl received signal SIGTERM, terminating"

because the monitor has put node 6 already in an unreachable state

MONITOR:

"11:18:28 14 INFO  Setting goal state of node 6 ""db002-1"" (db002-1:5432) to draining and node 4 ""db002-2"" (db002-2:5432) to prepare_promotion after a user-initiated failover."
"11:18:28 14 INFO  New state for node 6 ""db002-1"" (db002-1:5432): primary ➜ draining"
"11:18:28 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): secondary ➜ prepare_promotion"
"11:18:28 14 INFO  Setting goal state of node 6 ""db002-1"" (db002-1:5432) to wait_primary because none of the standby nodes in the quorum are healthy at the moment."
"11:18:28 14 INFO  New state for node 6 ""db002-1"" (db002-1:5432): primary ➜ wait_primary"
"11:18:28 14 INFO  New state is reported by node 6 ""db002-1"" (db002-1:5432): ""wait_primary"""
"11:18:28 14 INFO  New state for node 6 ""db002-1"" (db002-1:5432): wait_primary ➜ wait_primary"
"11:18:28 14 INFO  Setting goal state of node 6 ""db002-1"" (db002-1:5432) to demote_timeout and node 4 ""db002-2"" (db002-2:5432) to stop_replication after node 4 ""db002-2"" (db002-2:5432) converged to prepare_promotion."
"11:18:28 14 INFO  New state for node 6 ""db002-1"" (db002-1:5432): wait_primary ➜ demote_timeout"


NODE 6 (~+11s timestamp diff):

"11:18:39 30 INFO  Monitor assigned new state ""wait_primary"""
"11:18:39 30 INFO  FSM transition from ""primary"" to ""wait_primary"": Secondary became unhealthy"
"11:18:39 30 INFO  Disabling synchronous replication"
"11:18:39 30 INFO  Reloading Postgres configuration and HBA rules"
"11:18:39 30 INFO  Transition complete: current state is now ""wait_primary"""
"11:18:39 30 INFO  New state for node 4 ""db002-2"" (db002-2:5432): prepare_promotion ➜ prepare_promotion"
"11:18:39 30 INFO  New state for node 4 ""db002-2"" (db002-2:5432): prepare_promotion ➜ stop_replication"
"11:18:39 30 INFO  New state for this node (node 6, ""db002-1"") (db002-1:5432): wait_primary ➜ demote_timeout"
"11:18:39 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:18:39 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:18:39 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "

Full logs

Monitor logs:

"11:18:28 14 INFO  Setting goal state of node 6 ""db002-1"" (db002-1:5432) to draining and node 4 ""db002-2"" (db002-2:5432) to prepare_promotion after a user-initiated failover."
"11:18:28 14 INFO  New state for node 6 ""db002-1"" (db002-1:5432): primary ➜ draining"
"11:18:28 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): secondary ➜ prepare_promotion"
"11:18:28 14 INFO  Setting goal state of node 6 ""db002-1"" (db002-1:5432) to wait_primary because none of the standby nodes in the quorum are healthy at the moment."
"11:18:28 14 INFO  New state for node 6 ""db002-1"" (db002-1:5432): primary ➜ wait_primary"
"11:18:28 14 INFO  New state is reported by node 6 ""db002-1"" (db002-1:5432): ""wait_primary"""
"11:18:28 14 INFO  New state for node 6 ""db002-1"" (db002-1:5432): wait_primary ➜ wait_primary"
"11:18:28 14 INFO  Setting goal state of node 6 ""db002-1"" (db002-1:5432) to demote_timeout and node 4 ""db002-2"" (db002-2:5432) to stop_replication after node 4 ""db002-2"" (db002-2:5432) converged to prepare_promotion."
"11:18:28 14 INFO  New state for node 6 ""db002-1"" (db002-1:5432): wait_primary ➜ demote_timeout"
"11:18:28 14 INFO  New state is reported by node 4 ""db002-2"" (db002-2:5432): ""prepare_promotion"""
"11:18:28 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): prepare_promotion ➜ prepare_promotion"
"11:18:28 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): prepare_promotion ➜ stop_replication"
"11:18:31 14 INFO  New state is reported by node 4 ""db002-2"" (db002-2:5432): ""stop_replication"""
"11:18:31 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): stop_replication ➜ stop_replication"
"11:30:53 1 INFO  pg_autoctl received signal SIGTERM, terminating"
"11:30:53 14 INFO  Listener service received signal SIGTERM, terminating"
"11:30:53 13 INFO  Postgres controller service received signal SIGTERM, terminating"
"11:30:53 13 INFO  Stopping pg_autoctl postgres service"
"11:30:53 13 INFO  /usr/lib/postgresql/13/bin/pg_ctl --pgdata /var/lib/postgresql/monitor --wait stop --mode fast"
"11:30:53 1 INFO  Waiting for subprocesses to terminate."
"11:30:53 1 ERROR Unknown subprocess died with pid 24"
"11:30:53 1 INFO  Stop pg_autoctl"



... USER-INITIATED CONTAINER RESTART ...



"Given log level not defined or unknown"
"Using pg_auto_failover log level INFO ..."
"Generating .pgpass ... "
"Monitor already created."
"Generating pg_hba.conf using allowed subnet 10.41.1.0/24 ..."
"Starting pg_autoctl..."
"11:30:55 1 INFO  Started pg_autoctl postgres service with pid 13"
"11:30:55 13 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/postgresql/monitor -v"
"11:30:55 1 INFO  Started pg_autoctl listener service with pid 14"
"11:30:55 14 INFO   /usr/bin/pg_autoctl do service listener --pgdata /var/lib/postgresql/monitor -v"
"11:30:55 14 INFO  Managing the monitor at postgres://autoctl_node@db002-monitor:5432/pg_auto_failover?sslmode=require"
"11:30:55 14 INFO  Reloaded the new configuration from ""/home/postgres/.config/pg_autoctl/var/lib/postgresql/monitor/pg_autoctl.cfg"""
"11:30:56 23 INFO   /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/monitor -p 5432 -h *"
"11:30:56 14 INFO  The version of extension ""pgautofailover"" is ""1.5"" on the monitor"
"11:30:56 14 INFO  Contacting the monitor to LISTEN to its events."
"11:30:56 13 INFO  Postgres is now serving PGDATA ""/var/lib/postgresql/monitor"" on port 5432 with pid 23"
"11:31:18 14 INFO  Setting goal state of node 4 ""db002-2"" (db002-2:5432) to wait_primary and node 6 ""db002-1"" (db002-1:5432) to demoted after the demote timeout expired."
"11:31:18 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): stop_replication ➜ wait_primary"
"11:31:18 14 INFO  New state for node 6 ""db002-1"" (db002-1:5432): wait_primary ➜ demoted"
"11:31:19 14 INFO  New state is reported by node 4 ""db002-2"" (db002-2:5432): ""wait_primary"""
"11:31:19 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): wait_primary ➜ wait_primary"
"13:10:25 14 INFO  Node node 4 ""db002-2"" (db002-2:5432) is marked as unhealthy by the monitor"
"13:10:25 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): wait_primary ➜ wait_primary"
"13:10:25 14 INFO  Node node 4 ""db002-2"" (db002-2:5432) is marked as healthy by the monitor"
"13:10:25 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): wait_primary ➜ wait_primary"
"13:10:25 14 INFO  Removing node 6 ""db002-1"" (db002-1:5432) from formation ""default"" and group 0"
"13:10:25 14 INFO  Setting number_sync_standbys to 0 for formation ""default"" now that we have 0 standby nodes set with replication-quorum."
"13:10:25 14 INFO  Setting goal state of node 4 ""db002-2"" (db002-2:5432) to single as there is no other node."
"13:10:25 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): wait_primary ➜ single"
"13:10:25 14 INFO  Setting goal state of node 4 ""db002-2"" (db002-2:5432) to single as there is no other node."
"13:10:25 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): wait_primary ➜ single"
"13:10:26 14 INFO  New state is reported by node 4 ""db002-2"" (db002-2:5432): ""single"""
"13:10:26 14 INFO  New state for node 4 ""db002-2"" (db002-2:5432): single ➜ single"
"13:48:17 1 INFO  pg_autoctl received signal SIGTERM, terminating"
"13:48:17 13 INFO  Postgres controller service received signal SIGTERM, terminating"
"13:48:17 14 INFO  Listener service received signal SIGTERM, terminating"
"13:48:17 13 INFO  Stopping pg_autoctl postgres service"
"13:48:17 13 INFO  /usr/lib/postgresql/13/bin/pg_ctl --pgdata /var/lib/postgresql/monitor --wait stop --mode fast"
"13:48:17 1 ERROR Unknown subprocess died with pid 24"
"13:48:18 1 INFO  Stop pg_autoctl"

Node 6 (was primary before pg_autoctl perform failover, timestamps are ~+11s to monitor)

"11:18:39 30 INFO  Monitor assigned new state ""wait_primary"""
"11:18:39 30 INFO  FSM transition from ""primary"" to ""wait_primary"": Secondary became unhealthy"
"11:18:39 30 INFO  Disabling synchronous replication"
"11:18:39 30 INFO  Reloading Postgres configuration and HBA rules"
"11:18:39 30 INFO  Transition complete: current state is now ""wait_primary"""
"11:18:39 30 INFO  New state for node 4 ""db002-2"" (db002-2:5432): prepare_promotion ➜ prepare_promotion"
"11:18:39 30 INFO  New state for node 4 ""db002-2"" (db002-2:5432): prepare_promotion ➜ stop_replication"
"11:18:39 30 INFO  New state for this node (node 6, ""db002-1"") (db002-1:5432): wait_primary ➜ demote_timeout"
"11:18:39 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:18:39 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:18:39 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "
"11:18:40 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:18:40 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:18:40 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "
"11:18:41 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:18:41 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:18:41 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "
"11:18:42 30 INFO  New state for node 4 ""db002-2"" (db002-2:5432): stop_replication ➜ stop_replication"
"11:18:42 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:18:42 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:18:42 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "
"11:18:43 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:18:43 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:18:43 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "

... LOTS OF RETRIES ...

"11:23:33 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:23:33 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:23:33 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "
"11:23:34 1 INFO  pg_autoctl received signal SIGTERM, terminating"
"11:23:34 29 INFO  Postgres controller service received signal SIGTERM, terminating"
"11:23:34 29 INFO  Stopping pg_autoctl postgres service"
"11:23:34 29 INFO  /usr/lib/postgresql/13/bin/pg_ctl --pgdata /var/lib/postgresql/node --wait stop --mode fast"
"11:23:34 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:23:34 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:23:34 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "
"11:23:34 1 INFO  Stop pg_autoctl"



... USER-INITIATED CONTAINER RESTART ...



"Given log level not defined or unknown"
"Using pg_auto_failover log level INFO ..."
"Generating .pgpass file ... "
"pg_auto_failover node already created."
"Setting replication.password ..."
"Node configuration done."
"Starting pg_autoctl..."
"11:23:35 1 INFO  Started pg_autoctl postgres service with pid 29"
"11:23:35 29 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/postgresql/node -v"
"11:23:35 1 INFO  Started pg_autoctl node-active service with pid 30"
"11:23:35 30 INFO   /usr/bin/pg_autoctl do service node-active --pgdata /var/lib/postgresql/node -v"
"11:23:35 30 INFO  Reloaded the new configuration from ""/home/postgres/.config/pg_autoctl/var/lib/postgresql/node/pg_autoctl.cfg"""
"11:23:35 30 INFO  pg_autoctl service is running, current state is ""wait_primary"""
"11:23:35 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:23:36 30 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes."
"11:23:36 30 INFO  Checking for HBA rules for node 4 ""db002-2"" (db002-2:5432)"
"11:23:36 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl replication ""pgautofailover_replicator"" db002-2 trust"
"11:23:36 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl ""postgres"" ""pgautofailover_replicator"" db002-2 trust"
"11:23:36 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:23:36 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:23:36 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "
"11:23:37 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:23:37 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:23:37 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:23:37 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "

... LOTS OF RETRIES ...

"11:29:56 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:29:56 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:29:56 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:29:56 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "
"11:29:57 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:29:57 30 INFO  Monitor assigned new state ""demote_timeout"""
"11:29:57 30 FATAL pg_autoctl does not know how to reach state ""demote_timeout"" from ""wait_primary"""
"11:29:57 30 ERROR Failed to transition to state ""demote_timeout"", retrying... "
"11:29:58 1 INFO  pg_autoctl received signal SIGTERM, terminating"
"11:29:58 29 INFO  Postgres controller service received signal SIGTERM, terminating"
"11:29:58 1 INFO  Stop pg_autoctl"



... USER-INITIATED CONTAINER RESTART ...



"Given log level not defined or unknown"
"Using pg_auto_failover log level INFO ..."
"Generating .pgpass file ... "
"pg_auto_failover node already created."
"Setting replication.password ..."
"Node configuration done."
"Starting pg_autoctl..."
"11:32:25 1 INFO  Started pg_autoctl postgres service with pid 29"
"11:32:25 29 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/postgresql/node -v"
"11:32:25 1 INFO  Started pg_autoctl node-active service with pid 30"
"11:32:25 30 INFO   /usr/bin/pg_autoctl do service node-active --pgdata /var/lib/postgresql/node -v"
"11:32:25 30 INFO  Reloaded the new configuration from ""/home/postgres/.config/pg_autoctl/var/lib/postgresql/node/pg_autoctl.cfg"""
"11:32:25 30 INFO  pg_autoctl service is running, current state is ""wait_primary"""
"11:32:25 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:32:25 30 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes."
"11:32:25 30 INFO  Checking for HBA rules for node 4 ""db002-2"" (db002-2:5432)"
"11:32:25 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl replication ""pgautofailover_replicator"" db002-2 trust"
"11:32:25 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl ""postgres"" ""pgautofailover_replicator"" db002-2 trust"
"11:32:25 30 INFO  Monitor assigned new state ""demoted"""
"11:32:25 30 FATAL pg_autoctl does not know how to reach state ""demoted"" from ""wait_primary"""
"11:32:25 30 ERROR Failed to transition to state ""demoted"", retrying... "
"11:32:27 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:32:27 30 INFO  Monitor assigned new state ""demoted"""
"11:32:27 30 FATAL pg_autoctl does not know how to reach state ""demoted"" from ""wait_primary"""
"11:32:27 30 ERROR Failed to transition to state ""demoted"", retrying... "

... LOTS OF RETRIES ....

"11:33:32 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:33:33 30 INFO  Monitor assigned new state ""demoted"""
"11:33:33 30 FATAL pg_autoctl does not know how to reach state ""demoted"" from ""wait_primary"""
"11:33:33 30 ERROR Failed to transition to state ""demoted"", retrying... "
"11:33:34 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:33:34 1 INFO  pg_autoctl received signal SIGTERM, terminating"
"11:33:34 29 INFO  Postgres controller service received signal SIGTERM, terminating"
"11:33:34 30 INFO  Monitor assigned new state ""demoted"""
"11:33:34 30 FATAL pg_autoctl does not know how to reach state ""demoted"" from ""wait_primary"""
"11:33:34 30 ERROR Failed to transition to state ""demoted"", retrying... "
"11:33:34 1 INFO  Stop pg_autoctl"



... USER-INITIATED CONTAINER RESTART ....



"Given log level not defined or unknown"
"Using pg_auto_failover log level INFO ..."
"Generating .pgpass file ... "
"pg_auto_failover node already created."
"Setting replication.password ..."
"Node configuration done."
"Starting pg_autoctl..."
"11:33:52 1 INFO  Started pg_autoctl postgres service with pid 29"
"11:33:52 29 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/postgresql/node -v"
"11:33:52 1 INFO  Started pg_autoctl node-active service with pid 30"
"11:33:52 30 INFO   /usr/bin/pg_autoctl do service node-active --pgdata /var/lib/postgresql/node -v"
"11:33:52 30 INFO  Reloaded the new configuration from ""/home/postgres/.config/pg_autoctl/var/lib/postgresql/node/pg_autoctl.cfg"""
"11:33:52 30 INFO  pg_autoctl service is running, current state is ""wait_primary"""
"11:33:52 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:33:53 30 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes."
"11:33:53 30 INFO  Checking for HBA rules for node 4 ""db002-2"" (db002-2:5432)"
"11:33:53 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl replication ""pgautofailover_replicator"" db002-2 trust"
"11:33:53 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl ""postgres"" ""pgautofailover_replicator"" db002-2 trust"
"11:33:53 30 INFO  Monitor assigned new state ""demoted"""
"11:33:53 30 FATAL pg_autoctl does not know how to reach state ""demoted"" from ""wait_primary"""
"11:33:53 30 ERROR Failed to transition to state ""demoted"", retrying... "
"11:33:54 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:33:54 30 INFO  Monitor assigned new state ""demoted"""
"11:33:54 30 FATAL pg_autoctl does not know how to reach state ""demoted"" from ""wait_primary"""
"11:33:54 30 ERROR Failed to transition to state ""demoted"", retrying... "
"11:33:55 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:33:55 30 INFO  Monitor assigned new state ""demoted"""
"11:33:55 30 FATAL pg_autoctl does not know how to reach state ""demoted"" from ""wait_primary"""
"11:33:55 30 ERROR Failed to transition to state ""demoted"", retrying... "

... ENDLESS RETRIES UNTIL WE STOPPED THE TEST ...

Node 4 (was secondary before pg_autoctl perform failover, timestamps are ~+13s to monitor)

"11:18:41 30 ERROR Monitor ERROR:  deadlock detected"
"11:18:41 30 ERROR Monitor DETAIL:  Process 616799 waits for ExclusiveLock on advisory lock [16385,822708183,0,11]; blocked by process 616803."
"11:18:41 30 ERROR Monitor Process 616803 waits for ShareLock on transaction 341083; blocked by process 616799."
"11:18:41 30 ERROR Monitor HINT:  See server log for query details."
"11:18:41 30 ERROR Failed to get node state for node 4 in group 0 of formation ""default"" with initial state ""secondary"", replication state """", and current lsn ""0/6709C120"", see previous lines for details"
"11:18:41 30 ERROR Failed to get the goal state from the monitor"
"11:18:41 30 INFO  New state for node 6 ""db002-1"" (db002-1:5432): wait_primary ➜ wait_primary"
"11:18:41 30 INFO  Successfully got the goal state from the monitor"
"11:18:41 30 INFO  FSM transition from ""secondary"" to ""prepare_promotion"": Stop traffic to primary, wait for it to finish draining."
"11:18:41 30 INFO  Monitor assigned new state ""prepare_promotion"""
"11:18:41 30 INFO  Transition complete: current state is now ""prepare_promotion"""
"11:18:42 30 INFO  FSM transition from ""prepare_promotion"" to ""stop_replication"": Prevent against split-brain situations."
"11:18:42 30 INFO  Setting default_transaction_read_only to on"
"11:18:42 30 INFO  Monitor assigned new state ""stop_replication"""
"11:18:42 30 INFO  Prevent writes to the promoted standby while the primary is not demoted yet, by making the service incompatible with target_session_attrs = read-write"
"11:18:42 30 INFO  Reloading Postgres configuration and HBA rules"
"11:18:42 30 INFO  Promoting postgres"
"11:18:42 30 INFO  Waiting for postgres to promote"
"11:18:44 30 INFO  Cleaning-up Postgres replication settings"
"11:18:44 30 INFO  Disabling synchronous replication"
"11:18:44 30 INFO  Reloading Postgres configuration and HBA rules"
"11:18:45 30 INFO  Transition complete: current state is now ""stop_replication"""
"11:18:45 30 INFO  New state for this node (node 4, ""db002-2"") (db002-2:5432): prepare_promotion ➜ prepare_promotion"
"11:18:45 30 INFO  New state for this node (node 4, ""db002-2"") (db002-2:5432): prepare_promotion ➜ stop_replication"
"11:18:45 30 INFO  New state for node 6 ""db002-1"" (db002-1:5432): wait_primary ➜ demote_timeout"
"11:18:45 30 INFO  New state for this node (node 4, ""db002-2"") (db002-2:5432): stop_replication ➜ stop_replication"
"11:20:57 1 INFO  pg_autoctl received signal SIGTERM, terminating"
"11:20:57 29 INFO  Postgres controller service received signal SIGTERM, terminating"
"11:20:57 29 INFO  Stopping pg_autoctl postgres service"
"11:20:57 29 INFO  /usr/lib/postgresql/13/bin/pg_ctl --pgdata /var/lib/postgresql/node --wait stop --mode fast"
"11:20:57 1 INFO  Waiting for subprocesses to terminate."
"11:20:59 1 ERROR Unknown subprocess died with pid 105884"
"11:20:59 1 INFO  Stop pg_autoctl"



... USER-INITIATED CONTAINER RESTART ...



"Given log level not defined or unknown"
"Using pg_auto_failover log level INFO ..."
"Generating .pgpass file ... "
"pg_auto_failover node already created."
"Setting replication.password ..."
"Node configuration done."
"Starting pg_autoctl..."
"11:21:02 1 INFO  Started pg_autoctl postgres service with pid 29"
"11:21:02 29 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/postgresql/node -v"
"11:21:02 1 INFO  Started pg_autoctl node-active service with pid 30"
"11:21:02 30 INFO   /usr/bin/pg_autoctl do service node-active --pgdata /var/lib/postgresql/node -v"
"11:21:02 30 INFO  Reloaded the new configuration from ""/home/postgres/.config/pg_autoctl/var/lib/postgresql/node/pg_autoctl.cfg"""
"11:21:02 30 INFO  Contents of ""/var/lib/postgresql/node/postgresql-auto-failover.conf"" have changed, overwriting"
"11:21:02 30 INFO  pg_autoctl service is running, current state is ""stop_replication"""
"11:21:02 30 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes."
"11:21:02 30 INFO  Checking for HBA rules for node 6 ""db002-1"" (db002-1:5432)"
"11:21:02 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl replication ""pgautofailover_replicator"" db002-1 trust"
"11:21:02 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl ""postgres"" ""pgautofailover_replicator"" db002-1 trust"
"11:21:02 38 INFO   /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/node -p 5432 -h *"
"11:21:03 29 INFO  Postgres is now serving PGDATA ""/var/lib/postgresql/node"" on port 5432 with pid 38"
"11:21:03 30 WARN  PostgreSQL was not running, restarted with pid 38"
"11:30:06 1 INFO  pg_autoctl received signal SIGTERM, terminating"
"11:30:06 29 INFO  Postgres controller service received signal SIGTERM, terminating"
"11:30:06 29 INFO  Stopping pg_autoctl postgres service"
"11:30:06 29 INFO  /usr/lib/postgresql/13/bin/pg_ctl --pgdata /var/lib/postgresql/node --wait stop --mode fast"
"11:30:06 1 INFO  Waiting for subprocesses to terminate."
"11:30:06 1 ERROR Unknown subprocess died with pid 39"
"11:30:07 1 INFO  Stop pg_autoctl"



... USER-INITIATED CONTAINER RESTART ...



"Given log level not defined or unknown"
"Using pg_auto_failover log level INFO ..."
"Generating .pgpass file ... "
"pg_auto_failover node already created."
"Setting replication.password ..."
"Node configuration done."
"Starting pg_autoctl..."
"11:31:31 1 INFO  Started pg_autoctl postgres service with pid 29"
"11:31:31 29 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/postgresql/node -v"
"11:31:31 1 INFO  Started pg_autoctl node-active service with pid 30"
"11:31:31 30 INFO   /usr/bin/pg_autoctl do service node-active --pgdata /var/lib/postgresql/node -v"
"11:31:31 30 INFO  Reloaded the new configuration from ""/home/postgres/.config/pg_autoctl/var/lib/postgresql/node/pg_autoctl.cfg"""
"11:31:31 30 INFO  pg_autoctl service is running, current state is ""stop_replication"""
"11:31:31 30 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes."
"11:31:31 30 INFO  Checking for HBA rules for node 6 ""db002-1"" (db002-1:5432)"
"11:31:31 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl replication ""pgautofailover_replicator"" db002-1 trust"
"11:31:31 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl ""postgres"" ""pgautofailover_replicator"" db002-1 trust"
"11:31:31 30 INFO  Monitor assigned new state ""wait_primary"""
"11:31:31 38 INFO   /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/node -p 5432 -h *"
"11:31:31 29 INFO  Postgres is now serving PGDATA ""/var/lib/postgresql/node"" on port 5432 with pid 38"
"11:31:32 30 WARN  PostgreSQL was not running, restarted with pid 38"
"11:31:32 30 INFO  FSM transition from ""stop_replication"" to ""wait_primary"": Confirmed promotion with the monitor"
"11:31:32 30 INFO  Setting default_transaction_read_only to off"
"11:31:32 30 INFO  Reloading Postgres configuration and HBA rules"
"11:31:32 30 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes."
"11:31:32 30 INFO  Checking for HBA rules for node 6 ""db002-1"" (db002-1:5432)"
"11:31:32 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl replication ""pgautofailover_replicator"" db002-1 trust"
"11:31:32 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl ""postgres"" ""pgautofailover_replicator"" db002-1 trust"
"11:31:32 30 INFO  Transition complete: current state is now ""wait_primary"""

... SOME SHUTDOWN LOGS MISSING ...



... USER-INITIATED CONTAINER RESTART ...



"Given log level not defined or unknown"
"Using pg_auto_failover log level INFO ..."
"Generating .pgpass file ... "
"pg_auto_failover node already created."
"Setting replication.password ..."
"Node configuration done."
"Starting pg_autoctl..."
"11:35:59 1 INFO  Started pg_autoctl postgres service with pid 29"
"11:35:59 29 INFO   /usr/bin/pg_autoctl do service postgres --pgdata /var/lib/postgresql/node -v"
"11:35:59 1 INFO  Started pg_autoctl node-active service with pid 30"
"11:35:59 30 INFO   /usr/bin/pg_autoctl do service node-active --pgdata /var/lib/postgresql/node -v"
"11:35:59 30 INFO  Reloaded the new configuration from ""/home/postgres/.config/pg_autoctl/var/lib/postgresql/node/pg_autoctl.cfg"""
"11:35:59 30 INFO  pg_autoctl service is running, current state is ""wait_primary"""
"11:35:59 30 WARN  Failed to update the keeper's state from the local PostgreSQL instance."
"11:35:59 30 INFO  Fetched current list of 1 other nodes from the monitor to update HBA rules, including 1 changes."
"11:35:59 30 INFO  Checking for HBA rules for node 6 ""db002-1"" (db002-1:5432)"
"11:35:59 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl replication ""pgautofailover_replicator"" db002-1 trust"
"11:35:59 30 WARN  Skipping HBA edits (per --skip-pg-hba) for rule: hostssl ""postgres"" ""pgautofailover_replicator"" db002-1 trust"
"11:35:59 38 INFO   /usr/lib/postgresql/13/bin/postgres -D /var/lib/postgresql/node -p 5432 -h *"
"11:35:59 30 WARN  PostgreSQL was not running, restarted with pid 38"
"11:35:59 29 INFO  Postgres is now serving PGDATA ""/var/lib/postgresql/node"" on port 5432 with pid 38"
"11:36:00 30 INFO  Updated the keeper's state from the local PostgreSQL instance, which is running"
"11:36:00 30 INFO  pg_autoctl managed to ensure current state ""wait_primary"": PostgreSQL is running"
"12:11:37 30 INFO  Creating replication slot ""pgautofailover_standby_6"""
"13:10:39 30 INFO  New state for this node (node 4, ""db002-2"") (db002-2:5432): wait_primary ➜ single"
"13:10:39 30 INFO  Monitor assigned new state ""single"""
"13:10:39 30 INFO  Dropping replication slot ""pgautofailover_standby_6"""
"13:10:39 30 INFO  FSM transition from ""wait_primary"" to ""single"": Other node was forcibly removed, now single"
"13:10:39 30 INFO  Disabling synchronous replication"
"13:10:39 30 INFO  Reloading Postgres configuration and HBA rules"
"13:10:39 30 INFO  Transition complete: current state is now ""single"""
"13:10:39 30 INFO  New state for this node (node 4, ""db002-2"") (db002-2:5432): wait_primary ➜ single"
"13:10:39 30 INFO  New state for this node (node 4, ""db002-2"") (db002-2:5432): single ➜ single"
"13:34:40 1 INFO  pg_autoctl received signal SIGTERM, terminating"
"13:34:40 29 INFO  Postgres controller service received signal SIGTERM, terminating"
"13:34:40 29 INFO  Stopping pg_autoctl postgres service"
"13:34:40 29 INFO  /usr/lib/postgresql/13/bin/pg_ctl --pgdata /var/lib/postgresql/node --wait stop --mode fast"
"13:34:40 1 INFO  Waiting for subprocesses to terminate."
"13:34:40 1 ERROR Unknown subprocess died with pid 39"
"13:34:40 1 INFO  Stop pg_autoctl"

Jul 20 '21 15:07 jnehlmeier

On a first reading it looks like the secondary node is failing right after a manual failover has been started. Is that correct? If that's the case, I will play with testing this scenario and adding coverage in the test suite (unless we already have it).

Also please note that if we're seeing a failure during a failover operation, then we are in the territory of “multiple node failures” which where we have yet work to do. Our model at the moment is meant to fully tolerate the failure of any one node at a time, and knows how to handle multiple failures in some cases only.

Aug 04 '21 15:08 DimCitus

Both postgres nodes were running at the time of failover. I believe there was some docker networking hiccup and pg_auto_failover already knew that the secondary is gone but did not take any action yet (maybe some timeout wasn't reached yet). However it did take the action as soon as we manually triggered the failover process. Sadly it does it at a bad time and breaks FSM.

Basically the primary should progress primary -> draining -> demote_timeout. But during that transition the monitor sets the state of the primary to wait_primary since the secondary wasn't reachable. When the monitor does this, it should not continue with the usual (requested) failover transition because that's why we end up in wait_primary -> demote_timeout.

Aug 04 '21 16:08 jnehlmeier