pg_auto_failover Wrong password used by pg_autoctl during monitor connect

pg_auto_failover version 1.5.2

During first initialization of cluster (monitor + 2 postgres nodes) we discovered that secondary postgres node uses the replication password when connecting to the monitor after initial base backup has been done and postgres starts the first time.

During initial testing we used the same password for autoctl_node and pgautofailover_replicator and everything did work as expected. Then we decided to use two different passwords and pg_auto_failover was able to register the secondary postgres node with the monitor and do a base backup from the primary node. But after creating the base backup pg_autoctl fails to connect to the monitor node to ask for other nodes. Reverting the replication password back to the monitor password resolves the issue.

Also worth mentioning that we are using .pgpass file and do not have any passwords in monitor URL. The .pgpass file has been verified to contain the correct passwords for autoctl_node and pgautofailover_replicator and connecting manually from console as postgres user works as expected. Adding the monitor password directly to the monitor URL fixes the problem as well. So the issue only occurs when using .pgpass.

This method does not return true after several attempts to connect to the monitor: https://github.com/citusdata/pg_auto_failover/blob/125bb19c13ea8af1d6aacd5f2f9f7cd51b406e74/src/bin/pg_autoctl/keeper.c#L1840

It seems that the pg client instance attached to the monitor is not correctly configured.

The following log is after a successful base backup and first start of postgres (db001-m is the monitor node):

14:22:51 30 DEBUG pgsql.c:500 Connecting to [monitor] "postgres://autoctl_node@db001-m:5432/pg_auto_failover?sslmode=require"
14:22:51 30 WARN  pgsql.c:598 Failed to connect to "postgres://autoctl_node@db001-m:5432/pg_auto_failover?sslmode=require", retrying until the server is ready
14:22:51 30 DEBUG pgsql.c:643 PQping(postgres://autoctl_node@db001-m:5432/pg_auto_failover?sslmode=require): slept 5 ms on attempt 2
14:22:51 30 DEBUG pgsql.c:657 PQping OK after 2 attempts
14:22:52 30 DEBUG pgsql.c:464 Connection to database failed: FATAL:  password authentication failed for user "autoctl_node"
14:22:52 30 WARN  pgsql.c:708 Failed to connect after successful ping
14:22:52 30 DEBUG pgsql.c:423 Disconnecting from [monitor] "postgres://autoctl_node@db001-m:5432/pg_auto_failover?sslmode=require"
14:22:52 30 DEBUG pgsql.c:643 PQping(postgres://autoctl_node@db001-m:5432/pg_auto_failover?sslmode=require): slept 9 ms on attempt 3
14:22:52 30 DEBUG pgsql.c:657 PQping OK after 3 attempts
...
...
14:22:54 30 DEBUG pgsql.c:657 PQping OK after 11 attempts
14:22:54 30 DEBUG pgsql.c:423 Disconnecting from [monitor] "postgres://autoctl_node@db001-m:5432/pg_auto_failover?sslmode=require"
14:22:54 30 ERROR pgsql.c:623 Failed to connect to "postgres://autoctl_node@db001-m:5432/pg_auto_failover?sslmode=require" after 11 attempts in 2271 ms, pg_autoctl stops retrying now
14:22:54 30 ERROR monitor.c:498 Failed to get other nodes from the monitor while running "SELECT * FROM pgautofailover.get_other_nodes($1) ORDER BY node_id" with node id 56
14:22:54 30 ERROR keeper.c:1842 Failed to get_other_nodes() on the monitor
14:22:54 30 ERROR fsm_transition.c:872 Failed to update HBA rules after a base backup
14:22:54 30 DEBUG fsm.c:501 Transition function returned: false
14:22:54 30 ERROR fsm.c:519 Failed to transition from state "wait_standby" to state "catchingup", see above.
14:22:54 30 FATAL cli_do_service.c:572 Failed to initialize the node active service, see above for details
14:22:54 1 ERROR supervisor.c:653 pg_autoctl service node-active exited with exit status 12
14:22:54 1 DEBUG supervisor.c:790 supervisor_may_restart: service "node-active" restarted 1 times, most recently at Fri Jul  9 14:22:43 2021, 11 seconds ago
14:22:54 1 INFO  supervisor.c:738 Restarting service node-active

Jul 09 '21 12:07 jnehlmeier

Hi @jnehlmeier ; thanks for your detailed bug report: I believe the error is found here:

https://github.com/citusdata/pg_auto_failover/blob/0fecd653f9ef43019e4d76e347afce378adcc481/src/bin/pg_autoctl/pgctl.c#L1207-L1210

The way we set the replication password when using either pg_basebackup or pg_rewind is through using the PGPASSWORD environment variable, and this environment variable setting is left in place after the command has run. This also matches with your observations.

We could either call unsetenv once the Postgres command has run to completion, or maybe devise another way to set the password. Looking more into it, I see the following part of the code:

https://github.com/citusdata/pg_auto_failover/blob/0fecd653f9ef43019e4d76e347afce378adcc481/src/bin/pg_autoctl/pgctl.c#L2417-L2420

Now I wonder if we could just remove the environment variable setting altogether and see the bug fixed. Can you try removing lines 1207-1210 as linked above (the setenv call) and see if that works for you?

Jul 09 '21 12:07 DimCitus

@DimCitus

I am on Mac OS and not very familiar on how to setup 'make' and pg_auto_failover dependencies correctly in order to compile for Linux. So I can probably not test it quickly.

Jul 09 '21 13:07 jnehlmeier

On Mac I just install https://postgresapp.com and then the Xcode command line tools, and that should be enough to get started. I must admit I have a full dev environment and I'm also using brew to install additional packages, so I'm not sure if that's the entire story here.

Jul 09 '21 13:07 DimCitus