pg_auto_failover Setting replication.password doesn't set the password in the primary's database instance

Description

Following the steps in the documentation here to setup password based authentication is not enough without an added: psql -p ${PRIMARY_PORT} postgres -c "alter user pgautofailover_replicator password 'h4ckm3m0r3';

This ALTER command needs to be run before the secondary is created and started:

$ export PGPASSWORD=h4ckm3m0r3
$ pg_autoctl create postgres       \
       --auth scram-sha-256        \
       ...                         \
       --monitor postgres://autoctl_node:[email protected]/pg_auto_failover

$ pg_autoctl config set replication.password h4ckm3m0r3

Log excerpt when we try to create the secondary (pg_basebackup fails in auth):

16:14:46 538293 INFO   /usr/local/pgsql/bin/pg_basebackup -w -d application_name=pgautofailover_standby_5 host=127.0.0.1 port=8800 user=pgautofailover_replicator sslmode=prefer --pgdata /home/pivotal/failover/backup/node_5 -U pgautofailover_replicator --verbose --progress -
-max-rate 100M --wal-method=stream --slot pgautofailover_standby_5                                                                                                                                                                                                                
16:14:46 538293 INFO  pg_basebackup: error: could not connect to server: FATAL:  password authentication failed for user "pgautofailover_replicator"
16:14:46 538293 ERROR Failed to run pg_basebackup: exit code 1
16:14:46 538293 ERROR Failed initialize standby server, see above for details
16:14:46 538293 ERROR Failed to transition from state "wait_standby" to state "catchingup", see above.

RCA:

If we look at pg_shadow after pg_autoctl config set replication.password h4ckm3m0r3:

$ psql postgres -p ${PRIMARY_PORT} -c "select usename, passwd from pg_shadow where usename='pgautofailover_replicator';"
          usename          | passwd 
---------------------------+--------
 pgautofailover_replicator | 
(1 row)

We see that pgautofailover_replicator is created w/ a blank password. In the code ~~keeper_cli_create_replication_user~~ primary_create_replication_user creates the pgautofailover_replicator user with the password that it reads from the config during the course of a create postgres. At that time replication.password is undefined in the config.

Conclusions and questions:

I think we have to either add psql -p ${PRIMARY_PORT} postgres -c "alter user pgautofailover_replicator password 'h4ckm3m0r3'; to the documentation or have pg_autoctl config set replication.password do the ALTER for us.
Is there a way to specify the replication.password before doing create postgres? That way the replication user will be created with the right password during create postgres. It errors out if I do this before a create postgres:

$ PGDATA=~/failover/primary pg_autoctl config set replication.password 'h4ckm3m0r3'
16:30:10 557345 FATAL Expected configuration file does not exists: "/home/pivotal/.config/pg_autoctl/home/pivotal/failover/primary/pg_autoctl.cfg"
16:30:10 557345 WARN  HINT: Check your PGDATA setting: "/home/pivotal/failover/primary"

pg_autoctl do adduser replica after a pg_autoctl set replication.password currently fails as:

$ PG_AUTOCTL_DEBUG=1 pg_autoctl do primary adduser replica --pgdata ~/failover/primary                                                                                                                                                       
16:22:09 547693 INFO  The user "pgautofailover_replicator" already exists, skipping.

Currently, It seems that pg_autoctl set replication.password has no effect apart from writing the password to the config file.

Aug 28 '20 23:08 soumyadeep2007

Hi @soumyadeep2007, thanks for looking into this area in details.

As you can see in https://github.com/citusdata/pg_auto_failover/blob/master/tests/test_auth.py at the moment the setting of the password is not part of the pg_autoctl implementation. I think your proposal of having the command pg_autoctl config set replication.password also do ALTER USER pgautofailover_replicator PASSWORD ... sounds good.

There is a difficulty though: when the command is used, either pg_autoctl / Postgres are running, or they are not.

If they are running, changing the password is easy.
When Postgres is not running, all we can do is change the configuration file, and then it means that each time we start pg_autoctl we need to ALTER USER ... PASSWORD again in case it has changed (kind of cache invalidation). I'm okay with doing that.

Note that we can't have a copy of the password around in our local state file or something, because any other tool cool go do ALTER USER ... PASSWORD and we would have no way to guarantee that the password cache we have is still valid.

Still with this approach the user has to sync the replication password on every node. I think that's better than storing the password in the clear in the monitor database, though.

Now on secondary nodes, what needs to happen at pg_autoctl config set replication.password is different:

we need to validate that we can connect to the primary with the new password before accepting it, now that we have the pg_receivewal(pgdata, pg_ctl, replicationSource, "0/1", LOG_DEBUG) trick for that (see https://github.com/citusdata/pg_auto_failover/blob/master/src/bin/pg_autoctl/pgctl.c#L1449),
then we need to change the password in our configuration file,
then we need to change the primary_conninfo setting in the recovery setup,
and then we need to disconnect and reconnect to the new primary, which means restarting Postgres on the standby node.

I don't think we need to go to MAINTENANCE on the secondary servers where we restart. Because we need to change the password on the primary first, it might be that the secondary is not in a stable state anymore: if it had to reconnect, then it's now failing to reconnect because it still has the wrong password, and we still need a robust way to install the new password.

Again, if neither pg_autoctl nor Postgres are running, all we can do is change the password in our configuration file, and recompute the primary_conninfo in the Postgres setup at start-up, always.

Aug 31 '20 10:08 DimCitus

Hey @DimCitus ,

Thanks for the detailed steps!

I have a somewhat alternate idea. what if the user had a way to provide pgautofailover_replicator passwords if --auth is password based ({md5, password, scram-sha-256}) when they perform pg_autoctl create postgres?

One way to do that would be to ask for it in the PGPASSWORD/PGPASSFILE when running pg_autoctl create postgres.
Another way: As part of a larger feature: the user could specify a config file param to create postgres with any subset of config fields (one such field could be replication.password)

Personally, I think I would like the latter. This is because the PGPASSWORD/PGPASSFILE vars are drop-in replacements for the password connection param in connection URLs. Consequently, the end user may be confused between whether we are expecting replication.password or whether we are expecting the password for autoctl_node (this is kind of confusing when we are creating standbys today). And besides, letting the user specify a config file that overrides defaults at the create stage could have other benefits too.

If we don't opt for that and we stick with the idea of letting config set replication.password do the heavy lifting, let me know and I can take a stab at a PR following the steps you had mentioned, if you want. I can also do a PR for providing a config file at the create stage.

Sep 01 '20 03:09 soumyadeep2007

Hi @soumyadeep2007,

I think things are more complex than they look at first. The --auth option we provide is used to open the HBA for three different users: the monitor user, the replication user, and the --username user too. But we only have one PGPASSWORD in the environment.

Also, as we use libpq then the pgpassfile is used when a connection is established to either the monitor or an upstream primary server already. What you're saying that we do not implement would be parsing the password file and using the values in there to fill-in the user password's in the Postgres role entry, is that right? But then it's a chicken-and-egg problem where you need to create the pgpassfile before using pg_autoctl create commands, or action-from-a-distance that's hard to document and can be quite surprising when you don't expect it.

About your next idea and the pre-existing partial configuration, I believe that the way our code is written that's already possible. If you provision a pg_autoctl configuration before running any command, your file and the command line options are going to be merged together, right?

That said I don't understand how any of that is fixing the fact that the replication.password entry in the pg_autoctl config is not applied to the Postgres user, and those two need to be kept in sync manually where pg_autoctl could be automating that step. What am I missing here?

Sep 01 '20 09:09 DimCitus

About your next idea and the pre-existing partial configuration, I believe that the way our code is written that's already possible. If you provision a pg_autoctl configuration before running any command, your file and the command line options are going to be merged together, right?

That said I don't understand how any of that is fixing the fact that the replication.password entry in the pg_autoctl config is not applied to the Postgres user, and those two need to be kept in sync manually where pg_autoctl could be automating that step. What am I missing here?

I think ideally if we can provide this minimal configuration in ${path_to_primary_config}/pg_autoctl.cfg:

(Exhibit A)

[replication]
password = rephack

Then when we create postgres to create the primary, primary_create_replication_user will pick up the right password since replication.password would be set.

$ PG_AUTOCTL_DEBUG=1 pg_autoctl create postgres --pgdata /home/pivotal/failover/primary --pgport=8800 \
--monitor postgres://[email protected]:7800/pg_auto_failover?sslmode=prefer --auth trust --no-ssl \
--hostname 127.0.0.1

When I tried this I ran into errors such as: 09:19:25 7519 ERROR Failed to find section pg_autoctl in "/home/pivotal/.config/pg_autoctl/home/pivotal/failover/primary/pg_autoctl.cfg"

09:20:24 7530 ERROR Failed to find option postgresql.port in "/home/pivotal/.config/pg_autoctl/home/pivotal/failover/primary/pg_autoctl.cfg"

The minimal configuration that I converged to avoid ERRORs was:

(Exhibit B)

[pg_autoctl]
role = keeper
monitor = postgres://[email protected]:7800/pg_auto_failover?sslmode=prefer
formation = default
group = 0
name = node_1
hostname = 127.0.0.1
nodekind = standalone

[postgresql]
pgdata = /home/pivotal/failover/primary
pg_ctl = /usr/local/pgsql/bin/pg_ctl
dbname = postgres
port = 8800

[ssl]
active = 0
sslmode = prefer

[replication]
password = rephack

Even though I did provide options such as --pgport during create postgres, ERRORs were still thrown.

With Exhibit B though, things did work and pgautofailover_replicator was set up with the right password!

With this in mind, I as an end user, would like to be able to get things to work with just Exhibit A

Sep 01 '20 17:09 soumyadeep2007

Something we had in the very early prototype was a pg_autoctl create setup command that would only check things and create a configuration, but refrain from registering to the monitor and things like that. Then you could use pg_autoctl create postgres even without any argument and it would do the trick.

Given the current organisation of the code and protocols, I think it might be possible to come up with a pg_autoctl register postgres command that does the registration on the monitor and the local setup but refrains from doing the initdb and the local Postgres instance. You could then edit the local configuration file with pg_autoctl config set ... and only then do pg_autoctl create postgres.

I'm not sure about the level of complexity for this idea as far as end-users are concerned. On the other hand it's an advanced command, and could be documented/introduced that way. The alternative of parsing a partial file makes it complex to implement validation of the configuration file, because no option can be set to required in the code anymore, and I'm not sure I like that.

Again, say we achieve that, how does this fix the initial problem of syncing the replication password in the pg_autoctl configuration and in the Postgres database?

Users may change the replication password whenever they want, and given sync with Postgres, that can be just after the pg_autoctl create postgres command that leaves an empty password in there. It seems to me that all the other ideas we're exploring are making things more complex rather than easier to use and understand.

Sep 01 '20 17:09 DimCitus

I'm not sure about the level of complexity for this idea as far as end-users are concerned. On the other hand it's an advanced command, and could be documented/introduced that way. The alternative of parsing a partial file makes it complex to implement validation of the configuration file, because no option can be set to required in the code anymore, and I'm not sure I like that.

Instead of passing in a partial file, we could have something like this perhaps: pg_autoctl create postgres .. -c replication.password 'rephack' -c some_other_param some_other_param_value where -c could be used to provide any config param.

Anyway, you are right - the aforementioned ideas are a workaround and really avoid the question of "what if the user wants to change the replication.password after node creation"?

I'll take a stab at automating the steps for config set to create the password as you mentioned before.

Sep 01 '20 21:09 soumyadeep2007