pg_auto_failover

React correctly to change in replication.password

soumyadeep2007 opened this issue 5 years ago • 5 comments

Addresses #382 .

Followed the steps outlined in https://github.com/citusdata/pg_auto_failover/issues/382#issuecomment-683686687

soumyadeep2007 avatar Sep 11 '20 20:09 soumyadeep2007

Thank you for the detailed feedback! I will address it.

soumyadeep2007 avatar Sep 16 '20 21:09 soumyadeep2007

I rebased the branch against the latest master and then made a separate commit (to be squashed later) to address the feedback so far.

I had a few questions and comments about the changes you requested and my latest commit to address those:

  • keeper_ensure_configuration already has logic to validate the connection string with the new password via IDENTIFY_SYSTEM, updating primary_conninfo and restarting the standby. So my commit just leverages that without any additional validation: I simply ensure that the replicationSource supplied to pg_setup_standby_mode has the right primary node info.

  • I have gotten rid of the changes to keeper_config_set_setting. The flow now is: the password is set in the config without validation (equivalent to someone setting it in the file at rest) -> reload signal -> the password change is acted upon when the keeper config is reloaded. By moving the handling of the change in replication password to keeper_ensure_configuration, we are accepting the new replication password in the config file and in the keeper's in-memory config data structure (even if the password is "bad" in the case of a standby), which is something I thought we originally wanted to guard with validation. Is this not a requirement? However, we do ensure that a bad password never makes it into the replication settings in the conf files. Maybe this isn't a bad pattern?

  • When we are in keeper_ensure_configuration, we have no guarantee that the server is running. If it is not running, or there is any sort of error such as a bad password, we retry the config reload later. Retries of the reload are introduced in this change by looking at the return value of a reload hook and returning false from keeper_reload_configuration when we cannot act upon a change in replication password, or upon any other failure (see the sketch after this list). Maybe this change is too invasive? Alternative approach: instead, maybe we can have the retry logic by using a replication_password_changed global, set it in keeper_config_accept_new, and use it to call keeper_ensure_configuration inside keeper_node_active? We can clear the global on a successful catalog update (for a primary) and after a connect->update_primary_conninfo->restart (for a standby) to break the retry cycle?

  • What should the behavior be when we supply a bad replication password to the standby? Should we retry indefinitely, or shut everything down after a certain number of retries?
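
Roughly, the reload-hook retry idea mentioned above could look like the following minimal sketch; the type and function names here are illustrative assumptions, not the actual pg_auto_failover signatures:

```c
/*
 * Minimal sketch of the reload-hook retry idea: keep the reload pending
 * until the hook reports success.  Names and signatures are illustrative,
 * not the actual pg_auto_failover API.
 */
#include <signal.h>
#include <stdbool.h>

typedef struct Keeper Keeper;                              /* opaque stand-in */
extern bool keeper_reload_configuration(Keeper *keeper);  /* assumed hook */

static volatile sig_atomic_t reload_pending = 0;   /* set by the SIGHUP handler */

static void
maybe_reload(Keeper *keeper)
{
	if (reload_pending)
	{
		/*
		 * Clear the flag only when the hook reports success, so that a
		 * failure (Postgres not running, bad replication password, ...)
		 * causes the reload to be retried on the next loop iteration.
		 */
		if (keeper_reload_configuration(keeper))
		{
			reload_pending = 0;
		}
	}
}
```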

I somehow don't feel good about these changes. I probably have several gaps in my understanding of the requirements.

soumyadeep2007 avatar Sep 22 '20 00:09 soumyadeep2007

Hi @soumyadeep2007; thanks for this new version of the PR. There's still more work to do. Let's also try to review the high-level approach here:

  • keeper_ensure_configuration already has logic to validate the connection string with the new password via IDENTIFY_SYSTEM, updating primary_conninfo and restarting the standby. So my commit just leverages that without any additional validation: I simply ensure that the replicationSource supplied to pg_setup_standby_mode has the right primary node info.

Sounds good, thanks!

  • I have gotten rid of the changes to keeper_config_set_setting. The flow now is: the password is set in the config without validation (equivalent to someone setting it in the file at rest) -> reload signal -> the password change is acted upon when the keeper config is reloaded.

Thanks.

By moving the handling of the change in replication password to keeper_ensure_configuration, we are accepting the new replication password in the config file and in the keeper's in-memory config data structure (even if the password is "bad" in the case of a standby), which is something I thought we originally wanted to guard with validation. Is this not a requirement?

The validation we need is twofold: on the one hand, that the primary is going to accept the password, and the only way to ensure that is by running the ALTER ROLE ... PASSWORD command; and on the other hand, that the secondary can connect to the primary when using the new password, which is done in pg_setup_standby_mode using IDENTIFY_SYSTEM, as you note later.

The way it's done now, we have to read the new password from the configuration file anyway before we impact the system. That's too late to invalidate the change, so I don't see the point of the guard anymore.

The guard would be useful before we set the password in the configuration. That could be done on the primary, where we can just connect to the database from the pg_autoctl config set replication.password command; then, when the command is successful, we update the configuration file. It would be even better if we could be smart about using a Postgres transaction to sync the file edits with the password validation. See keeper_register_and_init for an example where we already do that.

I think implementing the password change in the interactive process, rather than signaling the background process for a reload, would be a better approach on the primary.
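
As a minimal sketch of that ordering on the primary, it could look roughly like the following; the pgautofailover_replicator role name and the write_keeper_config_file() helper are assumptions used for illustration, not the actual pg_auto_failover code:

```c
/*
 * Sketch: validate the new password on the primary with ALTER ROLE inside
 * an explicit transaction, write the keeper configuration file, and COMMIT
 * only once both steps have succeeded.
 */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <libpq-fe.h>

extern bool write_keeper_config_file(const char *newPassword);   /* assumed helper */

static bool
primary_set_replication_password(const char *conninfo, const char *newPassword)
{
	bool success = false;
	PGconn *conn = PQconnectdb(conninfo);

	if (PQstatus(conn) != CONNECTION_OK)
	{
		PQfinish(conn);
		return false;
	}

	PQclear(PQexec(conn, "BEGIN"));

	char *quoted = PQescapeLiteral(conn, newPassword, strlen(newPassword));
	if (quoted == NULL)
	{
		PQfinish(conn);         /* closing the connection aborts the transaction */
		return false;
	}

	char sql[1024];
	snprintf(sql, sizeof(sql),
			 "ALTER ROLE pgautofailover_replicator WITH PASSWORD %s", quoted);
	PQfreemem(quoted);

	PGresult *res = PQexec(conn, sql);
	if (PQresultStatus(res) == PGRES_COMMAND_OK
		&& write_keeper_config_file(newPassword))
	{
		PQclear(res);
		res = PQexec(conn, "COMMIT");   /* password and config file now agree */
		success = (PQresultStatus(res) == PGRES_COMMAND_OK);
	}
	else
	{
		PQclear(res);
		res = PQexec(conn, "ROLLBACK"); /* keep the previous password everywhere */
	}

	PQclear(res);
	PQfinish(conn);
	return success;
}
```

If the file write fails, the ROLLBACK leaves the role's password as it was, so the catalog and the configuration file never disagree.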

On a secondary server, though, we have to ask the background process to orchestrate the change of the replication parameters and restart Postgres. That said, in Postgres 13 we might be able to just reload when primary_conninfo is changed, and we now have official support for Postgres 13, by the way.

Again, the validation on the standby could be done before we enter the reload signal processing on the background process. In the interactive process, we can try to connect using the password and the IDENTIFY_SYSTEM command, and only change the configuration file and reload when that succeeds. I'm not completely sure that we want to impose an ordering on the password change here, though.
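
That interactive check on the standby side could look roughly like the following minimal sketch; the conninfo handling is simplified and the function is an illustration, not the pg_auto_failover implementation:

```c
/*
 * Sketch: verify the new password against the current primary by opening a
 * replication-protocol connection and running IDENTIFY_SYSTEM, before any
 * configuration file is changed.
 */
#include <stdbool.h>
#include <stdio.h>
#include <libpq-fe.h>

static bool
standby_can_replicate_with(const char *primaryHost, int primaryPort,
						   const char *user, const char *password)
{
	char conninfo[1024];
	snprintf(conninfo, sizeof(conninfo),
			 "host=%s port=%d user=%s password=%s replication=1",
			 primaryHost, primaryPort, user, password);

	PGconn *conn = PQconnectdb(conninfo);
	if (PQstatus(conn) != CONNECTION_OK)
	{
		PQfinish(conn);
		return false;           /* wrong password, primary unreachable, etc. */
	}

	PGresult *res = PQexec(conn, "IDENTIFY_SYSTEM");
	bool ok = (PQresultStatus(res) == PGRES_TUPLES_OK);

	PQclear(res);
	PQfinish(conn);
	return ok;
}
```

When this check fails, nothing has been written yet, so there is nothing to undo.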

  • When we are in keeper_ensure_configuration, we have no guarantee that the server is running. If it is not running, or there is any sort of error such as a bad password, we retry the config reload later. Retries of the reload are introduced in this change by looking at the return value of a reload hook and returning false from keeper_reload_configuration when we cannot act upon a change in replication password, or upon any other failure. Maybe this change is too invasive?

Yeah I'm not sure I like the infinite-eternal-retry-reload thing.

Alternative approach: instead, maybe we can have the retry logic by using a replication_password_changed global, set it in keeper_config_accept_new, and use it to call keeper_ensure_configuration inside keeper_node_active? We can clear the global on a successful catalog update (for a primary) and after a connect->update_primary_conninfo->restart (for a standby) to break the retry cycle?

That said, I don't like that approach either. The usual case to guard against is: what if we restart pg_autoctl? Then our internal global variable doesn't help anymore. In the case of the reload, at least we break out of the hell loop.

  • What should the behavior be when we supply a bad replication password to the standby? Should we retry indefinitely, or shut everything down after a certain number of retries?

Ideally we would reject it because the interactive command failed to connect to the primary. We would use a retry policy that keeps retrying once in a while for a long time, so as to avoid requiring perfect timing of the operation on every node; that's almost impossible to achieve unless you serialise, but then you need to figure out which node is currently the primary and cross your fingers hard that it doesn't change in the middle of the process.
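
Such a retry policy could look roughly like this minimal sketch; the interval, the overall window, and the try_apply_replication_password() helper are illustrative assumptions, not pg_auto_failover defaults:

```c
/*
 * Sketch: keep retrying at a modest interval for a long overall window,
 * rather than failing fast or looping forever.
 */
#include <stdbool.h>
#include <time.h>
#include <unistd.h>

#define RETRY_INTERVAL_SECS 5
#define RETRY_WINDOW_SECS   (2 * 60 * 60)   /* keep trying for two hours */

extern bool try_apply_replication_password(void);   /* assumed helper */

static bool
apply_password_with_retries(void)
{
	time_t start = time(NULL);

	while (time(NULL) - start < RETRY_WINDOW_SECS)
	{
		if (try_apply_replication_password())
		{
			return true;
		}
		sleep(RETRY_INTERVAL_SECS);   /* wait before retrying */
	}
	return false;                     /* give up after the window elapses */
}
```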

I somehow don't feel good about these changes. I probably have several gaps in my understanding of the requirements.

I am not certain how much this comment is going to help. Let's discuss more and maybe organise a meeting if you feel like it's necessary.

DimCitus avatar Oct 05 '20 16:10 DimCitus

@DimCitus

Together with @jchampio, I have addressed much of the feedback you had for us.

We haven't yet done anything to change the retry logic in the client-side pg_autoctl set process (the default interactive retry policy kicks in) or the retry logic for the keeper config reload (there are currently infinite retries).

soumyadeep2007 avatar Oct 22 '20 20:10 soumyadeep2007

Added a squash commit to the end of the patchset. This fixes an issue with our transaction handling in the primary case (the connection was previously being closed before a COMMIT could be correctly issued), adjusts test_auth for the new behavioral guarantees, and implements the duplicated logic for the primary and standby code paths.
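
For reference, the ordering matters because closing the connection while a transaction is still open aborts it; a minimal illustration of the corrected order (not the actual patch):

```c
/*
 * Sketch of the corrected ordering: issue COMMIT while the connection is
 * still open, then close it.  Calling PQfinish() first would implicitly
 * roll back the pending ALTER ROLE.
 */
#include <libpq-fe.h>

static void
finish_password_change(PGconn *conn)
{
	PGresult *res = PQexec(conn, "COMMIT");   /* commit first ... */
	PQclear(res);
	PQfinish(conn);                           /* ... then close the connection */
}
```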

Several action items mentioned above have changed considerably, so let's regroup. Here's where I think we are in the roadmap:

  • Primary/standby logic is no longer combined into a single code path.
  • Ping logic has been pulled inside connection establishment. It's not possible to remove the *unreachable output parameter from pgctl_identify_system() without a bigger API change, unfortunately, as the caller isn't required to have a PGSQL object ready by the time that function is called.
  • A password cache system has been set up, which I believe (but am not entirely sure) meets the three requirements set out here.
  • Testing is sparse. We're having trouble implementing race-free tests due to all of the moving pieces here (we have to get the tests to wait for state changes out of, and back to, the secondary state, and there aren't currently any test primitives for that).

Unfortunately I'm not able to mark conversations as resolved. 😁 Are there other outstanding items I've missed above that are still needed?

jchampio avatar Nov 04 '20 19:11 jchampio