Consider dropping data directory during recovery via pg_basebackup
When a failover happens, pg_rewind of the old primary does not always succeed. In that case pg_auto_failover falls back to recovering via pg_basebackup. This is great!
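However, once the database grows beyond roughly 50% of the available disk space (give or take; running out of inodes could cause other issues), a pg_basebackup might not succeed without operator intervention, because the old data directory and the fresh copy have to fit on the same disk at the same time.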
It would be great if pg_auto_failover had an option where, instead of retaining the old database directory, it deletes this directory to make sure enough space is available on the node before initiating a pg_basebackup.
Alternatively we could go as far as designing a tristate for this setting:
- never delete the old directory early, that is, keep it until the backup is completely transferred and we swap the directories
- delete old directory when space is being contested
- delete old directory before initiating a restore via pg_basebackup
(Maybe even a 4th state where the old data directory is retained until it is manually deleted, or until another failover happens, so we can diagnose why pg_rewind failed.)
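To make the proposal easier to discuss, here is a rough sketch of how such a setting could be modelled. None of these names exist in pg_auto_failover: `OldPgdataPolicy`, its values, and `should_delete_old_pgdata_first` are made up for illustration, and the "space is being contested" test assumes the fresh copy is about as large as the old data directory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical setting values, one per state listed above. */
typedef enum
{
    OLD_PGDATA_KEEP_UNTIL_SWAP,       /* delete only after the new copy is swapped in */
    OLD_PGDATA_DELETE_IF_CONTESTED,   /* delete first only when space is short */
    OLD_PGDATA_DELETE_BEFORE_BACKUP,  /* always delete before pg_basebackup */
    OLD_PGDATA_KEEP_FOR_DIAGNOSTICS   /* possible 4th state: keep until removed by hand */
} OldPgdataPolicy;

/*
 * Decide whether the old data directory should be removed before the
 * base backup starts. Sizes are in bytes.
 */
bool
should_delete_old_pgdata_first(OldPgdataPolicy policy,
                               uint64_t pgdata_bytes,
                               uint64_t free_bytes)
{
    switch (policy)
    {
        case OLD_PGDATA_KEEP_UNTIL_SWAP:
        case OLD_PGDATA_KEEP_FOR_DIAGNOSTICS:
            return false;

        case OLD_PGDATA_DELETE_IF_CONTESTED:
            /* assumption: the fresh copy is about as large as the old one */
            return free_bytes < pgdata_bytes;

        case OLD_PGDATA_DELETE_BEFORE_BACKUP:
            return true;
    }

    /* only reachable with a value outside the enum */
    assert(!"unknown OldPgdataPolicy value");
    return false;
}
```

The two "keep" states behave the same before the backup starts; they would differ only afterwards, in whether the old directory is removed at swap time or left in place for diagnostics.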
For many installations, deleting the old directory before initiating a restore via pg_basebackup is a very sensible option. If the rewind failed, pg_auto_failover will copy a fresh data directory over and configure the node as a secondary. Under most circumstances, this keeps the system running without an operator having to make sure enough space is available on the data drive.
See also #853, which led us to use the pg_basebackup tar format (maybe even tar.gz) when fetching the data, prior to swapping it into PGDATA. In a way, it makes reasoning about the necessary disk space more complex, because we might now need both the “download” area and the “production” area in use at the same time for a while.
Given the following in our function pg_basebackup https://github.com/citusdata/pg_auto_failover/blob/d7997ffc3f1209483a37fe7e8ed49fe7a000f664/src/bin/pg_autoctl/pgctl.c#L1280 I would say that https://github.com/citusdata/pg_auto_failover/pull/870 indeed fixed this.