Multiple Backup Repo doesn't include base backup to repo2 when configured for S3
Overview
When configuring a new cluster's pgbackrest with multiple repos - repo1 being local and repo2 being s3 - the s3 repo does not receive a copy of the base backup.
Environment
Please provide the following details:
- Platform: Kubernetes
- Platform Version: 1.24.4
- PGO Image Tag: crunchy-postgres:ubi8-14.5-0, crunchy-pgbackrest:ubi8-2.40-0
- Postgres Version: 14
- Storage: local-path and/or block storage
Steps to Reproduce
REPRO
Provide steps to get to the error condition:
- Use kustomize/myconfig/postgres.yaml (based on the multi-backup-repo example combined with the ha example). Note that repo1 is local and repo2 is remote s3:
kind: PostgresCluster
metadata:
  name: postgres1
spec:
  image: registry.developers.crunchydata.com/crunchydata/crunchy-postgres:ubi8-14.5-0
  postgresVersion: 14
  instances:
    - name: pgha1
      replicas: 2
      dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 1
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    postgres-operator.crunchydata.com/cluster: postgres1
                    postgres-operator.crunchydata.com/instance-set: pgha1
  backups:
    pgbackrest:
      image: registry.developers.crunchydata.com/crunchydata/crunchy-pgbackrest:ubi8-2.40-0
      configuration:
        - secret:
            name: postgres1-creds
      global:
        repo2-path: /pgbackrest/postgres-operator/postgres1
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes:
                - "ReadWriteOnce"
              resources:
                requests:
                  storage: 1Gi
        - name: repo2
          s3:
            bucket: "my-s3-bucket"
            endpoint: "s3.amazonaws.com"
            region: "us-east-1"
- Run kubectl apply -k kustomize/myconfig
- Wait for the cluster to come up and the base backup to complete.
EXPECTED
- Both the local and s3 repo ./archive dirs will show a backup has occurred (a '*.backup' file will be created), and both ./backup dirs will contain the base backup manifest and files.
ACTUAL
- Both the local and s3 repo ./archive dirs will show a backup has occurred (a '*.backup' file will be created), and the local repo ./backup dir will contain the base backup and manifest files, but the s3 repo ./backup dir shows no sign of the base backup. This makes the s3 repo unusable for replication or a restore.
Logs
Logs from the backup job:
time="2022-09-13T19:58:44Z" level=info msg="crunchy-pgbackrest starts"
time="2022-09-13T19:58:44Z" level=info msg="debug flag set to false"
time="2022-09-13T19:58:44Z" level=info msg="backrest backup command requested"
time="2022-09-13T19:58:44Z" level=info msg="command to execute is [pgbackrest backup --stanza=db --repo=1]"
time="2022-09-13T19:59:11Z" level=info msg="output=[]"
time="2022-09-13T19:59:11Z" level=info msg="stderr=[WARN: option 'repo1-retention-full' is not set for 'repo1-retention-full-type=count', the repository may run out of space\n HINT: to retain full backups indefinitely (without warning), set option 'repo1-retention-full' to the maximum.\nWARN: option 'repo2-retention-full' is not set for 'repo2-retention-full-type=count', the repository may run out of space\n HINT: to retain full backups indefinitely (without warning), set option 'repo2-retention-full' to the maximum.\nWARN: no prior backup exists, incr backup has been changed to full\n]"
time="2022-09-13T19:59:11Z" level=info msg="crunchy-pgbackrest ends"
Additional Information
When only an s3 repo is configured, the backup is stored correctly. I have not tested the inverse scenario of s3 as repo1 and local as repo2.
The issue seems to be occurring due to the way this command is formulated:
// Reconcile the initial backup that is needed to enable replica creation using pgBackRest.
// This is done once stanza creation is successful
if err := r.reconcileReplicaCreateBackup(ctx, postgresCluster, instances,
repoResources.replicaCreateBackupJobs, sa, configHash, replicaCreateRepo); err != nil {
log.Error(err, "unable to reconcile replica creation backup")
result = updateReconcileResult(result, reconcile.Result{Requeue: true})
}
Note that replicaCreateRepo will always be the last volume-mounted repo (if one exists) returned by the earlier call to r.reconcileRepos(). However, for multi-repo deployments, this probably needs to be empty, since all repos will need the base backup (and the --repo switch should be left out of the backup command). Am I reading this incorrectly?
This is expected. You'll need to schedule backups for your repositories as described in https://access.crunchydata.com/documentation/postgres-operator/latest/tutorial/backup-management/
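For reference, here is a minimal sketch of what per-repo scheduled backups look like in the PostgresCluster spec, per the backup management tutorial linked above. The cron expressions below are illustrative placeholders, not recommendations:

```yaml
# Sketch only: adds schedules to each repo so every repository
# (including the s3 one) receives its own backups.
spec:
  backups:
    pgbackrest:
      repos:
        - name: repo1
          schedules:
            full: "0 1 * * 0"           # e.g. weekly full backup
            differential: "0 1 * * 1-6" # e.g. daily differential
        - name: repo2
          schedules:
            full: "0 3 * * 0"           # e.g. weekly full backup to s3
```

Once applied, PGO creates CronJobs that run pgBackRest backups against each listed repo on its own schedule.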
@cbandy I see how this can be done, and appreciate your response. But the documentation on that page is misleading:
PGO sets up your Postgres clusters so that they are continuously archiving the write-ahead log: your data is constantly being stored in your backup repository. Effectively, this is a backup!
However, in a disaster recovery scenario, you likely want to get your Postgres cluster back up and running as quickly as possible (e.g. a short “recovery time objective (RTO)”). What helps accomplish this is to take periodic backups. This makes it faster to restore!
The wording seems to imply that scheduled backups are a best practice, a 'nice to have' if you will. But since the base backup does not occur on external repos, the continuous archiving to an external repo does not (yet) act as a backup! A scheduled backup should be considered a baseline requirement, not just a way to make restores 'faster'.
However, given that a base backup is already happening as part of any cluster deployment, would it not make sense to make this happen for each repo? This would make the behavior consistent with the documented intent, and avoid surprising the user with the variance in behavior between local and remote repos.
As described above, the behavior described in this issue is expected. More specifically, backups need to be scheduled for the various pgBackRest repositories, as described in this earlier comment: https://github.com/CrunchyData/postgres-operator/issues/3381#issuecomment-1247018766.
If you have any additional questions about scheduled backups, or anything else related to disaster recovery within Crunchy Postgres for Kubernetes, please feel free to reach out via the PGO project community Discord server.
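For anyone hitting this before a schedule fires: a one-off backup can also be pointed at the s3 repo using the manual backup feature from the same backup management tutorial. A sketch, assuming the cluster from the example above (the --type option value is illustrative):

```yaml
# Sketch only: declare a manual backup that targets repo2 (the s3 repo).
spec:
  backups:
    pgbackrest:
      manual:
        repoName: repo2
        options:
          - --type=full
```

The backup is then triggered by annotating the cluster, e.g. kubectl annotate postgrescluster postgres1 postgres-operator.crunchydata.com/pgbackrest-backup="$(date)". This populates the s3 repo's ./backup dir with a base backup, making it usable for restores.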