Cloning a Postgres Cluster with the pgbackrest option repo1-storage-verify-tls: "y" fails
Overview
For pgbackrest to be able to communicate with our S3 (NetApp) storage, we have included the necessary certificates and set repo1-storage-ca-file: "/etc/pgbackrest/conf.d/certs.pem" and repo1-storage-verify-tls: "y". With these two pgbackrest options I'm able to back up and restore our Postgres 13 cluster without any problems.
But when I try to clone a Postgres cluster, either from the running cluster or from the backups stored in the S3 storage, the restore pods give the following error: ERROR: [095]: unable to set user-defined CA certificate location.
Once I set repo1-storage-verify-tls: "n" and apply the configuration again, the clone gets restored right away without any issues. Regular backups and restores afterwards only work while repo1-storage-verify-tls stays set to "n". If I set repo1-storage-verify-tls back to "y", the backups start failing with: ERROR: [095]: unable to set user-defined CA certificate location.
Environment
- Platform: Kubernetes in combination with Rancher
- Platform Version: Rancher 2.6.3 with Kubernetes v1.21.8-rancher2-1
- PGO Image Tag: ubi8-5.0.5-0
- Postgres Version: 13
- Storage: Trident & Netapp ONTAP
- Backup Storage: S3
Steps to Reproduce
REPRO
Here is the PostgresCluster definition that I'm using to clone a Postgres cluster from S3 storage:
postgrescluster.yaml:
---
apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: clonedb
spec:
  postgresVersion: 13
  dataSource:
    pgbackrest:
      configuration:
        - secret:
            name: pgo-s3-creds
      stanza: db
      global:
        repo1-path: "/pgbackrest/testdb"
        repo1-s3-port: "443"
        repo1-s3-uri-style: "path"
        repo1-storage-ca-file: "/etc/pgbackrest/conf.d/certs.pem"
        repo1-storage-verify-tls: "y"
      repo:
        name: repo1
        s3:
          bucket: "my-bucket"
          endpoint: "my-endpoint"
          region: "my-region"
      options:
        - --type=immediate
  instances:
    - name: instance1
      replicas: 3
      dataVolumeClaimSpec:
        accessModes:
          - "ReadWriteOnce"
        resources:
          requests:
            storage: 1Gi
      resources:
        limits:
          cpu: 100m
          memory: 100Mi
        requests:
          cpu: 50m
          memory: 50Mi
      sidecars:
        replicaCertCopy:
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 50m
              memory: 50Mi
  patroni:
    dynamicConfiguration:
      postgresql:
        parameters:
          timezone: 'Europe/Amsterdam'
  backups:
    pgbackrest:
      configuration:
        - secret:
            name: pgo-s3-creds
      repos:
        - name: repo1
          s3:
            bucket: "my-bucket"
            endpoint: "my-endpoint"
            region: "my-region"
          schedules:
            full: "0 1 * * *"
            incremental: "0 */1 * * *"
      global:
        repo1-retention-full: "5"
        repo1-retention-full-type: time
        repo1-s3-port: "443"
        repo1-s3-uri-style: "path"
        repo1-storage-ca-file: "/etc/pgbackrest/conf.d/certs.pem"
        repo1-storage-verify-tls: "y"
        repo1-type: "s3"
        repo1-path: "/pgbackrest/clonedb"
      manual:
        repoName: repo1
        options:
          - --type=full
kustomization.yaml:
namespace: my-cluster
secretGenerator:
  - name: pgo-s3-creds
    files:
      - s3-ca-certificate.pem
      - s3.conf
generatorOptions:
  disableNameSuffixHash: true
resources:
  - postgrescluster.yaml
s3.conf:
[global]
repo1-s3-key=my-key
repo1-s3-key-secret=my-key-secret
And of course a file named s3-ca-certificate.pem with the certificates in it.
Apply the above code with:
kubectl apply -n my-cluster -k .
This will result in the restore pod throwing errors.
Setting repo1-storage-verify-tls to "n" within the dataSource part lets the restore pod run and restore the Postgres cluster. But starting a backup on this new cluster then fails again.
Setting repo1-storage-verify-tls to "n" within the backups part as well allows backups of this new Postgres cluster to be created again.
EXPECTED
The new Postgres cluster created by cloning from S3 storage should be up and running, with pgbackrest accepting the included certificates and repo1-storage-verify-tls: "y". Backups with pgbackrest should also succeed.
ACTUAL
The restore pod that gets deployed to the namespace shows errors about not being able to find the CA certificate location. This leads to many restore pods showing the same error and no working Postgres cluster.
Logs
Log of the restore pod when repo1-storage-verify-tls is set to "y" within both the dataSource and backups parts of postgrescluster.yaml:
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
NOTE: process will continue without log file.
WARN: --delta or --force specified but unable to find 'PG_VERSION' or 'backup.manifest' in '/pgdata/pg13' to confirm that this is a valid $PGDATA directory. --delta and --force have been disabled and if any files exist in the destination directories the restore will be aborted.
ERROR: [095]: unable to set user-defined CA certificate location: [33558530] No such file or directory
Log of the Postgres operator when trying to back up after repo1-storage-verify-tls is set to "n" within the dataSource part and to "y" within the backups part of postgrescluster.yaml:
time="2022-04-08T08:35:52Z" level=error msg="unable to create stanza" error="command terminated with exit code 95: ERROR: [095]: unable to set user-defined CA certificate location: [33558530] No such file or directory\n" file="internal/controller/postgrescluster/pgbackrest.go:2578" func="postgrescluster.(*Reconciler).reconcileStanzaCreate" name=clonedb namespace=dba-postgrescluster-clonedb-s3 reconciler=pgBackRest reconciler group=postgres-operator.crunchydata.com reconciler kind=PostgresCluster version=5.0.5-0
time="2022-04-08T08:35:52Z" level=debug msg=Warning file="sigs.k8s.io/[email protected]/pkg/internal/recorder/recorder.go:98" func="recorder.(*Provider).getBroadcaster.func1.1" message="command terminated with exit code 95: ERROR: [095]: unable to set user-defined CA certificate location: [33558530] No such file or directory\n" object="{PostgresCluster dba-postgrescluster-clonedb-s3 clonedb 17c02c9d-ab83-49bd-9a2e-737a6a2764bd postgres-operator.crunchydata.com/v1beta1 21792696 }" reason=UnableToCreateStanzas version=5.0.5-0
Can you please help me fix this issue? We would like to keep using pgbackrest with TLS validation on.
Looking at your kustomization file and your cluster config, I'm wondering about the mismatching certificate names:
kustomization.yaml generates: s3-ca-certificate.pem
cluster config uses: repo1-storage-ca-file: "/etc/pgbackrest/conf.d/certs.pem"
Shouldn't both be the same?
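If the name mismatch does turn out to be the cause, one way to line the two up is to publish the certificate in the Secret under the key the cluster config already expects. The sketch below is untested against this cluster and assumes that each key of pgo-s3-creds ends up as a file of the same name under /etc/pgbackrest/conf.d/, which is what the path in repo1-storage-ca-file suggests. It relies on the kustomize secretGenerator `key=path` form for entries under `files`.

```yaml
# kustomization.yaml (sketch, not a confirmed fix)
namespace: my-cluster
secretGenerator:
  - name: pgo-s3-creds
    files:
      # key=path: store s3-ca-certificate.pem under the key "certs.pem",
      # matching repo1-storage-ca-file: "/etc/pgbackrest/conf.d/certs.pem"
      - certs.pem=s3-ca-certificate.pem
      - s3.conf
generatorOptions:
  disableNameSuffixHash: true
resources:
  - postgrescluster.yaml
```

The alternative is to leave the kustomization as it is and change repo1-storage-ca-file to "/etc/pgbackrest/conf.d/s3-ca-certificate.pem" in both the dataSource and the backups global sections.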