Operator 1.14.0 and configAwsOrGcp.log_s3_bucket break clusters
TL;DR: a compact explanation is here: https://github.com/zalando/postgres-operator/issues/2852#issuecomment-2656008949
First of all, sorry for the long logs and the unstructured message. To write a clean issue you need at least some understanding of what is happening, and I have no idea yet. I read the release notes for 1.12, 1.13 and 1.14 and decided I could upgrade straight to 1.14.0. But...
A few kilobytes of logs and perplexity follow.
After upgrading postgres-operator from 1.11.0 to 1.14.0, my clusters won't start up:
$ kubectl get postgresqls.acid.zalan.do -A
NAMESPACE NAME TEAM VERSION PODS VOLUME CPU-REQUEST MEMORY-REQUEST AGE STATUS
brandadmin-staging brandadmin-pg develop 16 1 100Gi 1 500Mi 429d SyncFailed
ga games-aggregator-pg games-aggregator 16 2 125Gi 1000m 512Mi 157d SyncFailed
payments payments-pg develop 16 1 20Gi 1 500Mi 457d Running
sprint-reports asana-automate-db sprint 16 1 25Gi 1 500Mi 358d Running
staging develop-postgresql develop 17 2 250Gi 1 2Gi 435d UpdateFailed
Three clusters started successfully with the updated Spilo image (payments-pg, asana-automate-db and develop-postgresql) and two did not (brandadmin-pg and games-aggregator-pg). Before I noticed that not all clusters had been updated, I initiated a 16 -> 17 upgrade on the develop-postgresql cluster, and it got stuck with the same symptoms (at first I thought the upgrade was the reason, but now I don't think so, see below):
2025-01-23 15:59:28,706 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
File "/scripts/configure_spilo.py", line 1197, in <module>
main()
File "/scripts/configure_spilo.py", line 1159, in main
write_log_environment(placeholders)
File "/scripts/configure_spilo.py", line 794, in write_log_environment
tags = json.loads(os.getenv('LOG_S3_TAGS'))
File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
and no more logs after that.
Some clusters managed to start even though the same error appears in their logs:
$ kubectl -n sprint-reports logs asana-automate-db-0
2025-01-23 15:38:54,983 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:38:55,040 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:38:55,043 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:38:55,191 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:38:55,192 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:38:55,775 - bootstrapping - INFO - Configuring pgqd
2025-01-23 15:38:55,776 - bootstrapping - INFO - Configuring wal-e
2025-01-23 15:38:55,777 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 15:38:55,777 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 15:38:55,778 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 15:38:55,779 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 15:38:55,780 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 15:38:55,780 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 15:38:55,781 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 15:38:55,782 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 15:38:55,782 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 15:38:55,793 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 15:38:55,793 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:38:55,794 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 15:38:55,808 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 15:38:55,808 - bootstrapping - INFO - Configuring patroni
2025-01-23 15:38:55,826 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 15:38:55,827 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
File "/scripts/configure_spilo.py", line 1197, in <module>
main()
File "/scripts/configure_spilo.py", line 1159, in main
write_log_environment(placeholders)
File "/scripts/configure_spilo.py", line 794, in write_log_environment
tags = json.loads(os.getenv('LOG_S3_TAGS'))
File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-01-23 15:38:57,916 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2025-01-23 15:38:57,974 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-23 15:38:57,995 WARNING: Postgresql is not running.
2025-01-23 15:38:57,995 INFO: Lock owner: ; I am asana-automate-db-0
2025-01-23 15:38:58,000 INFO: pg_controldata:
After I deleted this pod, it got stuck too!
Processes inside a pod of one of the failed clusters:
root@develop-postgresql-0:/home/postgres# ps ax
PID TTY STAT TIME COMMAND
1 ? Ss 0:00 /usr/bin/dumb-init -c --rewrite 1:0 -- /bin/sh /launch.sh
7 ? S 0:00 /bin/sh /launch.sh
20 ? S 0:00 /usr/bin/runsvdir -P /etc/service
21 ? Ss 0:00 runsv pgqd
22 ? S 0:00 /bin/bash /scripts/patroni_wait.sh --role primary -- /usr/bin/pgqd /home/postgres/pgq_ticker.ini
83 ? S 0:00 sleep 60
84 pts/0 Ss 0:00 bash
97 pts/0 R+ 0:00 ps ax
After one more deletion it managed to start.
I noticed one thing in the logs: sometimes the container starts with the WAL-E variables, sometimes it doesn't. The operator reports the cluster status as OK, but it is not:
$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 15:38:43,529 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:38:43,587 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:38:43,588 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:38:43,726 - bootstrapping - INFO - Configuring wal-e
2025-01-23 15:38:43,727 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 15:38:43,728 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 15:38:43,728 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 15:38:43,729 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 15:38:43,729 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 15:38:43,730 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 15:38:43,731 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 15:38:43,731 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 15:38:43,732 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 15:38:43,733 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 15:38:43,733 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 15:38:43,736 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 15:38:43,736 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:38:43,736 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:38:43,910 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:38:43,910 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:38:43,931 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
File "/scripts/configure_spilo.py", line 1197, in <module>
main()
File "/scripts/configure_spilo.py", line 1159, in main
write_log_environment(placeholders)
File "/scripts/configure_spilo.py", line 794, in write_log_environment
tags = json.loads(os.getenv('LOG_S3_TAGS'))
File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
$ kubectl -n brandadmin-staging get postgresqls.acid.zalan.do -A
NAMESPACE NAME TEAM VERSION PODS VOLUME CPU-REQUEST MEMORY-REQUEST AGE STATUS
brandadmin-staging brandadmin-pg develop 16 1 100Gi 1 500Mi 429d SyncFailed
ga games-aggregator-pg games-aggregator 16 2 125Gi 1000m 512Mi 157d SyncFailed
payments payments-pg develop 16 1 20Gi 1 500Mi 457d Running
sprint-reports asana-automate-db sprint 16 1 25Gi 1 500Mi 358d Running
staging develop-postgresql develop 17 2 250Gi 1 2Gi 435d UpdateFailed
$ kubectl -n brandadmin-staging delete pod brandadmin-pg-0
pod "brandadmin-pg-0" deleted
$ kubectl -n brandadmin-staging get pod
NAME READY STATUS RESTARTS AGE
brand-admin-backend-api-7b7856c75-d2ktr 1/1 Running 0 22h
brand-admin-backend-api-7b7856c75-vczsg 1/1 Running 0 22h
brand-admin-backend-async-tasks-69c5876799-nm4nh 1/1 Running 0 22h
brandadmin-pg-0 1/2 Running 0 82s
$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 15:59:27,840 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 15:59:27,896 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 15:59:27,897 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 15:59:28,051 - bootstrapping - INFO - Configuring crontab
2025-01-23 15:59:28,053 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 15:59:28,070 - bootstrapping - INFO - Configuring certificate
2025-01-23 15:59:28,070 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 15:59:28,706 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
File "/scripts/configure_spilo.py", line 1197, in <module>
main()
File "/scripts/configure_spilo.py", line 1159, in main
write_log_environment(placeholders)
File "/scripts/configure_spilo.py", line 794, in write_log_environment
tags = json.loads(os.getenv('LOG_S3_TAGS'))
File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
$ kubectl -n brandadmin-staging get pod brandadmin-pg-0
NAME READY STATUS RESTARTS AGE
brandadmin-pg-0 1/2 Running 0 81m
$ kubectl -n brandadmin-staging get postgresqls.acid.zalan.do brandadmin-pg
NAME TEAM VERSION PODS VOLUME CPU-REQUEST MEMORY-REQUEST AGE STATUS
brandadmin-pg develop 16 1 100Gi 1 500Mi 429d Running
While I was writing this issue about an hour passed; in despair I restarted this failed pod one more time and it STARTED (the postgres container became Ready), but it is still not working:
kubectl -n brandadmin-staging delete pod brandadmin-pg-0
pod "brandadmin-pg-0" deleted
$ kubectl -n brandadmin-staging describe pod brandadmin-pg-0
Name: brandadmin-pg-0
Namespace: brandadmin-staging
Priority: 0
Service Account: postgres-pod
Node: pri-staging-wx2ci/10.106.0.35
Start Time: Thu, 23 Jan 2025 18:26:41 +0100
Labels: application=spilo
apps.kubernetes.io/pod-index=0
cluster-name=brandadmin-pg
controller-revision-hash=brandadmin-pg-5f65fc8dbd
spilo-role=master
statefulset.kubernetes.io/pod-name=brandadmin-pg-0
team=develop
Annotations: prometheus.io/path: /metrics
prometheus.io/port: 9187
prometheus.io/scrape: true
status:
{"conn_url":"postgres://10.244.2.104:5432/postgres","api_url":"http://10.244.2.104:8008/patroni","state":"running","role":"primary","versi...
Status: Running
IP: 10.244.2.104
IPs:
IP: 10.244.2.104
Controlled By: StatefulSet/brandadmin-pg
Containers:
postgres:
Container ID: containerd://d67d695d8bce177e07b0ec3c23efbe59cc5349cb81e95abea6ba6e913fe7d836
Image: ghcr.io/zalando/spilo-17:4.0-p2
Image ID: ghcr.io/zalando/spilo-17@sha256:23861da069941ff5345e6a97455e60a63fc2f16c97857da8f85560370726cbe7
Ports: 8008/TCP, 5432/TCP, 8080/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
State: Running
Started: Thu, 23 Jan 2025 18:26:46 +0100
Ready: True
Restart Count: 0
Limits:
cpu: 10
memory: 6Gi
Requests:
cpu: 1
memory: 500Mi
Readiness: http-get http://:8008/readiness delay=6s timeout=5s period=10s #success=1 #failure=3
Environment:
SCOPE: brandadmin-pg
PGROOT: /home/postgres/pgdata/pgroot
POD_IP: (v1:status.podIP)
POD_NAMESPACE: brandadmin-staging (v1:metadata.namespace)
PGUSER_SUPERUSER: postgres
KUBERNETES_SCOPE_LABEL: cluster-name
KUBERNETES_ROLE_LABEL: spilo-role
PGPASSWORD_SUPERUSER: <set to the key 'password' in secret 'postgres.brandadmin-pg.credentials.postgresql.acid.zalan.do'> Optional: false
PGUSER_STANDBY: standby
PGPASSWORD_STANDBY: <set to the key 'password' in secret 'standby.brandadmin-pg.credentials.postgresql.acid.zalan.do'> Optional: false
PAM_OAUTH2: https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
HUMAN_ROLE: zalandos
PGVERSION: 16
KUBERNETES_LABELS: {"application":"spilo"}
SPILO_CONFIGURATION: {"postgresql":{"parameters":{"shared_buffers":"1536MB"}},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"dcs":{"postgresql":{"parameters":{"checkpoint_completion_target":"0.9","default_statistics_target":"100","effective_cache_size":"4608MB","effective_io_concurrency":"200","hot_standby_feedback":"on","huge_pages":"off","jit":"false","maintenance_work_mem":"384MB","max_connections":"100","max_standby_archive_delay":"900s","max_standby_streaming_delay":"900s","max_wal_size":"4GB","min_wal_size":"1GB","random_page_cost":"1.1","wal_buffers":"16MB","work_mem":"7864kB"}},"failsafe_mode":true}}}
DCS_ENABLE_KUBERNETES_API: true
ALLOW_NOSSL: true
AWS_ACCESS_KEY_ID: xxxx
AWS_ENDPOINT: https://fra1.digitaloceanspaces.com
AWS_SECRET_ACCESS_KEY: xxxx
CLONE_AWS_ACCESS_KEY_ID: xxx
CLONE_AWS_ENDPOINT: https://fra1.digitaloceanspaces.com
CLONE_AWS_SECRET_ACCESS_KEY: xxxx
LOG_S3_ENDPOINT: https://fra1.digitaloceanspaces.com
WAL_S3_BUCKET: xxx-staging-db-wal
WAL_BUCKET_SCOPE_SUFFIX: /79c4fff8-6efb-477a-83bc-a43d34e8160a
WAL_BUCKET_SCOPE_PREFIX:
LOG_S3_BUCKET: xxx-staging-db-backups-all
LOG_BUCKET_SCOPE_SUFFIX: /79c4fff8-6efb-477a-83bc-a43d34e8160a
LOG_BUCKET_SCOPE_PREFIX:
Mounts:
/dev/shm from dshm (rw)
/home/postgres/pgdata from pgdata (rw)
/var/run/postgresql from postgresql-run (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9mghg (ro)
exporter:
Container ID: containerd://48c54ad6591eaf9e60aa92b3235cb4878900fb46e94aacfeedcb70465d005619
Image: quay.io/prometheuscommunity/postgres-exporter:latest
Image ID: quay.io/prometheuscommunity/postgres-exporter@sha256:6999a7657e2f2fb0ca6ebf417213eebf6dc7d21b30708c622f6fcb11183a2bb0
Port: 9187/TCP
Host Port: 0/TCP
State: Running
Started: Thu, 23 Jan 2025 18:26:47 +0100
Ready: True
Restart Count: 0
Limits:
cpu: 500m
memory: 256Mi
Requests:
cpu: 100m
memory: 200Mi
Environment:
POD_NAME: brandadmin-pg-0 (v1:metadata.name)
POD_NAMESPACE: brandadmin-staging (v1:metadata.namespace)
POSTGRES_USER: postgres
POSTGRES_PASSWORD: <set to the key 'password' in secret 'postgres.brandadmin-pg.credentials.postgresql.acid.zalan.do'> Optional: false
DATA_SOURCE_URI: 127.0.0.1:5432
DATA_SOURCE_USER: $(POSTGRES_USER)
DATA_SOURCE_PASS: $(POSTGRES_PASSWORD)
PG_EXPORTER_AUTO_DISCOVER_DATABASES: true
Mounts:
/home/postgres/pgdata from pgdata (rw)
/var/run/postgresql from postgresql-run (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9mghg (ro)
Conditions:
Type Status
PodReadyToStartContainers True
Initialized True
Ready True
ContainersReady True
PodScheduled True
Volumes:
pgdata:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: pgdata-brandadmin-pg-0
ReadOnly: false
dshm:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
postgresql-run:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
SizeLimit: <unset>
kube-api-access-9mghg:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
workloadKind=postgres:NoSchedule
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 22s default-scheduler Successfully assigned brandadmin-staging/brandadmin-pg-0 to pri-staging-wx2ci
Normal Pulled 18s kubelet Container image "ghcr.io/zalando/spilo-17:4.0-p2" already present on machine
Normal Created 18s kubelet Created container postgres
Normal Started 18s kubelet Started container postgres
Normal Pulling 18s kubelet Pulling image "quay.io/prometheuscommunity/postgres-exporter:latest"
Normal Pulled 17s kubelet Successfully pulled image "quay.io/prometheuscommunity/postgres-exporter:latest" in 455ms (455ms including waiting). Image size: 11070758 bytes.
Normal Created 17s kubelet Created container exporter
Normal Started 17s kubelet Started container exporter
$ kubectl -n brandadmin-staging logs brandadmin-pg-0
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 17:26:47,349 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 17:26:47,407 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 17:26:47,408 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 17:26:47,460 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 17:26:47,462 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 17:26:47,462 - bootstrapping - INFO - Configuring certificate
2025-01-23 17:26:47,463 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 17:26:47,768 - bootstrapping - INFO - Configuring patroni
2025-01-23 17:26:47,792 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 17:26:47,793 - bootstrapping - INFO - Configuring wal-e
2025-01-23 17:26:47,794 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_PREFIX
2025-01-23 17:26:47,794 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_S3_PREFIX
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ACCESS_KEY_ID
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_SECRET_ACCESS_KEY
2025-01-23 17:26:47,795 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_S3_ENDPOINT
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_ENDPOINT
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_DISABLE_S3_SSE
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DISABLE_S3_SSE
2025-01-23 17:26:47,796 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/AWS_S3_FORCE_PATH_STYLE
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_DOWNLOAD_CONCURRENCY
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALG_UPLOAD_CONCURRENCY
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/USE_WALG_RESTORE
2025-01-23 17:26:47,797 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/WALE_LOG_DESTINATION
2025-01-23 17:26:47,798 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/PGPORT
2025-01-23 17:26:47,798 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/BACKUP_NUM_TO_RETAIN
2025-01-23 17:26:47,801 - bootstrapping - INFO - Writing to file /run/etc/wal-e.d/env/TMPDIR
2025-01-23 17:26:47,802 - bootstrapping - INFO - Configuring crontab
2025-01-23 17:26:47,803 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 17:26:47,816 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-23 17:26:47,817 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-23 17:26:47,817 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 17:26:47,817 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 17:26:47,818 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
File "/scripts/configure_spilo.py", line 1197, in <module>
main()
File "/scripts/configure_spilo.py", line 1159, in main
write_log_environment(placeholders)
File "/scripts/configure_spilo.py", line 794, in write_log_environment
tags = json.loads(os.getenv('LOG_S3_TAGS'))
File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-01-23 17:26:49,683 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2025-01-23 17:26:49,754 INFO: No PostgreSQL configuration items changed, nothing to reload.
2025-01-23 17:26:49,774 WARNING: Postgresql is not running.
2025-01-23 17:26:49,775 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:26:49,781 INFO: pg_controldata:
pg_control version number: 1300
Catalog version number: 202307071
Database system identifier: 7369539194529993100
Database cluster state: shut down
pg_control last modified: Thu Jan 23 17:32:16 2025
Latest checkpoint location: 5A/82000028
Latest checkpoint's REDO location: 5A/82000028
Latest checkpoint's REDO WAL file: 0000001B0000005A00000082
Latest checkpoint's TimeLineID: 27
Latest checkpoint's PrevTimeLineID: 27
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:929334
Latest checkpoint's NextOID: 873526
Latest checkpoint's NextMultiXactId: 19
Latest checkpoint's NextMultiOffset: 37
Latest checkpoint's oldestXID: 717
Latest checkpoint's oldestXID's DB: 5
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 5
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Thu Jan 23 17:32:16 2025
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: 389f9007f77b578836bfcee51eabb488b11042d00c48d0d84e718a826ce23d29
2025-01-23 17:32:36,148 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:36,326 INFO: starting as a secondary
2025-01-23 17:32:36 UTC [51]: [1-1] 67927d34.33 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-23 17:32:36 UTC [51]: [2-1] 67927d34.33 0 LOG: pg_stat_kcache.linux_hz is set to 125000
2025-01-23 17:32:36 UTC [51]: [3-1] 67927d34.33 0 FATAL: could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-23 17:32:36 UTC [51]: [4-1] 67927d34.33 0 LOG: database system is shut down
2025-01-23 17:32:36,971 INFO: postmaster pid=51
/var/run/postgresql:5432 - no response
2025-01-23 17:32:46,146 WARNING: Postgresql is not running.
2025-01-23 17:32:46,146 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:46,149 INFO: pg_controldata:
pg_control version number: 1300
Catalog version number: 202307071
Database system identifier: 7369539194529993100
Database cluster state: shut down
pg_control last modified: Thu Jan 23 17:32:16 2025
Latest checkpoint location: 5A/82000028
Latest checkpoint's REDO location: 5A/82000028
Latest checkpoint's REDO WAL file: 0000001B0000005A00000082
Latest checkpoint's TimeLineID: 27
Latest checkpoint's PrevTimeLineID: 27
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID: 0:929334
Latest checkpoint's NextOID: 873526
Latest checkpoint's NextMultiXactId: 19
Latest checkpoint's NextMultiOffset: 37
Latest checkpoint's oldestXID: 717
Latest checkpoint's oldestXID's DB: 5
Latest checkpoint's oldestActiveXID: 0
Latest checkpoint's oldestMultiXid: 1
Latest checkpoint's oldestMulti's DB: 5
Latest checkpoint's oldestCommitTsXid: 0
Latest checkpoint's newestCommitTsXid: 0
Time of latest checkpoint: Thu Jan 23 17:32:16 2025
Fake LSN counter for unlogged rels: 0/3E8
Minimum recovery ending location: 0/0
Min recovery ending loc's timeline: 0
Backup start location: 0/0
Backup end location: 0/0
End-of-backup record required: no
wal_level setting: replica
wal_log_hints setting: on
max_connections setting: 100
max_worker_processes setting: 8
max_wal_senders setting: 10
max_prepared_xacts setting: 0
max_locks_per_xact setting: 64
track_commit_timestamp setting: off
Maximum data alignment: 8
Database block size: 8192
Blocks per segment of large relation: 131072
WAL block size: 8192
Bytes per WAL segment: 16777216
Maximum length of identifiers: 64
Maximum columns in an index: 32
Maximum size of a TOAST chunk: 1996
Size of a large-object chunk: 2048
Date/time type storage: 64-bit integers
Float8 argument passing: by value
Data page checksum version: 0
Mock authentication nonce: 389f9007f77b578836bfcee51eabb488b11042d00c48d0d84e718a826ce23d29
2025-01-23 17:32:46,162 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:46,190 INFO: starting as a secondary
2025-01-23 17:32:46 UTC [62]: [1-1] 67927d3e.3e 0 LOG: Auto detecting pg_stat_kcache.linux_hz parameter...
2025-01-23 17:32:46 UTC [62]: [2-1] 67927d3e.3e 0 LOG: pg_stat_kcache.linux_hz is set to 1000000
2025-01-23 17:32:46 UTC [62]: [3-1] 67927d3e.3e 0 FATAL: could not load server certificate file "/run/certs/server.crt": No such file or directory
2025-01-23 17:32:46 UTC [62]: [4-1] 67927d3e.3e 0 LOG: database system is shut down
2025-01-23 17:32:46,821 INFO: postmaster pid=62
/var/run/postgresql:5432 - no response
2025-01-23 17:32:56,143 WARNING: Postgresql is not running.
2025-01-23 17:32:56,144 INFO: Lock owner: ; I am brandadmin-pg-0
2025-01-23 17:32:56,146 INFO: pg_controldata:
pg_control version number: 1300
Catalog version number: 202307071
None of my two-node clusters can start the replica node; the problem is probably with the WAL variables...
$ kubectl -n staging exec -it develop-postgresql-0 -- patronictl topology
Defaulted container "postgres" out of: postgres, exporter
+ Cluster: develop-postgresql (7369262358642845868) --------+----+-----------+
| Member | Host | Role | State | TL | Lag in MB |
+------------------------+--------------+---------+---------+----+-----------+
| develop-postgresql-0 | 10.244.0.253 | Leader | running | 39 | |
| + develop-postgresql-1 | | Replica | | | unknown |
+------------------------+--------------+---------+---------+----+-----------+
$ kubectl -n staging logs develop-postgresql-0 | head -20
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 16:20:51,723 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 16:20:51,766 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 16:20:51,767 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 16:20:51,823 - bootstrapping - INFO - Configuring patroni
2025-01-23 16:20:51,846 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-01-23 16:20:51,847 - bootstrapping - INFO - Configuring pgbouncer
2025-01-23 16:20:51,847 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring pam-oauth2
2025-01-23 16:20:51,848 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-01-23 16:20:51,848 - bootstrapping - INFO - Configuring crontab
2025-01-23 16:20:51,848 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 16:20:51,868 - bootstrapping - INFO - Configuring certificate
2025-01-23 16:20:51,868 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-01-23 16:20:53,422 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 16:20:53,423 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
File "/scripts/configure_spilo.py", line 1197, in <module>
main()
File "/scripts/configure_spilo.py", line 1159, in main
$ kubectl -n staging exec -it develop-postgresql-0 -- tail /home/postgres/pgdata/pgroot/pg_log/postgresql-4.log
Defaulted container "postgres" out of: postgres, exporter
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
chpst: fatal: unable to switch to directory: /run/etc/wal-e.d/env: file does not exist
$ kubectl -n staging logs develop-postgresql-1 | head -20
Defaulted container "postgres" out of: postgres, exporter
2025-01-23 16:38:15,383 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-01-23 16:38:15,424 - bootstrapping - INFO - No meta-data available for this provider
2025-01-23 16:38:15,424 - bootstrapping - INFO - Looks like you are running unsupported
2025-01-23 16:38:15,473 - bootstrapping - INFO - Configuring bootstrap
2025-01-23 16:38:15,473 - bootstrapping - INFO - Configuring crontab
2025-01-23 16:38:15,473 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-01-23 16:38:15,482 - bootstrapping - INFO - Configuring standby-cluster
2025-01-23 16:38:15,482 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
File "/scripts/configure_spilo.py", line 1197, in <module>
main()
File "/scripts/configure_spilo.py", line 1159, in main
write_log_environment(placeholders)
File "/scripts/configure_spilo.py", line 794, in write_log_environment
tags = json.loads(os.getenv('LOG_S3_TAGS'))
File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
$ kubectl -n staging exec -it develop-postgresql-1 -- tail /home/postgres/pgdata/pgroot/pg_log/postgresql-4.log
Defaulted container "postgres" out of: postgres, exporter
STRUCTURED: time=2025-01-23T16:30:08.235670-00 pid=8215 action=push-wal key=s3://xxx-staging-db-wal/spilo/develop-postgresql/939ea78b-0caf-458f-a088-989352a97300/wal/16/wal_005/00000026000006700000009C.lzo prefix=spilo/develop-postgresql/939ea78b-0caf-458f-a088-989352a97300/wal/16/ rate=18353.3 seg=00000026000006700000009C state=complete
2025-01-23 16:30:12 UTC [8234]: [5-1] 67926e94.202a 0 LOG: ending log output to stderr
2025-01-23 16:30:12 UTC [8234]: [6-1] 67926e94.202a 0 HINT: Future log output will go to log destination "csvlog".
ERROR: 2025/01/23 16:30:12.698764 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:13.204088 Archive '00000026000006700000009D' does not exist.
ERROR: 2025/01/23 16:30:13.573033 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:13.845528 Archive '00000028.history' does not exist.
ERROR: 2025/01/23 16:30:14.117082 Archive '00000027.history' does not exist.
ERROR: 2025/01/23 16:30:14.478060 Archive '00000027000006700000009D' does not exist.
ERROR: 2025/01/23 16:30:14.807988 Archive '00000026000006700000009D' does not exist.
$ kubectl -n staging describe pod develop-postgresql-0
Name: develop-postgresql-0
Namespace: staging
Priority: 0
Service Account: postgres-pod
Node: pri-staging-wx2cv/10.106.0.46
Start Time: Thu, 23 Jan 2025 17:20:44 +0100
Labels: application=spilo
apps.kubernetes.io/pod-index=0
cluster-name=develop-postgresql
controller-revision-hash=develop-postgresql-5f869975bf
spilo-role=master
statefulset.kubernetes.io/pod-name=develop-postgresql-0
team=develop
Annotations: prometheus.io/path: /metrics
prometheus.io/port: 9187
prometheus.io/scrape: true
status:
{"conn_url":"postgres://10.244.0.253:5432/postgres","api_url":"http://10.244.0.253:8008/patroni","state":"running","role":"primary","versi...
Status: Running
IP: 10.244.0.253
IPs:
IP: 10.244.0.253
Controlled By: StatefulSet/develop-postgresql
Containers:
postgres:
Container ID: containerd://5004728ea5d71484a313b6124f2534a839da5ef0527427cec1942f135aa33e93
Image: ghcr.io/zalando/spilo-17:4.0-p2
Image ID: ghcr.io/zalando/spilo-17@sha256:23861da069941ff5345e6a97455e60a63fc2f16c97857da8f85560370726cbe7
Ports: 8008/TCP, 5432/TCP, 8080/TCP
Host Ports: 0/TCP, 0/TCP, 0/TCP
State: Running
Started: Thu, 23 Jan 2025 17:20:50 +0100
Ready: True
Restart Count: 0
Limits:
cpu: 10
memory: 13500Mi
Requests:
cpu: 1
memory: 2Gi
Readiness: http-get http://:8008/readiness delay=6s timeout=5s period=10s #success=1 #failure=3
Environment:
SCOPE: develop-postgresql
PGROOT: /home/postgres/pgdata/pgroot
POD_IP: (v1:status.podIP)
POD_NAMESPACE: staging (v1:metadata.namespace)
PGUSER_SUPERUSER: postgres
KUBERNETES_SCOPE_LABEL: cluster-name
KUBERNETES_ROLE_LABEL: spilo-role
PGPASSWORD_SUPERUSER: <set to the key 'password' in secret 'postgres.develop-postgresql.credentials.postgresql.acid.zalan.do'> Optional: false
PGUSER_STANDBY: standby
PGPASSWORD_STANDBY: <set to the key 'password' in secret 'standby.develop-postgresql.credentials.postgresql.acid.zalan.do'> Optional: false
PAM_OAUTH2: https://info.example.com/oauth2/tokeninfo?access_token= uid realm=/employees
HUMAN_ROLE: zalandos
PGVERSION: 17
KUBERNETES_LABELS: {"application":"spilo"}
SPILO_CONFIGURATION: {"postgresql":{"parameters":{"shared_buffers":"3GB","shared_preload_libraries":"bg_mon,pg_stat_statements,pgextwlist,pg_auth_mon,set_user,pg_cron,pg_stat_kcache,decoderbufs"}},"bootstrap":{"initdb":[{"auth-host":"md5"},{"auth-local":"trust"}],"dcs":{"postgresql":{"parameters":{"checkpoint_completion_target":"0.9","default_statistics_target":"100","effective_cache_size":"9GB","effective_io_concurrency":"200","hot_standby_feedback":"on","huge_pages":"off","jit":"false","maintenance_work_mem":"768MB","max_connections":"200","max_parallel_maintenance_workers":"4","max_parallel_workers":"8","max_parallel_workers_per_gather":"4","max_standby_archive_delay":"900s","max_standby_streaming_delay":"900s","max_wal_size":"4GB","max_worker_processes":"8","min_wal_size":"1GB","random_page_cost":"1.1","wal_buffers":"16MB","wal_level":"logical","work_mem":"4MB"}},"failsafe_mode":true}}}
DCS_ENABLE_KUBERNETES_API: true
ALLOW_NOSSL: true
AWS_ACCESS_KEY_ID: xxx
AWS_ENDPOINT: https://fra1.digitaloceanspaces.com
AWS_SECRET_ACCESS_KEY: xxx
CLONE_AWS_ACCESS_KEY_ID: xxx
CLONE_AWS_ENDPOINT: https://fra1.digitaloceanspaces.com
CLONE_AWS_SECRET_ACCESS_KEY: xxx
LOG_S3_ENDPOINT: https://fra1.digitaloceanspaces.com
WAL_S3_BUCKET: xxx-staging-db-wal
WAL_BUCKET_SCOPE_SUFFIX: /939ea78b-0caf-458f-a088-989352a97300
WAL_BUCKET_SCOPE_PREFIX:
LOG_S3_BUCKET: xxx-staging-db-backups-all
LOG_BUCKET_SCOPE_SUFFIX: /939ea78b-0caf-458f-a088-989352a97300
LOG_BUCKET_SCOPE_PREFIX:
Mounts:
It's a complete mess!
The operator is installed with Helm and Terraform and configured with a ConfigMap:
resource "kubectl_manifest" "postgres-pod-config" {
yaml_body = <<-EOF
apiVersion: v1
kind: ConfigMap
metadata:
name: postgres-pod-config
namespace: ${var.namespace}
data:
ALLOW_NOSSL: "true"
# WAL archiving and physical basebackups for PITR
AWS_ENDPOINT: ${local.s3_endpoint}
AWS_SECRET_ACCESS_KEY: ${local.s3_secret_key}
AWS_ACCESS_KEY_ID: ${local.s3_access_id}
# default values for cloning a cluster (same as above)
CLONE_AWS_ENDPOINT: ${local.clone_s3_endpoint}
CLONE_AWS_SECRET_ACCESS_KEY: ${local.clone_s3_secret_key}
CLONE_AWS_ACCESS_KEY_ID: ${local.clone_s3_access_id}
# send pg_logs to s3 (work in progress)
LOG_S3_ENDPOINT: ${local.s3_endpoint}
EOF
}
resource "helm_release" "postgres-operator" {
name = "postgres-operator"
namespace = var.namespace
chart = "postgres-operator"
repository = "https://opensource.zalando.com/postgres-operator/charts/postgres-operator"
version = "1.14.0"
depends_on = [kubectl_manifest.postgres-pod-config]
dynamic "set" {
for_each = var.wal_backup ? ["yes"] : []
content {
name = "configAwsOrGcp.wal_s3_bucket"
value = local.bucket_name_wal
}
}
dynamic "set" {
for_each = var.log_backup ? ["yes"] : []
content {
name = "configAwsOrGcp.log_s3_bucket"
value = "${var.name}-db-backups-all" # bucket with logical backups; 15 days ttl
}
}
set {
name = "configLogicalBackup.logical_backup_s3_access_key_id"
value = local.s3_access_id
}
set {
name = "configLogicalBackup.logical_backup_s3_bucket"
value = local.bucket_name_backups
}
set {
name = "configLogicalBackup.logical_backup_s3_region"
value = var.bucket_region
}
set {
name = "configLogicalBackup.logical_backup_s3_endpoint"
value = local.s3_endpoint
}
set {
name = "configKubernetes.pod_environment_configmap"
value = "${var.namespace}/postgres-pod-config"
}
set {
name = "configLogicalBackup.logical_backup_s3_secret_access_key"
value = local.s3_secret_key
}
values = [<<-YAML
configConnectionPooler:
connection_pooler_image: "registry.xxx.com/devops/postgres-zalando-pgbouncer:master-32"
configLogicalBackup:
logical_backup_docker_image: "registry.xxx.com/devops/postgres-logical-backup:0.6"
logical_backup_schedule: "32 8 * * *"
logical_backup_s3_retention_time: "2 week"
configKubernetes:
enable_pod_antiaffinity: true
# it doesn't influence pulling of images from public repos (like operator image) if there is no such secret
# but will help to fetch postgres-logical-backup image
pod_service_account_definition: |
apiVersion: v1
kind: ServiceAccount
metadata:
name: postgres-pod
imagePullSecrets:
- name: gitlab-registry-token
# became disabled by default since 1.9.0 https://github.com/zalando/postgres-operator/releases/tag/v1.9.0
# Quote: "We recommend enable_readiness_probe: true with pod_management_policy: parallel"
enable_readiness_probe: true
# Quote: "We recommend enable_readiness_probe: true with pod_management_policy: parallel"
pod_management_policy: "parallel"
enable_sidecars: true
share_pgsocket_with_sidecars: true
custom_pod_annotations:
prometheus.io/scrape: "true"
prometheus.io/path: "/metrics"
prometheus.io/port: "9187"
configPatroni:
# https://patroni.readthedocs.io/en/master/dcs_failsafe_mode.html
enable_patroni_failsafe_mode: true
configGeneral:
sidecars:
- name: exporter
image: quay.io/prometheuscommunity/postgres-exporter:latest
ports:
- name: exporter
containerPort: 9187
protocol: TCP
resources:
limits:
cpu: 500m
memory: 256Mi
requests:
cpu: 100m
memory: 200Mi
env:
- name: DATA_SOURCE_URI
value: "127.0.0.1:5432"
- name: DATA_SOURCE_USER
value: "$(POSTGRES_USER)"
- name: DATA_SOURCE_PASS
value: "$(POSTGRES_PASSWORD)"
- name: PG_EXPORTER_AUTO_DISCOVER_DATABASES
value: "true"
YAML
]
}
I tracked the issue down to this Helm chart setting: configAwsOrGcp.log_s3_bucket. Long ago I wanted to ship logs to S3 storage, but there was no way to specify a custom endpoint at that time, so I put it off until later. Everything was OK until chart version 1.14.0 and spilo-17:4.0-p2.
Container startup without the option:
2025-02-12 14:54:48,556 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-02-12 14:54:50,560 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
2025-02-12 14:54:50,561 - bootstrapping - INFO - No meta-data available for this provider
2025-02-12 14:54:50,561 - bootstrapping - INFO - Looks like you are running local
2025-02-12 14:54:50,595 - bootstrapping - INFO - Configuring pgqd
2025-02-12 14:54:50,596 - bootstrapping - INFO - Configuring patroni
2025-02-12 14:54:50,601 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-02-12 14:54:50,601 - bootstrapping - INFO - Configuring pgbouncer
2025-02-12 14:54:50,601 - bootstrapping - INFO - No PGBOUNCER_CONFIGURATION was specified, skipping
2025-02-12 14:54:50,602 - bootstrapping - INFO - Configuring wal-e
2025-02-12 14:54:50,602 - bootstrapping - INFO - Configuring log
2025-02-12 14:54:50,602 - bootstrapping - INFO - Configuring pam-oauth2
2025-02-12 14:54:50,602 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-02-12 14:54:50,602 - bootstrapping - INFO - Configuring crontab
2025-02-12 14:54:50,602 - bootstrapping - INFO - Skipping creation of renice cron job due to lack of SYS_NICE capability
2025-02-12 14:54:50,602 - bootstrapping - INFO - Configuring certificate
2025-02-12 14:54:50,602 - bootstrapping - INFO - Generating ssl self-signed certificate
2025-02-12 14:54:50,774 - bootstrapping - INFO - Configuring bootstrap
2025-02-12 14:54:50,774 - bootstrapping - INFO - Configuring standby-cluster
2025-02-12 14:54:52,191 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
Container startup with the option:
2025-02-12 15:27:28,872 - bootstrapping - INFO - Figuring out my environment (Google? AWS? Openstack? Local?)
2025-02-12 15:27:30,874 - bootstrapping - INFO - Could not connect to 169.254.169.254, assuming local Docker setup
2025-02-12 15:27:30,875 - bootstrapping - INFO - No meta-data available for this provider
2025-02-12 15:27:30,875 - bootstrapping - INFO - Looks like you are running local
2025-02-12 15:27:30,897 - bootstrapping - INFO - Configuring patroni
2025-02-12 15:27:30,902 - bootstrapping - INFO - Writing to file /run/postgres.yml
2025-02-12 15:27:30,902 - bootstrapping - INFO - Configuring pam-oauth2
2025-02-12 15:27:30,902 - bootstrapping - INFO - Writing to file /etc/pam.d/postgresql
2025-02-12 15:27:30,902 - bootstrapping - INFO - Configuring bootstrap
2025-02-12 15:27:30,902 - bootstrapping - INFO - Configuring wal-e
2025-02-12 15:27:30,902 - bootstrapping - INFO - Configuring log
Traceback (most recent call last):
File "/scripts/configure_spilo.py", line 1197, in <module>
main()
File "/scripts/configure_spilo.py", line 1159, in main
write_log_environment(placeholders)
File "/scripts/configure_spilo.py", line 794, in write_log_environment
tags = json.loads(os.getenv('LOG_S3_TAGS'))
File "/usr/lib/python3.10/json/__init__.py", line 339, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not NoneType
2025-02-12 15:27:32,181 WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
So we see that the configuration script failed and terminated at this point, leaving the configuration incomplete, but the container did not terminate and Postgres was started anyway. It expected the certificates to exist, but they were never created, so Postgres failed to start:
2025-02-12 15:27:32 UTC [53]: [3-1] 67acbde4.35 0 FATAL: could not load server certificate file "/run/certs/server.crt": No such file or directory
Additional env variables in the failing container (all others are the same):
> LOG_BUCKET_SCOPE_PREFIX=
> LOG_BUCKET_SCOPE_SUFFIX=/b715f8ec-2584-41fd-892a-bda4cba3a5ff
LOG_ENV_DIR=/run/etc/log.d/env
> LOG_S3_BUCKET=ttt-db-backups-all
I am not sure whether this is a problem only in Spilo or in the operator too: Spilo shouldn't fail this way, but it is probably the operator that passes incomplete or wrong settings to the Spilo container.
Another issue: Spilo configures its sections in a seemingly random order, which is why some of my clusters start up successfully but then fail to restart. If the certificates are created before the log configuration step, Postgres starts; sometimes the log step breaks things too early and the container hangs indefinitely.
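To make the failure mode concrete, here is a minimal sketch (not Spilo's actual code, just what the traceback points at in /scripts/configure_spilo.py, write_log_environment()), plus a hypothetical defensive variant that would survive a missing LOG_S3_TAGS variable:

import json
import os

def parse_log_s3_tags_current():
    # What the traceback shows: os.getenv() returns None when LOG_S3_TAGS
    # is not set, and json.loads(None) raises TypeError, aborting the rest
    # of the configuration run (certificates, wal-e env, etc.).
    return json.loads(os.getenv('LOG_S3_TAGS'))

def parse_log_s3_tags_defensive():
    # Hypothetical fix: fall back to an empty JSON object so configuration
    # can continue even when only LOG_S3_BUCKET is provided.
    return json.loads(os.getenv('LOG_S3_TAGS') or '{}')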
Complete logs and steps to reproduce: https://gist.github.com/baznikin/5d4f5d78613d3f333bd0a34fbd070433
Same problem for me
Setting LOG_S3_TAGS to "{}" in the operator ConfigMap solves it.
Related to https://github.com/zalando/spilo/commit/fde34d4a8717c0f4cf09853cc8764567f0faa37f
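For reference, a minimal sketch of that workaround applied to the pod environment ConfigMap from the Terraform snippet above (the comment above says "operator configmap", so the exact target may differ in your setup; only the key name LOG_S3_TAGS and the value "{}" are confirmed here):

data:
  # ...existing keys...
  # Workaround: give spilo a valid (empty) JSON object so json.loads()
  # in configure_spilo.py receives a string instead of None
  LOG_S3_TAGS: "{}"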