PITR fails due to non-configurable timeout
Overview
When you do a point-in-time recovery, there is a step where the database must be started through pg_ctl. If I understand correctly, this happens so that all WAL files from after the backup, up to the selected point in time, can be applied. The problem is that pg_ctl has a 60-second timeout by default, and if applying those logs takes longer, the whole restore job fails.
The timeout can be set through a command-line argument to pg_ctl or through the PGCTLTIMEOUT environment variable (see the docs for details). Neither is possible through the operator, and the created job cannot be updated with either of them because the template part of a Kubernetes Job is immutable.
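For reference, this is how the timeout can be raised when running pg_ctl directly; the data directory path and the timeout value are placeholders:

```bash
# Raise pg_ctl's startup timeout from the default 60 seconds.
# /pgdata/pg14 and 600 are placeholder values.

# Via the -t/--timeout command-line argument:
pg_ctl start -D /pgdata/pg14 -t 600

# Via the PGCTLTIMEOUT environment variable:
PGCTLTIMEOUT=600 pg_ctl start -D /pgdata/pg14
```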
Additionally, such an error leaves the PV used for the restore in a bad state: there is a leftover postmaster.pid file from the server startup, so all retries of the job conclude that the server is already running and fail.
Environment
Please provide the following details:
- Platform: Kubernetes
- Platform Version: 1.21.6
- PGO Image Tag: ubi8-5.1.1-0
- Postgres Version: 11 (but it doesn't really matter; the logic is the same for all versions and comes from the operator)
- Storage: static local PV
Steps to Reproduce
REPRO
- Back up the database.
- Do lots of writes.
- Wait for archiving to finish.
- Perform a PITR to a point in time after the writes (see the sketch at the end of this section).
EXPECTED
PITR succeeds.
ACTUAL
PITR fails with a "server did not start in time" error reported by pg_ctl after applying only a subset of the necessary WAL files.
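For the last repro step, here is a minimal sketch of triggering an in-place PITR, adapted from the PGO v5 restore workflow; the cluster name (hippo), repo name, and target time are placeholders:

```bash
# Hypothetical example: enable the in-place restore section of the
# PostgresCluster spec with a point-in-time target...
kubectl patch postgrescluster hippo --type merge --patch '
spec:
  backups:
    pgbackrest:
      restore:
        enabled: true
        repoName: repo1
        options:
        - --type=time
        - --target="2022-07-01 10:00:00+00"
'

# ...then trigger it by changing the restore annotation.
kubectl annotate postgrescluster hippo --overwrite \
  postgres-operator.crunchydata.com/pgbackrest-restore="$(date +%s)"
```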
Additional Information
There is another place where somebody else reported the same issue.
There are multiple solutions of varying complexity, where less configurable also means easier to implement. Starting from the easiest:
- Hardcode a higher timeout in the command script in the RestoreCommand function in internal/pgbackrest/config.go (a rough sketch of this follows below the list).
- Allow customizing this timeout at the operator level.
- Allow customizing this timeout at the cluster level (would require a CRD update).
- Allow customizing the environment variables of the restore job - the best but most complex solution; if implemented for other resources too, it would fix #3054 and #2958.
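To illustrate the first option, here is a minimal sketch of what the change amounts to inside the generated restore script, assuming the script starts the throwaway server with pg_ctl; the timeout value and data directory are arbitrary placeholders:

```bash
# Hypothetical excerpt of the restore command script built by
# RestoreCommand. Exporting PGCTLTIMEOUT before the existing pg_ctl
# invocation raises the default 60s startup timeout.
export PGCTLTIMEOUT=86400  # placeholder: 24 hours

# ...existing script: start PostgreSQL so it replays WAL up to the
# recovery target (this is the step that currently times out).
pg_ctl start --wait -D /pgdata/pg14
```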
@Agalin thank you for the detailed issue submission!
We are now tracking a fix for this issue in our backlog, and are looking at providing a way to allow the PGCTLTIMEOUT setting to be configured in an upcoming PGO release.
+1 Thanks @Agalin for filing this issue. We are facing the exact same problem here. Thank you as well for already providing manual solutions! 💯
I believe it should be fixed by this commit - released in 5.2.0 I think.
I think that I am hitting the same issue when creating a standby from a backup. I am running registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.0.5-0 and when creating the standby cluster I get the following logs from the pgbackrest restore job:
```
2022-10-11 13:55:50.600 GMT [73] LOG: restored log file "00000004000013CC00000063" from archive
2022-10-11 13:55:56.608 GMT [73] LOG: restored log file "00000004000013CC00000064" from archive
2022-10-11 13:56:01.548 GMT [73] LOG: restored log file "00000004000013CC00000065" from archive
2022-10-11 13:56:03.496 GMT [73] WARNING: xlog min recovery request 13CC/BF1C0CD0 is past current point 13CC/6598A5D0
2022-10-11 13:56:03.496 GMT [73] CONTEXT: writing block 1181 of relation base/16403/16548_fsm
WAL redo at 13CC/6598A578 for Btree/INSERT_LEAF: off 33; blkref #0: rel 1663/16403/109228623, blk 402900
2022-10-11 13:56:05.523 GMT [73] LOG: restored log file "00000004000013CC00000066" from archive
2022-10-11 13:56:08.823 GMT [73] WARNING: xlog min recovery request 13CC/BF1C0CD0 is past current point 13CC/669D3138
2022-10-11 13:56:08.823 GMT [73] CONTEXT: writing block 1181 of relation base/16403/16548_fsm
WAL redo at 13CC/669D30E0 for Btree/INSERT_LEAF: off 78; blkref #0: rel 1663/16403/109228623, blk 224018
2022-10-11 13:56:10.638 GMT [73] LOG: restored log file "00000004000013CC00000067" from archive
2022-10-11 13:56:14.326 GMT [73] LOG: restored log file "00000004000013CC00000068" from archive
2022-10-11 13:56:16.063 GMT [73] WARNING: xlog min recovery request 13CC/BF1C0CD0 is past current point 13CC/683E7588
2022-10-11 13:56:16.063 GMT [73] CONTEXT: writing block 1181 of relation base/16403/16548_fsm
WAL redo at 13CC/683E7530 for Btree/INSERT_LEAF: off 34; blkref #0: rel 1663/16403/109228623, blk 235859
pg_ctl: server did not start in time
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
NOTE: process will continue without log file.
ERROR: [038]: unable to restore while PostgreSQL is running
HINT: presence of 'postmaster.pid' in '/pgdata/pg14' indicates PostgreSQL is running.
HINT: remove 'postmaster.pid' only if PostgreSQL is not running.
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
NOTE: process will continue without log file.
ERROR: [038]: unable to restore while PostgreSQL is running
HINT: presence of 'postmaster.pid' in '/pgdata/pg14' indicates PostgreSQL is running.
HINT: remove 'postmaster.pid' only if PostgreSQL is not running.
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
NOTE: process will continue without log file.
ERROR: [038]: unable to restore while PostgreSQL is running
HINT: presence of 'postmaster.pid' in '/pgdata/pg14' indicates PostgreSQL is running.
HINT: remove 'postmaster.pid' only if PostgreSQL is not running.
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
NOTE: process will continue without log file.
ERROR: [038]: unable to restore while PostgreSQL is running
HINT: presence of 'postmaster.pid' in '/pgdata/pg14' indicates PostgreSQL is running.
HINT: remove 'postmaster.pid' only if PostgreSQL is not running.
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
NOTE: process will continue without log file.
```
Could you confirm that it's indeed the same problem? Should I update to 5.2?
Looks the same and should be fixed in 5.2, although I haven't had time to upgrade yet.
Updating to 5.2 seems like a big change. I wonder if I could try to copy the job definition and run the command manually with a pod that I create myself.
You can (that's exactly what I've done), but you also need to remove the postmaster.pid file first; a sketch follows below.
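A minimal sketch of that manual workaround, assuming the restore PV is mounted in your pod and the data directory is /pgdata/pg14 (paths and the timeout value are placeholders):

```bash
# 1. Confirm the server is NOT running; pg_ctl status exits non-zero
#    when no postmaster is running in the data directory.
pg_ctl status -D /pgdata/pg14

# 2. Only then remove the stale pid file left by the failed startup.
rm /pgdata/pg14/postmaster.pid

# 3. Re-run the restore command copied from the failed job with a
#    higher timeout, e.g.:
# PGCTLTIMEOUT=86400 <restore command copied from the job definition>
```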
How did you manage to get the cluster creation to complete? I copied the pod definition from the restore job and ran it manually with an increased timeout. Unfortunately, the operator is not aware of it and doesn't create the StatefulSet.
Unfortunately I don't remember the details.
No worries
> I believe it should be fixed by this commit - released in 5.2.0 I think.
Since 5.2.0 is now available as a Helm install, I could update the operator and can confirm that the timeout issue is fixed 👍🏻
I'm going to close this since (as posters have pointed out) there's a fix in more recent PGO versions and it has been tested. (Thanks for confirming that it works, @bauerjs1!)