PITR fails due to non-configurable timeout
Overview
When you do a point-in-time recovery, there is a step where the database must be started through pg_ctl. If I understand correctly, this happens so that all WAL files from after the backup, up to the selected point in time, can be applied. The problem is that pg_ctl has a 60-second timeout by default, and if applying those logs takes longer, the whole restore job fails.
The timeout can be set through a command-line argument to pg_ctl or through the PGCTLTIMEOUT environment variable (see the docs for details). Neither is possible through the operator, and the created job cannot be updated with either of them because the template part of a Kubernetes Job is immutable.
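For reference, this is how the timeout can be raised when running pg_ctl directly; the data directory path and the timeout value are placeholders:

```bash
# Raise pg_ctl's startup timeout from the default 60 seconds.
# /pgdata/pg14 and 600 are placeholder values.

# Via the -t/--timeout command-line argument:
pg_ctl start -D /pgdata/pg14 -t 600

# Via the PGCTLTIMEOUT environment variable:
PGCTLTIMEOUT=600 pg_ctl start -D /pgdata/pg14
```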
Additionally, such an error leaves the PV used for the restore in a bad state: there is a leftover postmaster.pid file from the server startup, so all retries of the job conclude that the server is already running and fail.
Environment
Please provide the following details:
- Platform: Kubernetes
- Platform Version: 1.21.6
- PGO Image Tag: ubi8-5.1.1-0
- Postgres Version: 11 (but it doesn't really matter; the logic is the same for all versions and comes from the operator)
- Storage: static local PV
Steps to Reproduce
REPRO
- Back up the database.
- Do lots of writes.
- Wait for archiving to finish.
- Perform a PITR to a point in time after the writes (see the sketch at the end of this section).
EXPECTED
PITR succeeds.
ACTUAL
PITR fails with a "server did not start in time" error reported by pg_ctl after applying only a subset of the necessary WAL files.
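For the last repro step, here is a minimal sketch of triggering an in-place PITR, adapted from the PGO v5 restore workflow; the cluster name (hippo), repo name, and target time are placeholders:

```bash
# Hypothetical example: enable the in-place restore section of the
# PostgresCluster spec with a point-in-time target...
kubectl patch postgrescluster hippo --type merge --patch '
spec:
  backups:
    pgbackrest:
      restore:
        enabled: true
        repoName: repo1
        options:
        - --type=time
        - --target="2022-07-01 10:00:00+00"
'

# ...then trigger it by changing the restore annotation.
kubectl annotate postgrescluster hippo --overwrite \
  postgres-operator.crunchydata.com/pgbackrest-restore="$(date +%s)"
```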
Additional Information
There is another place where somebody else reported the same issue.
There are multiple solutions of varying complexity, where less configurable also means easier to implement. Starting from the easiest:
- Hardcode a higher timeout in the command script in the RestoreCommand function in internal/pgbackrest/config.go (a rough sketch of this follows below the list).
- Allow customizing this timeout at the operator level.
- Allow customizing this timeout at the cluster level (would require a CRD update).
- Allow customizing the environment variables of the restore job - the best but most complex solution; if implemented for other resources too, it would fix #3054 and #2958.
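To illustrate the first option, here is a minimal sketch of what the change amounts to inside the generated restore script, assuming the script starts the throwaway server with pg_ctl; the timeout value and data directory are arbitrary placeholders:

```bash
# Hypothetical excerpt of the restore command script built by
# RestoreCommand. Exporting PGCTLTIMEOUT before the existing pg_ctl
# invocation raises the default 60s startup timeout.
export PGCTLTIMEOUT=86400  # placeholder: 24 hours

# ...existing script: start PostgreSQL so it replays WAL up to the
# recovery target (this is the step that currently times out).
pg_ctl start --wait -D /pgdata/pg14
```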
@Agalin thank you for the detailed issue submission!
We are now tracking a fix for this issue in our backlog, and are looking at providing a way to allow the PGCTLTIMEOUT setting to be configured in an upcoming PGO release.
+1 Thanks @Agalin for filing this issue. We are facing the exact same problem here. Thank you as well for already providing manual solutions! 💯
I believe it should be fixed by this commit - released in 5.2.0 I think.
I think that I am hitting the same issue when creating a standby from a backup. I am running registry.developers.crunchydata.com/crunchydata/postgres-operator:ubi8-5.0.5-0 and when creating the standby cluster I get the following logs from the pgbackrest restore job:
```
2022-10-11 13:55:50.600 GMT [73] LOG: restored log file "00000004000013CC00000063" from archive
2022-10-11 13:55:56.608 GMT [73] LOG: restored log file "00000004000013CC00000064" from archive
2022-10-11 13:56:01.548 GMT [73] LOG: restored log file "00000004000013CC00000065" from archive
2022-10-11 13:56:03.496 GMT [73] WARNING: xlog min recovery request 13CC/BF1C0CD0 is past current point 13CC/6598A5D0
2022-10-11 13:56:03.496 GMT [73] CONTEXT: writing block 1181 of relation base/16403/16548_fsm
WAL redo at 13CC/6598A578 for Btree/INSERT_LEAF: off 33; blkref #0: rel 1663/16403/109228623, blk 402900
2022-10-11 13:56:05.523 GMT [73] LOG: restored log file "00000004000013CC00000066" from archive
2022-10-11 13:56:08.823 GMT [73] WARNING: xlog min recovery request 13CC/BF1C0CD0 is past current point 13CC/669D3138
2022-10-11 13:56:08.823 GMT [73] CONTEXT: writing block 1181 of relation base/16403/16548_fsm
WAL redo at 13CC/669D30E0 for Btree/INSERT_LEAF: off 78; blkref #0: rel 1663/16403/109228623, blk 224018
2022-10-11 13:56:10.638 GMT [73] LOG: restored log file "00000004000013CC00000067" from archive
2022-10-11 13:56:14.326 GMT [73] LOG: restored log file "00000004000013CC00000068" from archive
2022-10-11 13:56:16.063 GMT [73] WARNING: xlog min recovery request 13CC/BF1C0CD0 is past current point 13CC/683E7588
2022-10-11 13:56:16.063 GMT [73] CONTEXT: writing block 1181 of relation base/16403/16548_fsm
WAL redo at 13CC/683E7530 for Btree/INSERT_LEAF: off 34; blkref #0: rel 1663/16403/109228623, blk 235859
pg_ctl: server did not start in time
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
NOTE: process will continue without log file.
ERROR: [038]: unable to restore while PostgreSQL is running
HINT: presence of 'postmaster.pid' in '/pgdata/pg14' indicates PostgreSQL is running.
HINT: remove 'postmaster.pid' only if PostgreSQL is not running.
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
NOTE: process will continue without log file.
ERROR: [038]: unable to restore while PostgreSQL is running
HINT: presence of 'postmaster.pid' in '/pgdata/pg14' indicates PostgreSQL is running.
HINT: remove 'postmaster.pid' only if PostgreSQL is not running.
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
NOTE: process will continue without log file.
ERROR: [038]: unable to restore while PostgreSQL is running
HINT: presence of 'postmaster.pid' in '/pgdata/pg14' indicates PostgreSQL is running.
HINT: remove 'postmaster.pid' only if PostgreSQL is not running.
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
NOTE: process will continue without log file.
ERROR: [038]: unable to restore while PostgreSQL is running
HINT: presence of 'postmaster.pid' in '/pgdata/pg14' indicates PostgreSQL is running.
HINT: remove 'postmaster.pid' only if PostgreSQL is not running.
WARN: unable to open log file '/pgdata/pgbackrest/log/db-restore.log': No such file or directory
NOTE: process will continue without log file.
```
Could you confirm that it's indeed the same problem? Should I update to 5.2?
Looks the same and should be fixed in 5.2, although I haven't had time to upgrade yet.
Updating to 5.2 seems like a big change. I wonder if I could try to copy the job definition and run the command manually with a pod that I create myself.
You can (that's exactly what I've done), but you also need to remove the postmaster.pid file first; a sketch follows below.
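A minimal sketch of that manual workaround, assuming the restore PV is mounted in your pod and the data directory is /pgdata/pg14 (paths and the timeout value are placeholders):

```bash
# 1. Confirm the server is NOT running; pg_ctl status exits non-zero
#    when no postmaster is running in the data directory.
pg_ctl status -D /pgdata/pg14

# 2. Only then remove the stale pid file left by the failed startup.
rm /pgdata/pg14/postmaster.pid

# 3. Re-run the restore command copied from the failed job with a
#    higher timeout, e.g.:
# PGCTLTIMEOUT=86400 <restore command copied from the job definition>
```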
How did you manage to get the cluster creation to complete? I copied the pod definition from the restore job and ran it manually with an increased timeout. Unfortunately, the operator is not aware of it and doesn't create the StatefulSet.
Unfortunately I don't remember the details.
No worries
> I believe it should be fixed by this commit - released in 5.2.0 I think.
Since 5.2.0 is now available as a Helm install, I could update the operator and can confirm that the timeout issue is fixed 👍🏻
I'm going to close this since (as posters have pointed out) there's a fix in more recent PGO versions and it has been tested. (Thanks for confirming that it works, @bauerjs1!)