PostgreSQL schema update fails on busy Temporal instances
We run schema update jobs during our deployment process. The command looks like the following
command: ["temporal-sql-tool", "--plugin", "postgres", "update-schema", "-d", "{{ $storeConfig.schemaPath }}"]
Expected Behavior
In the testing/staging environment where our Temporal instances don't have much workload and where the persistency and visibility databases are both relatively small, migrations run without any issues with the desired effect:
2024-07-11T13:38:03.670Z INFO UpdateSchemaTask started {"config": {"DBName":"","TargetVersion":"","SchemaDir":"/etc/temporal/schema/postgresql/v12/visibility/versioned","SchemaName":"","IsDryRun":false}, "logging-call-at": "updatetask.go:104"}
2024-07-11T13:38:03.675Z DEBUG Schema Dirs: [v1.6] {"logging-call-at": "updatetask.go:212"}
2024-07-11T13:38:03.675Z INFO Processing schema file: v1.6/fix_root_workflow_info.sql {"logging-call-at": "updatetask.go:256"}
2024-07-11T13:38:03.675Z DEBUG running 1 updates for current version 1.5 {"logging-call-at": "updatetask.go:138"}
2024-07-11T13:38:03.675Z DEBUG ---- Executing updates for version 1.6 ---- {"logging-call-at": "updatetask.go:157"}
2024-07-11T13:38:03.675Z DEBUG DROP INDEX by_root_workflow_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.068Z DEBUG DROP INDEX by_root_run_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.794Z DEBUG ALTER TABLE executions_visibility DROP COLUMN root_workflow_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.798Z DEBUG ALTER TABLE executions_visibility DROP COLUMN root_run_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.799Z DEBUG ALTER TABLE executions_visibility ADD COLUMN root_workflow_id VARCHAR(255) NOT NULL DEFAULT ''; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.802Z DEBUG ALTER TABLE executions_visibility ADD COLUMN root_run_id VARCHAR(255) NOT NULL DEFAULT ''; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.804Z DEBUG CREATE INDEX by_root_workflow_id ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id); {"logging-call-at": "updatetask.go:159"}
...
2024-07-11T13:38:49.703Z INFO UpdateSchemaTask done {"logging-call-at": "updatetask.go:127"}
Actual Behavior
For our production instances, however, which are quite active and where databases are in constant use (~1TB for persistence and ~120 Gb visibility), index creation fails constantly. Part of the equation is that we have deadlock detection mechanism and our PostgreSQL instances would terminate certain blocking queries running over a configured time limit. This is often the case when manipulating indices referencing busy tables:
2024-07-11T13:38:03.670Z INFO UpdateSchemaTask started {"config": {"DBName":"","TargetVersion":"","SchemaDir":"/etc/temporal/schema/postgresql/v12/visibility/versioned","SchemaName":"","IsDryRun":false}, "logging-call-at": "updatetask.go:104"}
2024-07-11T13:38:03.675Z DEBUG Schema Dirs: [v1.6] {"logging-call-at": "updatetask.go:212"}
2024-07-11T13:38:03.675Z INFO Processing schema file: v1.6/fix_root_workflow_info.sql {"logging-call-at": "updatetask.go:256"}
2024-07-11T13:38:03.675Z DEBUG running 1 updates for current version 1.5 {"logging-call-at": "updatetask.go:138"}
2024-07-11T13:38:03.675Z DEBUG ---- Executing updates for version 1.6 ---- {"logging-call-at": "updatetask.go:157"}
2024-07-11T13:38:03.675Z DEBUG DROP INDEX by_root_workflow_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.068Z DEBUG DROP INDEX by_root_run_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.794Z DEBUG ALTER TABLE executions_visibility DROP COLUMN root_workflow_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.798Z DEBUG ALTER TABLE executions_visibility DROP COLUMN root_run_id; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.799Z DEBUG ALTER TABLE executions_visibility ADD COLUMN root_workflow_id VARCHAR(255) NOT NULL DEFAULT ''; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.802Z DEBUG ALTER TABLE executions_visibility ADD COLUMN root_run_id VARCHAR(255) NOT NULL DEFAULT ''; {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.804Z DEBUG CREATE INDEX by_root_workflow_id ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id); {"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:41.793Z ERROR Unable to update SQL schema. {"error": "error executing statement: driver: bad connection", "logging-call-at": "handler.go:78"}
The error executing statement: driver: bad connection error is just an indication that query killer was engaged to terminate the query which blocks longer than the allowed limit.
Specifications
- Version: 1.24.2 (and earlier)
- Platform: N/A
Proposed Solution
The proposed solution for the issue should be creating the indices concurrently, so the CREATE INDEX requests are not blocking. For example schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql would have the following changes:
--- a/schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql
+++ b/schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql
ALTER TABLE executions_visibility ADD COLUMN root_workflow_id VARCHAR(255) NOT NULL DEFAULT '';
ALTER TABLE executions_visibility ADD COLUMN root_run_id VARCHAR(255) NOT NULL DEFAULT '';
- CREATE INDEX by_root_workflow_id ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
- CREATE INDEX by_root_run_id ON executions_visibility (namespace_id, root_run_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
+ CREATE INDEX CONCURRENTLY by_root_workflow_id ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
+ CREATE INDEX CONCURRENTLY by_root_run_id ON executions_visibility (namespace_id, root_run_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);