temporal icon indicating copy to clipboard operation
temporal copied to clipboard

PostgreSQL schema update fails on busy Temporal instances

Open yurkeen opened this issue 1 year ago • 0 comments

We run schema update jobs during our deployment process. The command looks like the following

command: ["temporal-sql-tool", "--plugin", "postgres", "update-schema", "-d", "{{ $storeConfig.schemaPath }}"]

Expected Behavior

In the testing/staging environment where our Temporal instances don't have much workload and where the persistency and visibility databases are both relatively small, migrations run without any issues with the desired effect:

2024-07-11T13:38:03.670Z	INFO	UpdateSchemaTask started	{"config": {"DBName":"","TargetVersion":"","SchemaDir":"/etc/temporal/schema/postgresql/v12/visibility/versioned","SchemaName":"","IsDryRun":false}, "logging-call-at": "updatetask.go:104"}
2024-07-11T13:38:03.675Z	DEBUG	Schema Dirs: [v1.6]	{"logging-call-at": "updatetask.go:212"}
2024-07-11T13:38:03.675Z	INFO	Processing schema file: v1.6/fix_root_workflow_info.sql	{"logging-call-at": "updatetask.go:256"}
2024-07-11T13:38:03.675Z	DEBUG	running 1 updates for current version 1.5	{"logging-call-at": "updatetask.go:138"}
2024-07-11T13:38:03.675Z	DEBUG	---- Executing updates for version 1.6 ----	{"logging-call-at": "updatetask.go:157"}
2024-07-11T13:38:03.675Z	DEBUG	DROP INDEX by_root_workflow_id;	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.068Z	DEBUG	DROP INDEX by_root_run_id;	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.794Z	DEBUG	ALTER TABLE executions_visibility DROP COLUMN root_workflow_id;	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.798Z	DEBUG	ALTER TABLE executions_visibility DROP COLUMN root_run_id;	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.799Z	DEBUG	ALTER TABLE executions_visibility ADD COLUMN root_workflow_id VARCHAR(255) NOT NULL DEFAULT '';	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.802Z	DEBUG	ALTER TABLE executions_visibility ADD COLUMN root_run_id VARCHAR(255) NOT NULL DEFAULT '';	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.804Z	DEBUG	CREATE INDEX by_root_workflow_id ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);	{"logging-call-at": "updatetask.go:159"}
...
2024-07-11T13:38:49.703Z	INFO	UpdateSchemaTask done	{"logging-call-at": "updatetask.go:127"}

Actual Behavior

For our production instances, however, which are quite active and where databases are in constant use (~1TB for persistence and ~120 Gb visibility), index creation fails constantly. Part of the equation is that we have deadlock detection mechanism and our PostgreSQL instances would terminate certain blocking queries running over a configured time limit. This is often the case when manipulating indices referencing busy tables:

2024-07-11T13:38:03.670Z	INFO	UpdateSchemaTask started	{"config": {"DBName":"","TargetVersion":"","SchemaDir":"/etc/temporal/schema/postgresql/v12/visibility/versioned","SchemaName":"","IsDryRun":false}, "logging-call-at": "updatetask.go:104"}
2024-07-11T13:38:03.675Z	DEBUG	Schema Dirs: [v1.6]	{"logging-call-at": "updatetask.go:212"}
2024-07-11T13:38:03.675Z	INFO	Processing schema file: v1.6/fix_root_workflow_info.sql	{"logging-call-at": "updatetask.go:256"}
2024-07-11T13:38:03.675Z	DEBUG	running 1 updates for current version 1.5	{"logging-call-at": "updatetask.go:138"}
2024-07-11T13:38:03.675Z	DEBUG	---- Executing updates for version 1.6 ----	{"logging-call-at": "updatetask.go:157"}
2024-07-11T13:38:03.675Z	DEBUG	DROP INDEX by_root_workflow_id;	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.068Z	DEBUG	DROP INDEX by_root_run_id;	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.794Z	DEBUG	ALTER TABLE executions_visibility DROP COLUMN root_workflow_id;	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.798Z	DEBUG	ALTER TABLE executions_visibility DROP COLUMN root_run_id;	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.799Z	DEBUG	ALTER TABLE executions_visibility ADD COLUMN root_workflow_id VARCHAR(255) NOT NULL DEFAULT '';	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.802Z	DEBUG	ALTER TABLE executions_visibility ADD COLUMN root_run_id VARCHAR(255) NOT NULL DEFAULT '';	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:04.804Z	DEBUG	CREATE INDEX by_root_workflow_id ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);	{"logging-call-at": "updatetask.go:159"}
2024-07-11T13:38:41.793Z	ERROR	Unable to update SQL schema.	{"error": "error executing statement: driver: bad connection", "logging-call-at": "handler.go:78"}

The error executing statement: driver: bad connection error is just an indication that query killer was engaged to terminate the query which blocks longer than the allowed limit.

Specifications

  • Version: 1.24.2 (and earlier)
  • Platform: N/A

Proposed Solution

The proposed solution for the issue should be creating the indices concurrently, so the CREATE INDEX requests are not blocking. For example schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql would have the following changes:

--- a/schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql
+++ b/schema/postgresql/v12/visibility/versioned/v1.6/fix_root_workflow_info.sql
ALTER TABLE executions_visibility ADD COLUMN root_workflow_id VARCHAR(255) NOT NULL DEFAULT '';
ALTER TABLE executions_visibility ADD COLUMN root_run_id      VARCHAR(255) NOT NULL DEFAULT '';
- CREATE INDEX by_root_workflow_id  ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
- CREATE INDEX by_root_run_id       ON executions_visibility (namespace_id, root_run_id,      (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
+ CREATE INDEX CONCURRENTLY by_root_workflow_id  ON executions_visibility (namespace_id, root_workflow_id, (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);
+ CREATE INDEX CONCURRENTLY by_root_run_id       ON executions_visibility (namespace_id, root_run_id,      (COALESCE(close_time, '9999-12-31 23:59:59')) DESC, start_time DESC, run_id);

yurkeen avatar Jul 11 '24 15:07 yurkeen