Error during VisibilityDeleteExecution
We are getting an extreme number of error logs from the Temporal server:
{"level":"error","ts":"2024-12-15T16:21:36.296Z","msg":"Operation failed with an error.","error":"context deadline exceeded","logging-call-at":"visiblity_manager_metrics.go:264","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).updateErrorMetric\n\t/home/builder/temporal/common/persistence/visibility/visiblity_manager_metrics.go:264\ngo.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).DeleteWorkflowExecution\n\t/home/builder/temporal/common/persistence/visibility/visiblity_manager_metrics.go:128\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).processDeleteExecution\n\t/home/builder/temporal/service/history/visibility_queue_task_executor.go:494\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).Execute\n\t/home/builder/temporal/service/history/visibility_queue_task_executor.go:122\ngo.temporal.io/server/service/history/queues.(*executableImpl).Execute\n\t/home/builder/temporal/service/history/queues/executable.go:236\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:223\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:119\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:145\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:120\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:233\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:211"}
{"level":"error","ts":"2024-12-15T16:21:36.304Z","msg":"Fail to process task","shard-id":1,"address":"127.0.0.1:7234","component":"visibility-queue-processor","wf-namespace-id":"064f58ee-d88c-4c7c-8b81-77b93c315829","wf-id":"*","wf-run-id":"f4dd4001-fdbd-44d7-aaf1-9c401226e546","queue-task-id":23085605,"queue-task-visibility-timestamp":"2024-12-14T13:07:44.404Z","queue-task-type":"VisibilityDeleteExecution","queue-task":{"NamespaceID":"064f58ee-d88c-4c7c-8b81-77b93c315829","WorkflowID":"*","RunID":"f4dd4001-fdbd-44d7-aaf1-9c401226e546","VisibilityTimestamp":"2024-12-14T13:07:44.404345212Z","TaskID":23085605,"Version":0,"CloseExecutionVisibilityTaskID":9663191,"StartTime":null,"CloseTime":null},"wf-history-event-id":0,"error":"context deadline exceeded","lifecycle":"ProcessingFailed","logging-call-at":"lazy_logger.go:68","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/builder/temporal/common/log/zap_logger.go:156\ngo.temporal.io/server/common/log.(*lazyLogger).Error\n\t/home/builder/temporal/common/log/lazy_logger.go:68\ngo.temporal.io/server/service/history/queues.(*executableImpl).HandleErr\n\t/home/builder/temporal/service/history/queues/executable.go:347\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask.func1\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:224\ngo.temporal.io/server/common/backoff.ThrottleRetry.func1\n\t/home/builder/temporal/common/backoff/retry.go:119\ngo.temporal.io/server/common/backoff.ThrottleRetryContext\n\t/home/builder/temporal/common/backoff/retry.go:145\ngo.temporal.io/server/common/backoff.ThrottleRetry\n\t/home/builder/temporal/common/backoff/retry.go:120\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:233\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask\n\t/home/builder/temporal/common/tasks/fifo_scheduler.go:211"}
Any idea how to investigate and/or recover from this?
Expected Behavior
No errors; visibility records updated correctly and removed after the retention period.
Actual Behavior
We are getting an extreme number of errors, and we can see past executions listed in the Temporal UI well after the retention period. Workflows seem to be running and finishing normally; we can see them in the Temporal UI.
Steps to Reproduce the Problem
Not sure. We did nothing special; it was working fine. We changed the MySQL password, the Temporal service ran into an access-denied error and restarted, and these logs have been flooding ever since.
Specifications
- Version: 1.22.4
After upgrading to the latest version the issue is not fixed, but we now get a new error:
{"level":"error","ts":"2024-12-16T20:48:10.526Z","msg":"Operation failed with an error.","error":"unable to delete custom search attributes: context deadline exceeded","logging-call-at":"/home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:195","stacktrace":"go.temporal.io/server/common/log.(*zapLogger).Error\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/log/zap_logger.go:155\ngo.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).updateErrorMetric\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:195\ngo.temporal.io/server/common/persistence/visibility.(*visibilityManagerMetrics).DeleteWorkflowExecution\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/persistence/visibility/visiblity_manager_metrics.go:129\ngo.temporal.io/server/service/history.(*visibilityQueueTaskExecutor).processDeleteExecution\n\t/home/runner/work/docke^Coff/retry.go:64\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).executeTask\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:233\ngo.temporal.io/server/common/tasks.(*FIFOScheduler[...]).processTask\n\t/home/runner/work/docker-builds/docker-builds/temporal/common/tasks/fifo_scheduler.go:211"}
The number of logs emitted is considerably lower, but there are 170k rows in the visibility tasks table and 64k in executions_visibility (the retention period is one day, so this is far more than we should have).
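As a rough way to confirm that rows are outliving retention, the closed executions older than the retention window can be counted directly in the visibility database. The sketch below builds such a query; the table and column names (`executions_visibility`, `close_time`) are assumptions based on Temporal's default MySQL visibility schema and should be verified against the deployed schema before running anything.

```python
# Sketch: build a COUNT query for closed executions older than the retention
# window. "executions_visibility" and "close_time" are assumed names from
# Temporal's default MySQL visibility schema -- verify against your schema.
def stale_visibility_count_query(retention_days: int) -> str:
    return (
        "SELECT COUNT(*) FROM executions_visibility "
        "WHERE close_time IS NOT NULL "
        f"AND close_time < NOW() - INTERVAL {int(retention_days)} DAY"
    )

# Example: retention of one day, as in this deployment.
print(stale_visibility_count_query(1))
```

With a one-day retention, a count anywhere near the 64k rows reported above would confirm that deletion is not keeping up.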
I have the same issue as well.
Server version: 1.25.1
I am also facing a similar issue.
Workflow records remain in the executions_visibility table even after the retention period configured for the namespace, which is degrading database performance and impacting all Temporal functionality.
Is there any way to cleanly delete the records in the visibility store that have passed the retention period?
Temporal Server Version: 1.26.2
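Purely as an illustration of what a manual cleanup of rows past retention might look like, here is a sketch of a batched DELETE against the default MySQL visibility schema. The table and column names are assumptions (same caveat as above), and deleting rows by hand bypasses Temporal's own visibility-task deletion path, so this should only be considered after taking a backup and testing against a copy of the database.

```python
# Sketch of a batched manual cleanup statement for rows past retention.
# "executions_visibility" and "close_time" are assumed names from the default
# MySQL visibility schema. Deleting in small batches (MySQL supports LIMIT on
# DELETE) avoids long locks on a large table; rerun until zero rows affected.
def stale_visibility_delete_query(retention_days: int, batch_size: int = 1000) -> str:
    return (
        "DELETE FROM executions_visibility "
        "WHERE close_time IS NOT NULL "
        f"AND close_time < NOW() - INTERVAL {int(retention_days)} DAY "
        f"LIMIT {int(batch_size)}"
    )

print(stale_visibility_delete_query(1))
```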