Improve performance of garbage collection job

Open bjester opened this issue 1 year ago • 0 comments

Background

We run the garbage_collect management command everyday via a Kubernetes cron job.

Observed behavior

The management command gets stuck running the following query. It was stuck running this for at least 12 hours in production, without any indication that it would complete.

UPDATE "contentcuration_contentnode" 
SET "parent_id" = '00000000000000000000000000000000' 
WHERE (
  SELECT U0."id" 
  FROM "contentcuration_channel" U0 
  INNER JOIN "contentcuration_contentnode" U1 ON (U0."main_tree_id" = U1."id") 
  WHERE U1."tree_id" = "contentcuration_contentnode"."tree_id" LIMIT 1
) IN (
  '<REDACTED>', 
  '<REDACTED>', 
  '<REDACTED>', 
  '<REDACTED>'
)

Expected behavior

The management command and the queries it produces are optimized for our large Studio database. It should avoid queries that would require the subquery for each node in the result set.

User-facing consequences

Since the queries should occur within a transaction, a very-long running transaction modifying the content node table can cause issues and timeouts across studio, leading to indirect Sentry errors.

Additionally, I had to manually kill the query in Cloud SQL.

Steps to reproduce the issue

Restored production database to hotfixes environment
Manually started the job
Observed running queries

Oct 08 '24 18:10 bjester