studio
studio copied to clipboard
Improve performance of garbage collection job
Background
We run the garbage_collect management command everyday via a Kubernetes cron job.
Observed behavior
The management command gets stuck running the following query. It was stuck running this for at least 12 hours in production, without any indication that it would complete.
UPDATE "contentcuration_contentnode"
SET "parent_id" = '00000000000000000000000000000000'
WHERE (
SELECT U0."id"
FROM "contentcuration_channel" U0
INNER JOIN "contentcuration_contentnode" U1 ON (U0."main_tree_id" = U1."id")
WHERE U1."tree_id" = "contentcuration_contentnode"."tree_id" LIMIT 1
) IN (
'<REDACTED>',
'<REDACTED>',
'<REDACTED>',
'<REDACTED>'
)
Expected behavior
The management command and the queries it produces are optimized for our large Studio database. It should avoid queries that would require the subquery for each node in the result set.
User-facing consequences
Since the queries should occur within a transaction, a very-long running transaction modifying the content node table can cause issues and timeouts across studio, leading to indirect Sentry errors.
Additionally, I had to manually kill the query in Cloud SQL.
Steps to reproduce the issue
- Restored production database to
hotfixesenvironment - Manually started the job
- Observed running queries