[18.0][IMP] queue_job: requeue orphaned jobs
This pr add a secondary attempt to requeue "orphaned jobs" where it stucks in enqueued, not yet make it to queue_job_lock and selecting jobs after db initialization also misses it. Even a server restart would not unblock the job.
This is most likely due to server being shutdown or crash during job enqueue. Manually requeue the job on UI make everything running again.
Hi @guewen, some modules you are maintaining are being modified, check this out!
Good catch! I kind of remember we had thought about this case when doing the original implementation with @AnizR but maybe it got lost in the various iterations we did.
In deed, we talked about it but I think that it got lost.
Thanks for this patch, I'll review it within the week.
@sbidoul are we good to merge?
Re-reading this I'm slightly annoyed that it reintroduces a complexity that we had simplified before.
I wonder if it would work to add a OR NOT EXISTS (SELECT 1 FROM queue_job_lock...) in _query_requeue_dead_jobs.
@sbidoul I've adapted the query to your suggestions and tests are passing!
This PR has the approved label and has been created more than 5 days ago. It should therefore be ready to merge by a maintainer (or a PSC member if the concerned addon has no declared maintainer). 🤖
Execution plan looks like this (with the missing where clause added), with one job started with no lock held, and one enqueued with no lock record. So looks good to me:
Update on queue_job (cost=0.43..7.30 rows=1 width=106) (actual time=0.146..0.711 rows=2 loops=1)
-> Index Scan using queue_job_state_index on queue_job (cost=0.43..7.30 rows=1 width=106) (actual time=0.046..0.095 rows=2 loops=1)
Index Cond: ((state)::text = ANY ('{enqueued,started}'::text[]))
Filter: ((date_enqueued < ((now() AT TIME ZONE 'utc'::text) - '00:00:10'::interval)) AND ((SubPlan 1) OR (NOT (SubPlan 2))))
SubPlan 1
-> LockRows (cost=0.43..2.66 rows=1 width=10) (actual time=0.010..0.010 rows=0 loops=2)
-> Index Scan using queue_job_lock_queue_job_id_index on queue_job_lock (cost=0.43..2.65 rows=1 width=10) (actual time=0.008..0.008 rows=0 loops=2)
Index Cond: (queue_job_id = queue_job.id)
SubPlan 2
-> Index Only Scan using queue_job_lock_queue_job_id_index on queue_job_lock queue_job_lock_1 (cost=0.43..2.65 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=1)
Index Cond: (queue_job_id = queue_job.id)
Heap Fetches: 0
Planning Time: 0.247 ms
Trigger queue_job_notify: time=0.083 calls=2
Execution Time: 0.855 ms
Ah no, there is a misunderstanding. What you did was good, but the missing where clause was WHERE queue_job_lock.queue_job_id = queue_job.id in the first id IN .
So like this:
WHERE
state IN ('enqueued','started')
AND date_enqueued < (now() AT TIME ZONE 'utc' - INTERVAL '10 sec')
AND (
id in (
SELECT
queue_job_id
FROM
queue_job_lock
WHERE
queue_job_lock.queue_job_id = queue_job.id
FOR UPDATE SKIP LOCKED
)
OR NOT EXISTS (
SELECT
1
FROM
queue_job_lock
WHERE
queue_job_lock.queue_job_id = queue_job.id
)
)
RETURNING uuid
@sbidoul I've reverted the changes and add the WHERE clause
/ocabot merge patch
On my way to merge this fine PR! Prepared branch 18.0-ocabot-merge-pr-853-by-sbidoul-bump-patch, awaiting test results.
Thanks! This version reads even better than before.
Congratulations, your PR was merged at d52a4332e2741c9989a36758d706cf5124936668. Thanks a lot for contributing to OCA. ❤️