queue icon indicating copy to clipboard operation
queue copied to clipboard

[18.0][IMP] queue_job: requeue orphaned jobs

Open hoangtrann opened this issue 2 months ago • 3 comments

This pr add a secondary attempt to requeue "orphaned jobs" where it stucks in enqueued, not yet make it to queue_job_lock and selecting jobs after db initialization also misses it. Even a server restart would not unblock the job.

This is most likely due to server being shutdown or crash during job enqueue. Manually requeue the job on UI make everything running again.

hoangtrann avatar Nov 21 '25 23:11 hoangtrann

Hi @guewen, some modules you are maintaining are being modified, check this out!

OCA-git-bot avatar Nov 21 '25 23:11 OCA-git-bot

Good catch! I kind of remember we had thought about this case when doing the original implementation with @AnizR but maybe it got lost in the various iterations we did.

In deed, we talked about it but I think that it got lost.

Thanks for this patch, I'll review it within the week.

AnizR avatar Nov 26 '25 06:11 AnizR

@sbidoul are we good to merge?

hoangtrann avatar Dec 10 '25 20:12 hoangtrann

Re-reading this I'm slightly annoyed that it reintroduces a complexity that we had simplified before.

I wonder if it would work to add a OR NOT EXISTS (SELECT 1 FROM queue_job_lock...) in _query_requeue_dead_jobs.

sbidoul avatar Dec 31 '25 10:12 sbidoul

@sbidoul I've adapted the query to your suggestions and tests are passing!

hoangtrann avatar Dec 31 '25 12:12 hoangtrann

This PR has the approved label and has been created more than 5 days ago. It should therefore be ready to merge by a maintainer (or a PSC member if the concerned addon has no declared maintainer). 🤖

OCA-git-bot avatar Dec 31 '25 13:12 OCA-git-bot

Execution plan looks like this (with the missing where clause added), with one job started with no lock held, and one enqueued with no lock record. So looks good to me:

 Update on queue_job  (cost=0.43..7.30 rows=1 width=106) (actual time=0.146..0.711 rows=2 loops=1)
   ->  Index Scan using queue_job_state_index on queue_job  (cost=0.43..7.30 rows=1 width=106) (actual time=0.046..0.095 rows=2 loops=1)
         Index Cond: ((state)::text = ANY ('{enqueued,started}'::text[]))
         Filter: ((date_enqueued < ((now() AT TIME ZONE 'utc'::text) - '00:00:10'::interval)) AND ((SubPlan 1) OR (NOT (SubPlan 2))))
         SubPlan 1
           ->  LockRows  (cost=0.43..2.66 rows=1 width=10) (actual time=0.010..0.010 rows=0 loops=2)
                 ->  Index Scan using queue_job_lock_queue_job_id_index on queue_job_lock  (cost=0.43..2.65 rows=1 width=10) (actual time=0.008..0.008 rows=0 loops=2)
                       Index Cond: (queue_job_id = queue_job.id)
         SubPlan 2
           ->  Index Only Scan using queue_job_lock_queue_job_id_index on queue_job_lock queue_job_lock_1  (cost=0.43..2.65 rows=1 width=0) (actual time=0.004..0.004 rows=0 loops=1)
                 Index Cond: (queue_job_id = queue_job.id)
                 Heap Fetches: 0
 Planning Time: 0.247 ms
 Trigger queue_job_notify: time=0.083 calls=2
 Execution Time: 0.855 ms

sbidoul avatar Jan 02 '26 09:01 sbidoul

Ah no, there is a misunderstanding. What you did was good, but the missing where clause was WHERE queue_job_lock.queue_job_id = queue_job.id in the first id IN .

sbidoul avatar Jan 02 '26 15:01 sbidoul

So like this:

            WHERE
                state IN ('enqueued','started')
                AND date_enqueued < (now() AT TIME ZONE 'utc' - INTERVAL '10 sec')
                AND (
                    id in (
                        SELECT
                            queue_job_id
                        FROM
                            queue_job_lock
                        WHERE
                            queue_job_lock.queue_job_id = queue_job.id
                        FOR UPDATE SKIP LOCKED
                    )
                    OR NOT EXISTS (
                        SELECT
                            1
                        FROM
                            queue_job_lock
                        WHERE
                            queue_job_lock.queue_job_id = queue_job.id
                    )
                )
            RETURNING uuid

sbidoul avatar Jan 02 '26 15:01 sbidoul

@sbidoul I've reverted the changes and add the WHERE clause

hoangtrann avatar Jan 02 '26 15:01 hoangtrann

/ocabot merge patch

sbidoul avatar Jan 02 '26 16:01 sbidoul

On my way to merge this fine PR! Prepared branch 18.0-ocabot-merge-pr-853-by-sbidoul-bump-patch, awaiting test results.

OCA-git-bot avatar Jan 02 '26 16:01 OCA-git-bot

Thanks! This version reads even better than before.

sbidoul avatar Jan 02 '26 16:01 sbidoul

Congratulations, your PR was merged at d52a4332e2741c9989a36758d706cf5124936668. Thanks a lot for contributing to OCA. ❤️

OCA-git-bot avatar Jan 02 '26 16:01 OCA-git-bot