
After requeueing a job, first run often fails to update date_done with concurrent error

Open thomaspaulb opened this issue 2 months ago • 6 comments

Module

queue_job

Describe the bug

I see in my log many cases of:

# good case
2025-12-09 07:22:43,904 2229881 WARNING ? odoo.addons.queue_job.jobrunner.runner: Re-queued dead job with uuid: 76c33f7c-2a1f-4564-9545-2daccf99026d 
2025-12-09 07:22:43,908 2229881 INFO ? odoo.addons.queue_job.jobrunner.runner: asking Odoo to run job 76c33f7c-2a1f-4564-9545-2daccf99026d on db ....

# bad case
2025-12-09 07:22:43,922 2229881 WARNING ? odoo.addons.queue_job.jobrunner.runner: Re-queued dead job with uuid: b8b82330-6c0d-44a6-b12c-1fde5bf8ee42 
2025-12-09 07:22:43,926 2229881 INFO ? odoo.addons.queue_job.jobrunner.runner: asking Odoo to run job b8b82330-6c0d-44a6-b12c-1fde5bf8ee42 on db ....
2025-12-09 07:22:43,954 2820876 ERROR .... odoo.sql_db: bad query: UPDATE "queue_job" SET "date_done" = '2025-12-09 07:22:43.951733', "dependencies" = '{"depends_on": [], "reverse_depends_on": []}', "exec_time" = 0.034894, "retry" = 1, "state" = 'done' WHERE id IN (22671925)
ERROR: could not serialize access due to concurrent update
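The failure mode in the bad case — the first attempt to mark the job done hits a serialization error and only a retry commits — can be sketched as follows. This is a purely illustrative simulation (the function and exception names are hypothetical, not queue_job's actual code):

```python
class SerializationError(Exception):
    """Stand-in for PostgreSQL's 'could not serialize access due to
    concurrent update' error raised through psycopg2."""

def mark_done_with_retry(attempts_that_fail, max_retries=3):
    """Simulate marking a queue_job row as done: the first N attempts
    fail with a concurrent-update error and are retried.
    Returns the attempt number that finally succeeded."""
    for attempt in range(1, max_retries + 1):
        try:
            if attempt <= attempts_that_fail:
                raise SerializationError(
                    "could not serialize access due to concurrent update")
            return attempt
        except SerializationError:
            continue  # another transaction touched the row; retry
    raise RuntimeError("gave up after %d retries" % max_retries)

# As in the log: first run fails with a concurrent error, retry goes through.
assert mark_done_with_retry(attempts_that_fail=1) == 2
```

This matches the `retry = 1` (and later `retry = 2`) values visible in the UPDATE statements above.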

To Reproduce

Affected versions: 16.0

Steps to reproduce the behavior:

  1. Have a job start
  2. Kill the worker to make the job "dead"
  3. Job gets enqueued
  4. Job starts
  5. Concurrent error; second retry goes through

Expected behavior

First run goes through directly.

Additional context

I don't know at the moment what is locking the queue_job record: either the previous "dead" run (unlikely, because the jobrunner itself already wrote to that record), or the jobrunner itself (more likely, but also strange, since its cursor should be committed and closed at that point).

thomaspaulb avatar Dec 09 '25 11:12 thomaspaulb

More details: after walking through one example, there seems to be a chain of events leading up to this:

2025-12-09 07:22:31,847 2229881 INFO ? odoo.addons.queue_job.jobrunner.runner: asking Odoo to run job 76c33f7c-2a1f-4564-9545-2daccf99026d on db $customer 
2025-12-09 07:22:43,904 2229881 WARNING ? odoo.addons.queue_job.jobrunner.runner: Re-queued dead job with uuid: 76c33f7c-2a1f-4564-9545-2daccf99026d
2025-12-09 07:22:43,908 2229881 INFO ? odoo.addons.queue_job.jobrunner.runner: asking Odoo to run job 76c33f7c-2a1f-4564-9545-2daccf99026d on db $customer
2025-12-09 07:22:45,639 2805650 INFO $customer odoo.addons.$customer.models.sale_order: Sale Order: SO2945974 confirmed, state done!
2025-12-09 07:22:45,693 2805650 ERROR $customer odoo.sql_db: bad query: UPDATE "queue_job" SET "date_done" = '2025-12-09 07:22:45.690848', "dependencies" = '{"depends_on": [], "reverse_depends_on": []}', "exec_time" = 1.797245, "retry" = 1, "state" = 'done' WHERE id IN (22671849)
ERROR: could not serialize access due to concurrent update
2025-12-09 07:22:53,694 2820876 WARNING $customer odoo.addons.queue_job.controllers.main: was requested to run job 76c33f7c-2a1f-4564-9545-2daccf99026d, but it does not exist, or is not in state enqueued
2025-12-09 07:22:54,487 2229881 INFO ? odoo.addons.queue_job.jobrunner.runner: asking Odoo to run job 76c33f7c-2a1f-4564-9545-2daccf99026d on db $customer
2025-12-09 07:23:04,546 2229881 WARNING ? odoo.addons.queue_job.jobrunner.runner: Re-queued dead job with uuid: 76c33f7c-2a1f-4564-9545-2daccf99026d
2025-12-09 07:23:07,793 2805650 INFO $customer odoo.addons.$customer.models.sale_order: Sale Order: SO2945974 confirmed, state done!
2025-12-09 07:23:12,493 2805650 WARNING $customer odoo.addons.queue_job.controllers.main: was requested to run job 76c33f7c-2a1f-4564-9545-2daccf99026d, but it does not exist, or is not in state enqueued 
$customer=> select id, state, date_done, retry from queue_job where uuid = '76c33f7c-2a1f-4564-9545-2daccf99026d';
    id    | state |         date_done          | retry 
----------+-------+----------------------------+-------
 22671849 | done  | 2025-12-09 07:23:07.843269 |     2

So the timeline of events here is:

  • Jobrunner asks Odoo to run the job
  • 12 seconds later, the job is presumed dead (queue_job_lock lock got lost?); jobrunner again asks Odoo to run the job
  • Job runs, but can't be set to done because of a lock on the queue_job record
  • Job runs again, but isn't found as runnable, so it is skipped
  • Jobrunner asks Odoo to run the job
  • 10 seconds later, the job is again presumed dead (queue_job_lock lock got lost?)
  • Job runs and completes
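The "presumed dead" steps above can be sketched roughly like this. Assume (hypothetically; this is not queue_job's actual implementation) that a job counts as dead when it is in an active state, holds no live lock, and has sat there longer than some threshold:

```python
DEAD_AFTER = 10.0  # seconds; mirrors the ~10-12 s window seen in the logs

def find_dead_jobs(jobs, live_locks, now):
    """Hypothetical sketch of dead-job detection: a job that became
    enqueued/started more than DEAD_AFTER seconds ago and holds no
    live lock is considered dead and gets requeued."""
    dead = []
    for job in jobs:
        if (job["state"] in ("enqueued", "started")
                and job["uuid"] not in live_locks
                and now - job["enqueued_at"] > DEAD_AFTER):
            dead.append(job["uuid"])
    return dead

jobs = [{"uuid": "a", "state": "enqueued", "enqueued_at": 0.0}]
# No worker has picked the job up yet, so no lock exists; after 12 s
# the detector presumes it dead even though it is merely waiting:
assert find_dead_jobs(jobs, live_locks=set(), now=12.0) == ["a"]
```

The key point this illustrates: "waiting for a free worker" and "dead" look identical to the detector, because neither holds a lock.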

thomaspaulb avatar Dec 09 '25 11:12 thomaspaulb

Okay, I found the cause: this happens whenever all of Odoo's workers are occupied for longer than 10 seconds. The job is already "enqueued" in the sense that the HTTP request has started and we are waiting for Odoo to pick it up, but the jobrunner proactively requeues the job even though it is still in the HTTP pipeline, so to speak.

Proposing to solve this with a second lock in the jobrunner that marks whether a job is enqueued, alongside the current one that marks when a job has started.
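The proposed second lock could look roughly like this. All names here are illustrative, not the module's actual API; the idea is only that the dead-job detector would requeue jobs holding neither lock:

```python
class JobLocks:
    """Hypothetical two-phase lock bookkeeping: one lock is taken when
    the job is handed to Odoo (HTTP request sent), a second when a
    worker actually starts executing it."""

    def __init__(self):
        self.enqueued = set()
        self.started = set()

    def mark_enqueued(self, uuid):
        self.enqueued.add(uuid)

    def mark_started(self, uuid):
        self.enqueued.discard(uuid)
        self.started.add(uuid)

    def is_presumed_dead(self, uuid):
        # Only requeue a job that holds neither lock.
        return uuid not in self.enqueued and uuid not in self.started

locks = JobLocks()
locks.mark_enqueued("b8b82330")
# While the HTTP request waits for a free worker, the enqueued lock
# is held, so the job is not presumed dead and not requeued:
assert not locks.is_presumed_dead("b8b82330")
```

With only the current single (started) lock, the same job in the same situation would hold no lock at all, which is exactly the window in which it gets requeued.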

thomaspaulb avatar Dec 10 '25 08:12 thomaspaulb

Hmm, that case should be handled by the code that checks that the job is indeed in the enqueued state when starting it.

sbidoul avatar Dec 10 '25 19:12 sbidoul

But what happens is that the job is set to the enqueued state yet never started, because no worker is available, so there is also no lock. This situation persists for 10 seconds, until the job is reset to pending by the dead-job detection, and then it is enqueued again. When the job finally starts, it goes through because it is in the enqueued state. The second time it tries to run it is indeed skipped, so it does not run twice, but it is still a bit of a messy business.


thomaspaulb avatar Dec 10 '25 19:12 thomaspaulb

Yes but I don't quite understand where the concurrent update error comes from.

In your log above the job seems to run twice?

sbidoul avatar Dec 10 '25 21:12 sbidoul

You're right, it repeats because of the concurrent update. I haven't yet figured out where that comes from - there must be something that touches the queue_job record while the job is running.

That is supposed to be prevented by this "for update" lock, but that lock is released again by the subsequent commit call. Maybe job.lock() should also re-lock the actual queue_job record after asserting that the job state is "started" at that point, ... that could be another safeguard, but it still does not explain why this happens.
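The lifetime of that row lock is the crux: a `SELECT ... FOR UPDATE` lock only lives until the transaction ends, so the commit that persists the "started" state also releases it, leaving the row unprotected while the job body runs. A minimal threading sketch of that window (the `threading.Lock` stands in for the PostgreSQL row lock; the function names are hypothetical):

```python
import threading
import time

row_lock = threading.Lock()  # stand-in for the queue_job row lock

def perform_job(log):
    with row_lock:            # SELECT ... FOR UPDATE; set state = 'started'
        pass
    # COMMIT: leaving the block releases the lock. The row is now
    # unprotected for the whole time the job body runs.
    time.sleep(0.01)          # job body runs without holding the row lock
    with row_lock:            # the final "done" UPDATE must re-acquire it...
        log.append("done")

def other_writer(log):
    with row_lock:            # ...so any concurrent writer can get in first
        log.append("requeued by runner")

log = []
t1 = threading.Thread(target=perform_job, args=(log,))
t2 = threading.Thread(target=other_writer, args=(log,))
t1.start(); t2.start(); t1.join(); t2.join()
assert sorted(log) == ["done", "requeued by runner"]
```

In Python both writes simply serialize, but in PostgreSQL (at the isolation level the error message implies) the second writer touching the row is what turns the job's own final UPDATE into "could not serialize access due to concurrent update".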

thomaspaulb avatar Dec 10 '25 22:12 thomaspaulb