queue icon indicating copy to clipboard operation
queue copied to clipboard

[14.0][IMP] queue_job: cancel job before it retries

Open florentx opened this issue 8 months ago • 5 comments

Use case:

  • a job hangs due to some concurrent access in the database, it prevents other job from running
  • after some time it fails with retryable error
  • it is retried short time after, and it can continue like this for some time
  • since the button "Cancel" is not authorized on running jobs, the operator cannot cancel this job

This patch allows to Cancel running jobs:

  • job is started, then operator chooses to cancel it (because it already retried 2 times, for example, and is blocking other jobs)
  • when button is pressed, job record is set to "Cancelled"
  • job will not be interrupted
  • if job ends successfully: its state is saved to "Done"
  • if it fails with an error, its state stays "Cancelled" and it is not retried

florentx avatar Jun 02 '25 07:06 florentx

Hi @guewen, some modules you are maintaining are being modified, check this out!

OCA-git-bot avatar Jun 02 '25 07:06 OCA-git-bot

I understand that it is not perfect solution, however it helps in real situations.

Using action Cancel is a last resort solution when there's some Production issue. It is not so important that the job will finish running (trying) before it is really cancelled.

Moreover, one of the case which is tackled here does not comply with the max_retries parameter. So an alternative solution will not work. These errors are PostgreSQL concurrency errors, like SERIALIZATION_FAILURE. They are retried indefinitely.

In my opinion this button Cancel should be convenient for the system admin which needs some tool to unblock Production.

Then when system load goes down, then Administrator can find easily the "cancelled" jobs and use button "Requeue" in order to process them, one at a time. With button "set to done", it is not so easy to identify which ones should be requeued later (or dealt with in a different manner).

florentx avatar Jun 03 '25 06:06 florentx

But if I understand correctly, in your use case, you could cancel a running job, and it could still end up in done state? Is that not counter intuitive?

These errors are PostgreSQL concurrency errors, like SERIALIZATION_FAILURE. They are retried indefinitely.

I did not know these were retried indefinitely. Is that intended? Shouldn't we address that instead?

sbidoul avatar Jun 03 '25 08:06 sbidoul

But if I understand correctly, in your use case, you could cancel a running job, and it could still end up in done state? Is that not counter intuitive?

Yes, if the job is just started, it can finish successfully, even if operator presses button to cancel. This was also the case previously, because Job state was not re-checked before setting "Cancelled". In my opinion, this behavior is acceptable, since button Cancel is for abnormal situations only.

I did not know these were retried indefinitely. Is that intended? Shouldn't we address that instead?

Regarding the infinite retry, code is here: https://github.com/OCA/queue/blob/18.0/queue_job/controllers/main.py#L115-L122

Maybe it's legit to keep retrying, if the database load decreases later, and it can terminate successfully. I don't know.

florentx avatar Jun 03 '25 08:06 florentx

I did not know these were retried indefinitely. Is that intended? Shouldn't we address that instead?

Regarding the infinite retry, code is here: https://github.com/OCA/queue/blob/18.0/queue_job/controllers/main.py#L115-L122

Maybe it's legit to keep retrying, if the database load decreases later, and it can terminate successfully. I don't know.

TIL. I have been catching SerializationFailure and rethrowing RetryableJobError myself for years! I have never seen a job actually make it all the way to "infinite" retries. Even my most congested jobs running in tens of thousands with a couple dozen workers in parallel still resolve after my retry_pattern spaces them out.

Alternate perspective: Would it make more sense to have the job run out of retries and fail (with some sort of notification), then an admin could restart the job later, rather than have to recover it in real time?

amh-mw avatar Jun 03 '25 11:06 amh-mw

There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days. If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.

github-actions[bot] avatar Oct 19 '25 12:10 github-actions[bot]