
Add configuration setting for retry window

Open mitchell-bu opened this issue 5 years ago • 3 comments

Feature description

After a job has been running for 4 hours, Graphile Worker assumes that the job got stuck and kicks off another worker. We would like either an option to stop Graphile Worker from kicking off another worker while locked_at and locked_by are populated, or a way to configure how long it waits before kicking off another worker.

Motivating example

We have a long-running job that takes about 6 hours to complete (we know, long jobs are bad). We've been struggling to try to figure out how we're getting duplicate results from this job before finally catching the 4-hour rule.

Supporting development

If pointed to where in the code this is determined, I'm happy to make a PR for the change.

I [tick all that apply]:

  • [x] am interested in building this feature myself
  • [x] am interested in collaborating on building this feature
  • [x] am willing to help test this feature before it's released
  • [ ] am willing to write a test-driven test suite for this feature (before it exists)
  • [ ] am a Graphile sponsor ❤️
  • [ ] have an active support or consultancy contract with Graphile

mitchell-bu avatar Jan 14 '21 16:01 mitchell-bu

Yeah, jobs are expected to complete in well under 4 hours. I believe that get_job actually accepts an argument that is the duration to wait (default 4 hours); so exposing that as an option that's configurable on CLI and library would be welcome.
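To make the expiry rule concrete, here is a minimal sketch of the check being discussed, assuming a lock counts as stale once it is older than a configurable window. The names `isLockExpired`, `jobExpiryMs`, and `DEFAULT_JOB_EXPIRY_MS` are hypothetical and used only for illustration; they are not Graphile Worker's API, and the real check is done in the database by get_job.

```typescript
// Hypothetical illustration only; none of these names come from Graphile Worker.
// The 4-hour default mirrors the default duration mentioned in this thread.
const DEFAULT_JOB_EXPIRY_MS = 4 * 60 * 60 * 1000;

/**
 * Returns true when a job's lock is older than the expiry window,
 * i.e. when another worker would be allowed to pick the job up again.
 */
function isLockExpired(
  lockedAt: Date | null,
  now: Date,
  jobExpiryMs: number = DEFAULT_JOB_EXPIRY_MS,
): boolean {
  if (lockedAt === null) {
    return false; // the job is not locked, so there is no lock to expire
  }
  return now.getTime() - lockedAt.getTime() > jobExpiryMs;
}
```

With the default window, a job locked 6 hours ago is treated as stuck; with an 8-hour window, the same job is left alone, which is the behaviour the feature request asks for.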

We should also, separately, warn people if their jobs have taken more than, say, one quarter of the maximum time. This should let them know they should increase the aforementioned option. (This should be a separate PR to that of the first paragraph.)
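The warning idea can be sketched in the same way. This assumes the elapsed run time and the maximum are both known in milliseconds; `shouldWarnLongRunning` and its `fraction` parameter are hypothetical names, and one quarter is just the example threshold suggested above.

```typescript
// Hypothetical illustration only; not part of Graphile Worker.
/**
 * Returns true once a job has run for more than `fraction` of the
 * maximum allowed time (one quarter by default).
 */
function shouldWarnLongRunning(
  elapsedMs: number,
  maxJobDurationMs: number,
  fraction: number = 0.25,
): boolean {
  return elapsedMs > maxJobDurationMs * fraction;
}
```

With a 4-hour maximum, this would start warning once a job passes the 1-hour mark.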

I think this can be achieved without modifying the database. Since you're interested in building this feature, it would be great if you opened up a draft PR with a minimal implementation and we'll go from there. I'm not expecting many lines of code to need to change; effectively you just need to accept a new option and pass it to get_job. You should also scan the code to see if the "four hours" idea is hardcoded anywhere other than the default for the get_job function. Don't worry about updating the README/etc at first, I just want to see if it's possible/reasonable to implement this.

benjie avatar Jan 15 '21 16:01 benjie

Great, I see that argument in get_job. However, the reschedule_jobs function has 4 hours hardcoded, so it looks like a database update will be needed to add an argument to it.

mitchell-bu avatar Jan 15 '21 17:01 mitchell-bu

Sounds good; have at it 👍

benjie avatar Jan 16 '21 19:01 benjie

[semi-automated message] Hi, there has been no activity in this issue for a while so I'm closing it to keep the issues/pull requests manageable. If this is still an issue, please re-open with additional details.

benjie avatar Oct 20 '23 16:10 benjie