GOBBLIN-1692 Make GobblinHelixJobScheduler stop Helix workflow asynchronously
… test cases
Dear Gobblin maintainers,
Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!
JIRA
- [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-1692] My Gobblin PR"
- https://issues.apache.org/jira/browse/GOBBLIN-1692
Description
- [ ] Here are some details about my PR, including screenshots (if applicable): In handleUpdateJobConfigArrival, when a new job config is posted, GobblinHelixJobScheduler first stops and deletes the old job, then tries to spin up the updated Helix workflow. The job scheduler performs the stop synchronously with a default timeout of 10 seconds. However, the stop consistently takes longer than that timeout in Helix, so the job state is not correctly updated to stopped. As a result, when the GobblinHelixJobLauncher is constructed, the previous job is still in the wrong state because jobRunningMap has not been updated yet, and the new job is not launched. So we always see this log: Job {} will not be executed because other jobs are still running.
We can make the job deletion asynchronous and let the waitForJobCompletion method ensure that the job status is eventually updated correctly.
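To illustrate the idea, here is a minimal, self-contained sketch of the flow (request the stop without blocking, then poll a running map before relaunching). It is not the actual Gobblin code: the class `AsyncStopSketch`, the methods `requestStopAsync`, `awaitNotRunning`, and `replaceJob`, and the simulated stop delay are all hypothetical stand-ins, and the map only mimics the role that jobRunningMap plays.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

class AsyncStopSketch {
  // Hypothetical stand-in for jobRunningMap: job name -> "is a run still in flight?"
  static final Map<String, Boolean> jobRunningMap = new ConcurrentHashMap<>();

  // Hypothetical non-blocking stop: returns immediately. A background thread
  // clears the running flag later, simulating the launcher's cleanup.
  static void requestStopAsync(String jobName, long stopDelayMillis) {
    Thread stopper = new Thread(() -> {
      try {
        Thread.sleep(stopDelayMillis);
        jobRunningMap.put(jobName, false);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    stopper.setDaemon(true);
    stopper.start();
  }

  // Polls the map until the job is no longer marked as running or the timeout elapses.
  static boolean awaitNotRunning(String jobName, long timeoutMillis) {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
    while (jobRunningMap.getOrDefault(jobName, false)) {
      if (System.nanoTime() - deadline >= 0) {
        return false;
      }
      try {
        Thread.sleep(10);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return true;
  }

  // Update flow: request the stop, wait for the flag to clear, then relaunch.
  static boolean replaceJob(String jobName, long stopDelayMillis, long timeoutMillis) {
    requestStopAsync(jobName, stopDelayMillis);
    if (!awaitNotRunning(jobName, timeoutMillis)) {
      return false; // old run is still marked as running, so the launch is skipped
    }
    jobRunningMap.put(jobName, true); // the new run is now in flight
    return true;
  }

  public static void main(String[] args) {
    jobRunningMap.put("job1", true);
    System.out.println("relaunched: " + replaceJob("job1", 50, 2000));
  }
}
```

The point of the sketch is that the caller never blocks on the stop call itself; it only waits on the shared running state, which is what the new launch decision depends on anyway.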
Tests
- [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason: Added unit tests.
Commits
- [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
- Subject is separated from body by a blank line
- Subject is limited to 50 characters
- Subject does not end with a period
- Subject uses the imperative mood ("add", not "adding")
- Body wraps at 72 characters
- Body explains "what" and "why", not "how"
Codecov Report
Attention: Patch coverage is 52.94118% with 8 lines in your changes missing coverage. Please review.
Project coverage is 46.81%. Comparing base (e0d3c78) to head (fb42365). Report is 404 commits behind head on master.
| Files | Patch % | Lines |
|---|---|---|
| ...ache/gobblin/cluster/GobblinHelixJobScheduler.java | 52.94% | 6 Missing and 2 partials :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #3546 +/- ##
============================================
+ Coverage 46.62% 46.81% +0.18%
- Complexity 10456 10501 +45
============================================
Files 2084 2089 +5
Lines 81620 81871 +251
Branches 9103 9126 +23
============================================
+ Hits 38058 38329 +271
+ Misses 40043 40001 -42
- Partials 3519 3541 +22
To summarize what this PR is trying to address: when an update-job event is received, GobblinHelixJobScheduler tries to stop the old job and then launch the new one. To stop the old job, we used to make a synchronous waitToStop call through Helix. HelixUtils.waitJobCompletion then detects that the job state has changed to stopping and immediately deletes the job, which causes waitToStop to always throw an exception. Changing waitToStop to an asynchronous call avoids the exception, and we learn that the job has completed by checking jobRunningMap, which is updated in the JobLauncher. The incorrect deletion timing in HelixUtils.waitJobCompletion will be addressed in a separate PR.
Also, be aware that this will break any job that needs more than 15 minutes.
HELIX_JOB_WAIT_COMPLETION_TIMEOUT_SECONDS is the config that tunes the job wait-completion timeout.
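As a rough illustration of how such a timeout could be resolved from job properties, here is a small sketch. The property key string and the class `WaitTimeoutConfigSketch` are hypothetical (the real key behind HELIX_JOB_WAIT_COMPLETION_TIMEOUT_SECONDS may differ), and the 15-minute default is only assumed from the comment above.

```java
import java.util.Properties;

class WaitTimeoutConfigSketch {
  // Hypothetical property key, used for illustration only; not the real Gobblin key.
  static final String TIMEOUT_KEY = "example.helix.job.waitCompletionTimeoutSeconds";

  // Assumed default of 15 minutes, taken from the review comment above.
  static final long DEFAULT_TIMEOUT_SECONDS = 15 * 60;

  // Returns the configured timeout in seconds, or the assumed default when the key is unset.
  static long waitTimeoutSeconds(Properties props) {
    String raw = props.getProperty(TIMEOUT_KEY);
    return raw == null ? DEFAULT_TIMEOUT_SECONDS : Long.parseLong(raw.trim());
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    // A long-running job would raise the timeout above the default.
    props.setProperty(TIMEOUT_KEY, "3600");
    System.out.println("wait timeout (s): " + waitTimeoutSeconds(props));
  }
}
```

Jobs that are expected to run longer than the default would need to set this value explicitly in their config.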