GOBBLIN-1692 Make GobblinHelixJobScheduler stop Helix workflow asynchronously

Open hanghangliu opened this issue 3 years ago • 4 comments

… test cases

Dear Gobblin maintainers,

Please accept this PR. I understand that it will not be reviewed until I have checked off all the steps below!

JIRA

  • [ ] My PR addresses the following Gobblin JIRA issues and references them in the PR title. For example, "[GOBBLIN-1692] My Gobblin PR"
    • https://issues.apache.org/jira/browse/GOBBLIN-1692

Description

  • [ ] Here are some details about my PR, including screenshots (if applicable): When a new job config is posted and handleUpdateJobConfigArrival is invoked, GobblinHelixJobScheduler first stops and deletes the old job, then tries to spin up the updated Helix workflow. The job scheduler performs the stop synchronously with a default timeout of 10 seconds. However, the stop consistently takes longer than this timeout in Helix, so the job state is not correctly updated to stopped. As a result, when the GobblinHelixJobLauncher is constructed, the previous job is still in the wrong state because jobRunningMap has not been updated yet, and the new job is not launched. So we always see this log: Job {} will not be executed because other jobs are still running.

We can make the job deletion asynchronous and let the waitForJobCompletion method ensure that the job status is eventually updated correctly.
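The idea can be illustrated with a minimal, self-contained sketch. This is not the code in this PR: AsyncStopSketch, stopWorkflow, stopAsync, and waitUntilStopped are hypothetical names, the map is only a stand-in for jobRunningMap, and no Helix API is called. It shows a stop request that returns immediately while the caller polls the running-state map with its own timeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

public class AsyncStopSketch {
    // Stand-in for the scheduler's jobRunningMap: job name -> "is still running".
    static final ConcurrentHashMap<String, Boolean> jobRunningMap = new ConcurrentHashMap<>();

    // Hypothetical slow stop: simulates a workflow that needs stopMillis to wind down,
    // then records the job as no longer running.
    static void stopWorkflow(String jobName, long stopMillis) {
        try {
            Thread.sleep(stopMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return;
        }
        jobRunningMap.put(jobName, false);
    }

    // Fire the stop without blocking the caller.
    static CompletableFuture<Void> stopAsync(String jobName, long stopMillis) {
        return CompletableFuture.runAsync(() -> stopWorkflow(jobName, stopMillis));
    }

    // Poll the map until the job is marked as stopped or the timeout elapses.
    // Returns true if the job stopped in time, false on timeout or interruption.
    static boolean waitUntilStopped(String jobName, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (jobRunningMap.getOrDefault(jobName, false)) {
            if (System.nanoTime() - deadline >= 0) {
                return false;
            }
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        jobRunningMap.put("job1", true);
        stopAsync("job1", 200);                       // returns immediately
        boolean stopped = waitUntilStopped("job1", 2000);
        System.out.println("new job may launch: " + stopped);
    }
}
```

In this sketch the decision to launch the new job depends only on the state recorded in the map, not on whether the stop call itself returned within a fixed window.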

Tests

  • [ ] My PR adds the following unit tests OR does not need testing for this extremely good reason: Added unit tests.

Commits

  • [ ] My commits all reference JIRA issues in their subject lines, and I have squashed multiple commits if they address the same issue. In addition, my commits follow the guidelines from "How to write a good git commit message":
    1. Subject is separated from body by a blank line
    2. Subject is limited to 50 characters
    3. Subject does not end with a period
    4. Subject uses the imperative mood ("add", not "adding")
    5. Body wraps at 72 characters
    6. Body explains "what" and "why", not "how"

hanghangliu avatar Aug 26 '22 01:08 hanghangliu

Codecov Report

Attention: Patch coverage is 52.94118% with 8 lines in your changes missing coverage. Please review.

Project coverage is 46.81%. Comparing base (e0d3c78) to head (fb42365). Report is 404 commits behind head on master.

Files Patch % Lines
...ache/gobblin/cluster/GobblinHelixJobScheduler.java 52.94% 6 Missing and 2 partials :warning:
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #3546      +/-   ##
============================================
+ Coverage     46.62%   46.81%   +0.18%     
- Complexity    10456    10501      +45     
============================================
  Files          2084     2089       +5     
  Lines         81620    81871     +251     
  Branches       9103     9126      +23     
============================================
+ Hits          38058    38329     +271     
+ Misses        40043    40001      -42     
- Partials       3519     3541      +22     

codecov-commenter avatar Aug 26 '22 01:08 codecov-commenter

To summarize what this PR is trying to address: when an update-job event is received, GobblinHelixJobScheduler tries to stop the old job and then launch the new one. To stop the old job, we used to make a synchronous waitToStop call through Helix. HelixUtils.waitJobCompletion detects that the job state has changed to stopping and immediately deletes the job, which causes waitToStop to always throw an exception. Changing waitToStop to an asynchronous call avoids the exception, and we learn that the job has completed by checking jobRunningMap, which is updated in the JobLauncher. The incorrect deletion timing in HelixUtils.waitJobCompletion will be addressed in a separate PR.

hanghangliu avatar Sep 07 '22 22:09 hanghangliu

Also, be aware that this will break any job which needs more time than 15 mins.

arjun4084346 avatar Sep 07 '22 22:09 arjun4084346

HELIX_JOB_WAIT_COMPLETION_TIMEOUT_SECONDS is the config for tuning the job wait completion timeout.
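
As a rough illustration of how such a timeout setting is typically read and applied, here is a small sketch. The property key below is hypothetical (the actual key behind HELIX_JOB_WAIT_COMPLETION_TIMEOUT_SECONDS is not shown in this thread), and the 900-second default simply mirrors the 15 minutes mentioned above as an assumption.

```java
import java.util.Properties;
import java.util.concurrent.TimeUnit;

public class WaitTimeoutSketch {
    // Hypothetical property key, used for illustration only.
    static final String TIMEOUT_KEY = "example.helix.job.wait.completion.timeout.seconds";

    // Assumed default: the 15 minutes mentioned in the comment above.
    static final long DEFAULT_TIMEOUT_SECONDS = TimeUnit.MINUTES.toSeconds(15);

    // Read the timeout from job properties, falling back to the default.
    static long timeoutSeconds(Properties props) {
        String raw = props.getProperty(TIMEOUT_KEY);
        return raw == null ? DEFAULT_TIMEOUT_SECONDS : Long.parseLong(raw.trim());
    }

    // A job whose expected runtime exceeds the timeout would hit the wait limit.
    static boolean fitsInTimeout(long expectedRuntimeSeconds, Properties props) {
        return expectedRuntimeSeconds <= timeoutSeconds(props);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        System.out.println("30-minute job fits by default: " + fitsInTimeout(1800, props));
        props.setProperty(TIMEOUT_KEY, "3600");       // raise the limit to one hour
        System.out.println("30-minute job fits after tuning: " + fitsInTimeout(1800, props));
    }
}
```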

hanghangliu avatar Sep 07 '22 22:09 hanghangliu