[SPARK-40381][DEPLOY] Support standalone worker recommission

Open warrenzhu25 opened this issue 3 years ago • 4 comments

What changes were proposed in this pull request?

Support standalone worker recommission via the master REST API.

Why are the changes needed?

Support for standalone worker recommission is needed.

Does this PR introduce any user-facing change?

Yes. Added a new API in the standalone master to recommission workers.

How was this patch tested?

Added tests in MasterSuite

warrenzhu25 avatar Sep 07 '22 18:09 warrenzhu25

Can one of the admins verify this patch?

AmplabJenkins avatar Sep 09 '22 08:09 AmplabJenkins

@dongjoon-hyun Could you help take a look?

warrenzhu25 avatar Sep 20 '22 17:09 warrenzhu25

Sorry, but I don't use Standalone cluster.

dongjoon-hyun avatar Sep 20 '22 17:09 dongjoon-hyun

> Sorry, but I don't use Standalone cluster.

Any idea who the right person to review this would be?

warrenzhu25 avatar Sep 20 '22 17:09 warrenzhu25

What's the intent of recommissioning? I don't super get it; normally decommissioning means the worker is going away. Is this for the situation where (for example) planned maintenance is cancelled?

holdenk avatar Oct 06 '22 20:10 holdenk

> What's the intent of recommissioning? I don't super get it; normally decommissioning means the worker is going away. Is this for the situation where (for example) planned maintenance is cancelled?

@holdenk Thanks for looking at this. We manually scale workers up and down based on Spark metrics. "Manually" means we use the master REST API to kill workers. Say we have 10 nodes to decommission: 9 finish quickly, but the last one takes a long time because it may have more shuffle data to migrate or some long-tail tasks still running. We want to cancel the decommission of that long-tail node to avoid shuffle data loss or task failures. Also, a node can't run new tasks while it is decommissioning, so leaving it in that state wastes resources.

warrenzhu25 avatar Oct 06 '22 22:10 warrenzhu25
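
The selection step in the scenario above could be sketched roughly as follows. This is only an illustration, not code from the PR: `WorkerStatus`, `pick_workers_to_recommission`, and the timeout value are hypothetical names and numbers, and the actual call to the master's recommission endpoint is deliberately left out because its shape is not shown in this thread.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkerStatus:
    # Hypothetical snapshot of one worker, as a scaling script might track it.
    worker_id: str
    decommissioning: bool
    seconds_since_decommission: float


def pick_workers_to_recommission(
    workers: List[WorkerStatus], timeout_seconds: float
) -> List[str]:
    """Return ids of workers still decommissioning after the timeout.

    These are the "long tail" nodes whose decommission the operator
    would want to cancel instead of waiting or killing them.
    """
    return [
        w.worker_id
        for w in workers
        if w.decommissioning and w.seconds_since_decommission > timeout_seconds
    ]


# Example: worker-1 already finished, worker-2 has only just started,
# worker-3 has been stuck well past the (assumed) 10-minute timeout.
workers = [
    WorkerStatus("worker-1", decommissioning=False, seconds_since_decommission=120.0),
    WorkerStatus("worker-2", decommissioning=True, seconds_since_decommission=90.0),
    WorkerStatus("worker-3", decommissioning=True, seconds_since_decommission=1800.0),
]
long_tail = pick_workers_to_recommission(workers, timeout_seconds=600.0)
```

Here `long_tail` would hold only the worker that exceeded the timeout; those ids are what a script would pass to whatever recommission call the master exposes.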

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

github-actions[bot] avatar Jan 15 '23 00:01 github-actions[bot]