[SPARK-40381][DEPLOY] Support standalone worker recommission
What changes were proposed in this pull request?
Support standalone worker recommission via the master REST API.
Why are the changes needed?
Standalone worker recommission is needed so that an in-progress worker decommission can be cancelled.
Does this PR introduce any user-facing change?
Yes. Added a new API in the standalone master to recommission workers.
How was this patch tested?
Added tests in MasterSuite
Can one of the admins verify this patch?
@dongjoon-hyun Could you help take a look?
Sorry, but I don't use Standalone cluster.
Any ideas who is right person to review this?
What's the intent of recommissioning? I don't quite get it: normally decommissioning means the worker is going away. Is this for the situation where (for example) planned maintenance is cancelled?
@holdenk Thanks for looking at this. We scale workers up and down manually based on Spark metrics; "manually" means we use the master REST API to kill workers. Say we have 10 nodes to decommission: 9 finish quickly, but the last one takes much longer because it has more shuffle data to migrate or long-tail tasks still running. We want to cancel the decommission on that long-tail node to avoid shuffle data loss or task failures, because a decommissioning worker can't accept new tasks, so keeping it in that state is a waste of resources.
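To make the workflow concrete, here is a minimal sketch of the intended operator flow. It assumes the master web UI is reachable at `spark-master:8080` and that killing workers via `POST /workers/kill` is allowed on that deployment; the `/workers/recommission` endpoint name is an illustration of what this PR proposes, not a shipped API. The commands are printed rather than executed so the sketch runs without a live cluster.

```shell
#!/bin/sh
# Hypothetical host/port: adjust for your deployment.
MASTER_UI="http://spark-master:8080"
HOST="worker-10.example.com"

# Step 1: ask the master to decommission a worker (existing-style endpoint;
# the worker stops accepting new tasks and migrates its shuffle data).
echo "curl -X POST \"${MASTER_UI}/workers/kill?host=${HOST}\""

# Step 2: if the decommission drags on (long-tail tasks, large shuffle data),
# cancel it with the recommission call proposed here (endpoint name assumed).
echo "curl -X POST \"${MASTER_UI}/workers/recommission?host=${HOST}\""
```

In a real run you would execute the printed `curl` commands directly instead of echoing them.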
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!