[core] Fix demand leak when worker failed
Why are these changes needed?
The root cause is that when a worker fails, the tasks assigned to it are not cancelled. This is normally harmless, but when the cluster does not have enough resources, the leftover demand makes the autoscaler scale up, which gets expensive if the tasks are never cancelled.
This PR fixes this by cancelling the tasks that belong to failed workers.
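To illustrate the leak, here is a minimal toy model in Python. It is only a sketch and not the actual change in this PR: `ToyScheduler`, `on_worker_failed`, `reported_demand`, and the `cancel_tasks` flag are hypothetical names made up for illustration, and the CPU numbers are arbitrary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyScheduler:
    """Toy model only; hypothetical names, not Ray's real scheduler code."""

    # Maps a worker id to the CPU requests of the tasks waiting on that worker.
    pending: Dict[str, List[int]] = field(default_factory=dict)

    def submit(self, worker_id: str, cpus: int) -> None:
        self.pending.setdefault(worker_id, []).append(cpus)

    def on_worker_failed(self, worker_id: str, cancel_tasks: bool) -> None:
        # Without cancellation, the failed worker's tasks stay queued,
        # so their resource requests keep counting as demand.
        if cancel_tasks:
            self.pending.pop(worker_id, None)

    def reported_demand(self) -> int:
        # Total CPUs that an autoscaler would be asked to provide.
        return sum(sum(reqs) for reqs in self.pending.values())


def demand_after_failure(cancel_tasks: bool) -> int:
    sched = ToyScheduler()
    sched.submit("worker-a", 4)
    sched.submit("worker-a", 4)
    sched.submit("worker-b", 2)
    sched.on_worker_failed("worker-a", cancel_tasks=cancel_tasks)
    return sched.reported_demand()
```

In this model, `demand_after_failure(False)` still reports the failed worker's requests (the leak), while `demand_after_failure(True)` reports only the demand from the surviving worker.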
Related issue number
Closes #25429
Checks
- [ ] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
    - [ ] Unit tests
    - [ ] Release tests
    - [ ] This PR is not tested :(