[core] Fix demand leak when worker failed

Open fishbone opened this issue 3 years ago • 0 comments

Why are these changes needed?

The root cause is that when a worker fails, the tasks tied to it are not cancelled. This is usually harmless, but when the cluster does not have enough resources, the autoscaler scales up to satisfy the leftover demand, and if we don't cancel those tasks the extra capacity gets expensive.

This PR fixes this by cancelling the tasks that belong to failed workers.
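
To illustrate the leak described above, here is a minimal toy model in Python. It is not Ray's implementation: `ToyScheduler`, `queue_task`, `on_worker_failed`, and the `cancel_tasks` flag are hypothetical names made up for this sketch, and it assumes that demand is simply the sum of CPUs requested by pending tasks.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for a scheduler that reports resource demand."""

    # task_id -> CPUs requested, for tasks still waiting to run
    pending: Dict[str, int] = field(default_factory=dict)
    # worker_id -> ids of pending tasks associated with that worker
    by_worker: Dict[str, Set[str]] = field(default_factory=dict)

    def queue_task(self, task_id: str, worker_id: str, cpus: int) -> None:
        self.pending[task_id] = cpus
        self.by_worker.setdefault(worker_id, set()).add(task_id)

    def demand(self) -> int:
        """Total CPUs an autoscaler would be asked to provide."""
        return sum(self.pending.values())

    def on_worker_failed(self, worker_id: str, cancel_tasks: bool) -> None:
        task_ids = self.by_worker.pop(worker_id, set())
        if cancel_tasks:
            # The fix, as modelled here: drop the failed worker's tasks
            # so they stop counting toward demand.
            for task_id in task_ids:
                self.pending.pop(task_id, None)


# Without cancellation, the dead worker's task keeps contributing demand.
leaky = ToyScheduler()
leaky.queue_task("t1", "w1", cpus=4)
leaky.on_worker_failed("w1", cancel_tasks=False)
leaky_demand = leaky.demand()

# With cancellation, the demand goes away together with the worker.
fixed = ToyScheduler()
fixed.queue_task("t1", "w1", cpus=4)
fixed.on_worker_failed("w1", cancel_tasks=True)
fixed_demand = fixed.demand()
```

In this model the leaky scheduler still reports 4 CPUs of demand after the worker is gone, which is the kind of stale request that would prompt an autoscaler to add nodes, while the fixed one reports 0.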

Related issue number

Closes #25429

Checks

  • [ ] I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • [ ] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
  • [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [ ] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

fishbone, Dec 18 '22 06:12