[Core] Fix incorrect gpu ids if placement group bundle index specified
Why are these changes needed?
This PR fix incorrect CUDA_VISIBLE_DEVICES when placement_group_bundle_index is specified.
Related issue number
Closes https://github.com/ray-project/ray/issues/29811
Checks
- [x] I've signed off every commit(by using the -s flag, i.e.,
git commit -s) in this PR. - [x] I've run
scripts/format.shto lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
doc/source/tune/api/under the corresponding.rstfile.
- [ ] I've added any new APIs to the API Reference. For example, if I added a
method in Tune, I've added it in
- [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
- [x] Unit tests
- [ ] Release tests
- [ ] This PR is not tested :(
cc @jjyao
Oh actually I think this PR will cause issue if user uses both bundle index and non-bundle index in the same PG.
For example:
ray.init(num_gpus=2) # Two GPUs: 0, 1
pg = placement_group([{"GPU": 1}, {"GPU": 1}])
@ray.remote(num_gpus=1)
class Actor:
pass
actor1 = Actor.options(scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)).remote()
actor2 = Actor.options(scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg, bundle_index=0)).remote()
It's possible that both actors are assigned to GPU 0.
This is not an easy fix unfortunately given how placement group is implemented now.
Oh actually I think this PR will cause issue if user uses both bundle index and non-bundle index in the same PG.
@jjyao Yes, I can reproduce this problem, I will take a look how to fix it.
@wuxibin89 I think this is not easy to fix right now. This problem will be fixed automatically once we re-implemented PG using labels and virtual clusters (https://github.com/ray-project/enhancements/pull/49)
@jjyao Hi, is there any plan that re-implementing PG with labels and virtual clusters? It seems like there's a lot work to do and may not available in short term. We have a project(will open soon) that heavily relay on this fix to colocate multiple actor groups on the same PG.
So can we fix this first and give user a warning not to use bundle index and non-bundle index in the same PG?
cc @anyscalesam
@wuxibin89 grabbed some time to chat about this further as full fledged VC implementation will be difficult and perhaps overkill.
I think what would make more sense is for us to sit down for 15m and understand your use case well and than brainstorm on a solution that perhaps you can contribute back to.
I setup time for later this week. cc @jjyao