ray icon indicating copy to clipboard operation
ray copied to clipboard

[Core] Fix incorrect gpu ids if placement group bundle index specified

Open wuxibin89 opened this issue 2 years ago • 3 comments

Why are these changes needed?

This PR fix incorrect CUDA_VISIBLE_DEVICES when placement_group_bundle_index is specified.

Related issue number

Closes https://github.com/ray-project/ray/issues/29811

Checks

  • [x] I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • [x] I've run scripts/format.sh to lint the changes in this PR.
  • [ ] I've included any doc changes needed for https://docs.ray.io/en/master/.
    • [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
  • [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • [x] Unit tests
    • [ ] Release tests
    • [ ] This PR is not tested :(

cc @jjyao

wuxibin89 avatar Apr 01 '24 15:04 wuxibin89

Oh actually I think this PR will cause issue if user uses both bundle index and non-bundle index in the same PG.

For example:

ray.init(num_gpus=2) # Two GPUs: 0, 1
pg = placement_group([{"GPU": 1}, {"GPU": 1}])

@ray.remote(num_gpus=1)
class Actor:
  pass

actor1 = Actor.options(scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)).remote()
actor2 = Actor.options(scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg, bundle_index=0)).remote()

It's possible that both actors are assigned to GPU 0.

This is not an easy fix unfortunately given how placement group is implemented now.

jjyao avatar Apr 16 '24 21:04 jjyao

Oh actually I think this PR will cause issue if user uses both bundle index and non-bundle index in the same PG.

@jjyao Yes, I can reproduce this problem, I will take a look how to fix it.

wuxibin89 avatar Apr 22 '24 07:04 wuxibin89

@wuxibin89 I think this is not easy to fix right now. This problem will be fixed automatically once we re-implemented PG using labels and virtual clusters (https://github.com/ray-project/enhancements/pull/49)

jjyao avatar Apr 29 '24 21:04 jjyao

@jjyao Hi, is there any plan that re-implementing PG with labels and virtual clusters? It seems like there's a lot work to do and may not available in short term. We have a project(will open soon) that heavily relay on this fix to colocate multiple actor groups on the same PG.

So can we fix this first and give user a warning not to use bundle index and non-bundle index in the same PG?

cc @anyscalesam

wuxibin89 avatar May 22 '24 02:05 wuxibin89

@wuxibin89 grabbed some time to chat about this further as full fledged VC implementation will be difficult and perhaps overkill.

I think what would make more sense is for us to sit down for 15m and understand your use case well and than brainstorm on a solution that perhaps you can contribute back to.

I setup time for later this week. cc @jjyao

anyscalesam avatar Jul 15 '24 20:07 anyscalesam