[torchx/ray] Is elastic training on ray clusters supported?
🐛 Bug
Hi, I would like to know the current state of running elastic training on Ray clusters.
I tried to repeat some experiments (notebook) from this blog on my Ray cluster, but I got unexpected behavior.
- I EXPECT that when using a custom component and the cluster has fewer available nodes than the job requested, the submitted job keeps running on the current nodes, and when new nodes become available they can join the training process. What I OBSERVED is that the job failed with the error below:

  ```
  TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'memory': 18038862642.0, 'CPU': 8.0, 'node:10.130.6.66': 0.999, 'object_store_memory': 15071908982.0, 'GPU': 1.0, 'node:10.130.6.67': 1.0}, resources requested by the placement group: [{'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}]
  ```

- When using the built-in `dist.ddp` component, even if there are enough compute resources, the Ray job status always shows succeeded, but the expected output never appears in the Ray job logs; the only information in the log is `Waiting for placement group to start.`
- When using a custom component and the cluster has the required resources, the submitted job writes the expected information to the log file, but the job never stops. The Ray job status always shows:

  ```
  Status for job 'raysubmit_kqtEAYVSmx4c1XgD': RUNNING
  Status message: Job is currently running.
  ```
Question
Module (check all that apply):
- [ ] torchx.spec
- [x] torchx.component
- [ ] torchx.apps
- [ ] torchx.runtime
- [ ] torchx.cli
- [x] torchx.schedulers
- [ ] torchx.pipelines
- [ ] torchx.aws
- [ ] torchx.examples
- [ ] other
To Reproduce
I tried two ways to launch a TorchX job on Ray:

```shell
# Use custom component; the required resources are defined in component.py.
# -s ray selects the Ray scheduler, -cfg passes Ray scheduler arguments.
torchx run -s ray \
    -cfg dashboard_address=addr-of-cluster:8265,working_dir=. \
    component.py:trainer
```

```shell
# Use the built-in dist.ddp component.
# -j 4x1 sets nnodes x nproc_per_node; compute_world_size.py is a distributed script.
torchx run -s ray \
    -cfg dashboard_address=addr-of-cluster:8265,working_dir=. \
    dist.ddp \
    -j 4x1 \
    --script ./compute_world_size.py
```
A detailed description of the command is here.
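For reference, the `-j` flag of `dist.ddp` takes a `{nnodes}x{nproc_per_node}` spec, so `-j 4x1` requests 4 nodes with 1 process each (world size 4, which matches the log output below). A minimal sketch of how such a spec can be interpreted (`parse_j` is a hypothetical helper for illustration, not TorchX's actual parser):

```python
def parse_j(spec: str) -> tuple:
    """Parse a '{nnodes}x{nproc_per_node}' job spec, e.g. '4x1'.

    Hypothetical helper for illustration; TorchX does its own parsing.
    """
    nnodes_str, nproc_str = spec.split("x")
    return int(nnodes_str), int(nproc_str)

nnodes, nproc_per_node = parse_j("4x1")
print(nnodes, nproc_per_node)   # 4 1
print(nnodes * nproc_per_node)  # world size: 4
```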
The provisioned Ray cluster:

```json
{
  "headCPU": "4",
  "headGPU": "0",
  "headMemory": "12Gi",
  "headMaxMemory": "24Gi",
  "workerMinCount": 1,
  "workerMaxCount": 4,
  "workerCPU": "4",
  "workerGPU": "0",
  "workerMemory": "12Gi",
  "workerMaxMemory": "24Gi"
}
```
Performed the following experiments:

- (Autoscaling) To test whether TorchX triggers the Ray autoscaler to provision more nodes than the minimum, I launched a job that requires 4 nodes. The results are listed below.
  - Custom component:
    - Ray job status:

      ```
      Status for job 'raysubmit_kqtEAYVSmx4c1XgD': RUNNING
      Status message: Job is currently running.
      ```

    - Ray job logs:

      ```
      Waiting for placement group to start.
      (scheduler +1s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
      (scheduler +1s) Adding 3 nodes of type worker_node.
      (scheduler +21s) Resized to 20 CPUs, 4 GPUs.
      (CommandActor pid=223, ip=10.130.6.73) initializing `gloo` process group
      (CommandActor pid=223, ip=10.130.6.73) successfully initialized process group
      (CommandActor pid=223, ip=10.130.6.73) rank: 3, actual world_size: 4, computed world_size: 4
      (CommandActor pid=221, ip=10.131.6.32) initializing `gloo` process group
      (CommandActor pid=221, ip=10.131.6.32) successfully initialized process group
      (CommandActor pid=221, ip=10.131.6.32) rank: 1, actual world_size: 4, computed world_size: 4
      (CommandActor pid=222, ip=10.130.6.74) initializing `gloo` process group
      (CommandActor pid=222, ip=10.130.6.74) successfully initialized process group
      (CommandActor pid=222, ip=10.130.6.74) rank: 0, actual world_size: 4, computed world_size: 4
      (CommandActor pid=225, ip=10.131.6.30) initializing `gloo` process group
      (CommandActor pid=225, ip=10.131.6.30) successfully initialized process group
      (CommandActor pid=225, ip=10.131.6.30) rank: 2, actual world_size: 4, computed world_size: 4
      ```

    - Comment: The job executes correctly and the expected output appears in the log; however, the job status stays stuck at `RUNNING` even after the job has finished, and the compute resources are not released either. I had to restart the cluster to submit new jobs.
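As a possible workaround for the stuck job (untested here), Ray's job CLI can stop a submitted job without restarting the whole cluster; the job ID is the one from the status output, and the address is the same dashboard address passed to `-cfg`:

```shell
# Ask the Ray cluster to stop the stuck job; this should tear down its
# actors and release the placement group's resources.
ray job stop raysubmit_kqtEAYVSmx4c1XgD --address http://addr-of-cluster:8265
```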
  - Built-in `dist.ddp`:
    - Ray job status:

      ```
      ------------------------------------------
      Job 'raysubmit_EtGmNBAYVKrATdUj' succeeded
      ------------------------------------------
      ```

    - Ray job logs:

      ```
      Waiting for placement group to start.
      ```

    - Comment: The Ray job status shows the job succeeded, but the log is stuck at waiting for the placement group to start. Checking `ray.nodes()` shows the autoscaler did not kick in; there is still only 1 active worker node.
Expected behavior
TorchX supports elastic training on Ray clusters with the following features:
- elasticity
- autoscaling with the Ray autoscaler
- fault tolerance
Environment
- torchx version (e.g. 0.1.0rc1): 0.1.2
- Python version: 3.9
- OS (e.g., Linux): Linux
- How you installed torchx (conda, pip, source, docker): pip
- Docker image and tag (if using docker):
- Git commit (if installed from source):
- Execution environment (on-prem, AWS, GCP, Azure etc):
- Any other relevant information:
Additional context
```
TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'memory': 18038862642.0, 'CPU': 8.0, 'node:10.130.6.66': 0.999, 'object_store_memory': 15071908982.0, 'GPU': 1.0, 'node:10.130.6.67': 1.0}, resources requested by the placement group: [{'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}]
```
This error shows it's waiting for 5 workers, but the cluster config you show (`"workerMaxCount": 4`) only has 4 workers. I'm wondering if that's the root cause here. Can you try running with torchx-nightly? There have been some fixes to the Ray scheduler since the last 0.1.2 release. We're also planning on releasing 0.2.0 today, which should include the Ray fixes.
In terms of elasticity -- dist.ddp isn't configured for elasticity out of the box. Torchelastic (the agent used in dist.ddp) does support this, but we haven't tested elasticity and instead always use gang scheduling semantics in our schedulers. It's something we'd like to support but haven't had any external users request yet.
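For context, the elasticity that torchelastic supports is expressed through `torchrun`'s rendezvous flags, where `--nnodes` takes a min:max range rather than a fixed gang size. A hedged example (the endpoint, rendezvous ID, and script name are placeholders):

```shell
# Elastic launch: anywhere from 1 to 4 nodes may participate, and the
# agent restarts workers up to 3 times on node membership changes.
torchrun \
    --nnodes=1:4 \
    --nproc-per-node=1 \
    --max-restarts=3 \
    --rdzv-backend=c10d \
    --rdzv-endpoint=head-node-addr:29400 \
    --rdzv-id=example-job \
    ./compute_world_size.py
```

This is the behavior gang scheduling bypasses: with a fixed `-j 4x1` request, all four placement group bundles must be satisfied before anything runs.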
Thanks, I will try torchx-nightly. Good to know dist.ddp is not configured for elasticity out of the box.
@d4l3k May I ask whether you have a plan for supporting elastic training on Ray? I am looking to contribute to this feature.
We don't have any current plans in this area, but I'm happy to work with you to add it. If you're interested in contributing, it might be good to have a sync-up call/chat.
@d4l3k Yes, I am interested, let's find a time to sync up. Is Thursday next week a good time for you?
Yeah I can do Thursday -- feel free to throw something on my calendar at [email protected]
@ljjsalt sent you an invite to the PT slack