
[torchx/ray] Is elastic training on ray clusters supported?

Open ntlm1686 opened this issue 3 years ago • 7 comments

🐛 Bug

Hi, I would like to know the current state of running elastic training on ray clusters.

I tried to repeat some of the experiments (notebook) from this blog on my Ray cluster, but I got unexpected behavior.

  • I EXPECT that when I use a custom component and the cluster has fewer available nodes than the job requested, the submitted job keeps running on the current nodes, and when new nodes become available they can join the training process. What I OBSERVED instead is that the job failed with the error below:
    TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'memory': 18038862642.0, 'CPU': 8.0, 'node:10.130.6.66': 0.999, 'object_store_memory': 15071908982.0, 'GPU': 1.0, 'node:10.130.6.67': 1.0}, resources requested by the placement group: [{'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}]
    
  • When I use the built-in dist.ddp component, even if there are enough compute resources, the Ray job status always shows succeeded, but the expected output never appears in the Ray job logs; the only information in the log is
    Waiting for placement group to start.
    
  • When I use a custom component and the cluster has the required resources, the submitted job writes the expected information to the log file, but the job never stops; when I check the Ray job status, it always shows
    Status for job 'raysubmit_kqtEAYVSmx4c1XgD': RUNNING
    Status message: Job is currently running.
    

Question

Module (check all that apply):

  • [ ] torchx.spec
  • [x] torchx.component
  • [ ] torchx.apps
  • [ ] torchx.runtime
  • [ ] torchx.cli
  • [x] torchx.schedulers
  • [ ] torchx.pipelines
  • [ ] torchx.aws
  • [ ] torchx.examples
  • [ ] other

To Reproduce

I tried two ways to launch a TorchX job on ray:

# Use the custom component with the Ray scheduler (-s ray).
# -cfg passes the Ray scheduler arguments (dashboard address and working dir);
# the required resources are defined in component.py (a sketch of such a
# component is shown below).
torchx run -s ray \
    -cfg dashboard_address=addr-of-cluster:8265,working_dir=. \
    component.py:trainer

# Use the built-in dist.ddp component with the Ray scheduler.
# -j 4x1 means 4 nodes with 1 process per node; --script points to the
# distributed training script.
torchx run -s ray \
    -cfg dashboard_address=addr-of-cluster:8265,working_dir=. \
    dist.ddp \
    -j 4x1 \
    --script ./compute_world_size.py

A detailed description of the command is here.
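
For reference, here is a minimal sketch of what such a custom component can look like. The names, image, entrypoint, and resource numbers are illustrative placeholders, not the exact contents of my component.py:

```python
# component.py -- illustrative sketch of a custom trainer component.
import torchx.specs as specs

def trainer(nnodes: int = 4, script: str = "compute_world_size.py") -> specs.AppDef:
    return specs.AppDef(
        name="trainer",
        roles=[
            specs.Role(
                name="worker",
                image=".",  # working_dir is shipped to the cluster by the Ray scheduler
                entrypoint="python",
                args=[script],
                num_replicas=nnodes,  # one replica per node
                resource=specs.Resource(cpu=2, gpu=0, memMB=4096),
            )
        ],
    )
```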

The provisioned ray cluster:

"headCPU": "4",
"headGPU": "0",
"headMemory": "12Gi",
"headMaxMemory": "24Gi", 
"workerMinCount": 1, 
"workerMaxCount": 4,
"workerCPU": "4",
"workerGPU": "0",
"workerMemory": "12Gi",
"workerMaxMemory": "24Gi"

Performed following experiments:

  • (Autoscaling) To test whether torchx will trigger the Ray autoscaler to provision more than the minimum number of nodes, I launched a job that requires 4 nodes. The results are listed below:

    • Custom component:

      • Ray job status:

        Status for job 'raysubmit_kqtEAYVSmx4c1XgD': RUNNING
        Status message: Job is currently running.
        
      • Ray job logs:

        Waiting for placement group to start.
        (scheduler +1s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
        (scheduler +1s) Adding 3 nodes of type worker_node.
        (scheduler +21s) Resized to 20 CPUs, 4 GPUs.
        (CommandActor pid=223, ip=10.130.6.73) initializing `gloo` process group
        (CommandActor pid=223, ip=10.130.6.73) successfully initialized process group
        (CommandActor pid=223, ip=10.130.6.73) rank: 3, actual world_size: 4, computed world_size: 4
        (CommandActor pid=221, ip=10.131.6.32) initializing `gloo` process group
        (CommandActor pid=221, ip=10.131.6.32) successfully initialized process group
        (CommandActor pid=221, ip=10.131.6.32) rank: 1, actual world_size: 4, computed world_size: 4
        (CommandActor pid=222, ip=10.130.6.74) initializing `gloo` process group
        (CommandActor pid=222, ip=10.130.6.74) successfully initialized process group
        (CommandActor pid=222, ip=10.130.6.74) rank: 0, actual world_size: 4, computed world_size: 4
        (CommandActor pid=225, ip=10.131.6.30) initializing `gloo` process group
        (CommandActor pid=225, ip=10.131.6.30) successfully initialized process group
        (CommandActor pid=225, ip=10.131.6.30) rank: 2, actual world_size: 4, computed world_size: 4
        
      • Comment: The job executes correctly and the expected output appears in the log; however, the job status stays stuck at RUNNING even after the job has finished, and the compute resources are not released either. I have to restart the cluster to submit new jobs (see the cleanup sketch after the experiment list).

    • Built-in dist.ddp:

      • Ray job status:

        ------------------------------------------
        Job 'raysubmit_EtGmNBAYVKrATdUj' succeeded
        ------------------------------------------
        
      • Ray job logs:

        Waiting for placement group to start.
        
      • Comment: The Ray job status shows the job succeeded, but the log is stuck at waiting for the placement group to start. Checking ray.nodes() shows the autoscaler did not add any nodes; there is still only 1 active worker node (a quick way to check this is sketched below).
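
For completeness, this is roughly how I inspect the cluster and clean up a stuck submission without restarting the whole cluster. It is a sketch built on standard Ray APIs; the dashboard address and submission ID are placeholders copied from the examples above:

```python
import ray
from ray.job_submission import JobSubmissionClient

# Connect to the running cluster (run on or against the head node).
ray.init(address="auto")
alive = [n for n in ray.nodes() if n["Alive"]]
print(f"alive nodes: {len(alive)}")  # stays at head + 1 worker when the autoscaler does nothing

# Check and stop the submission that is stuck at RUNNING.
client = JobSubmissionClient("http://addr-of-cluster:8265")
print(client.get_job_status("raysubmit_kqtEAYVSmx4c1XgD"))
client.stop_job("raysubmit_kqtEAYVSmx4c1XgD")  # releases the job's resources
```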

Expected behavior

torchx supports elastic training on Ray clusters, with the following features:

  1. elasticity
  2. autoscaling via the Ray autoscaler
  3. fault tolerance

Environment

  • torchx version (e.g. 0.1.0rc1): 0.1.2
  • Python version: 3.9
  • OS (e.g., Linux): Linux
  • How you installed torchx (conda, pip, source, docker): pip
  • Docker image and tag (if using docker):
  • Git commit (if installed from source):
  • Execution environment (on-prem, AWS, GCP, Azure etc):
  • Any other relevant information:

Additional context

ntlm1686 · Jun 15 '22 18:06

TimeoutError: Placement group creation timed out. Make sure your cluster either has enough resources or use an autoscaling cluster. Current resources available: {'memory': 18038862642.0, 'CPU': 8.0, 'node:10.130.6.66': 0.999, 'object_store_memory': 15071908982.0, 'GPU': 1.0, 'node:10.130.6.67': 1.0}, resources requested by the placement group: [{'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}, {'CPU': 2.0}]

This error shows it's waiting for 5 workers, but the cluster config you show ("workerMaxCount": 4) only allows 4 workers. I'm wondering if that's the root cause here. Can you try running with torchx-nightly? There have been some fixes to the Ray scheduler since the 0.1.2 release. We're also planning on releasing 0.2.0 today, which should include the Ray fixes.
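
For context, the Ray scheduler waits on a placement group with one bundle per replica, roughly like the simplified sketch below. This is not the exact TorchX code; the bundle sizes mirror the error message above:

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")  # connect to the existing cluster

# One CPU:2 bundle per replica -- five in total, as in the quoted error.
pg = placement_group([{"CPU": 2}] * 5)

# If the cluster (even after autoscaling) cannot host all five bundles,
# this wait times out with "Placement group creation timed out".
ray.get(pg.ready(), timeout=100)
```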

In terms of elasticity -- dist.ddp isn't configured for elasticity out of the box. Torchelastic (the agent used by dist.ddp) does support it, but we haven't tested elasticity and instead always use gang scheduling semantics in our schedulers. It's something that we'd like to support but haven't had any external users request yet.
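
For reference, torchelastic can already run with a node range when launched directly (e.g. `torchrun --nnodes=1:4 ...`). Below is a rough sketch of the equivalent Python API; it is illustrative only, not what dist.ddp generates today, and the rendezvous endpoint and run id are placeholders:

```python
from torch.distributed.launcher.api import LaunchConfig, elastic_launch

def train():
    # regular DDP training code would go here
    pass

if __name__ == "__main__":
    config = LaunchConfig(
        min_nodes=1,              # keep training while at least one node is up
        max_nodes=4,              # let up to four nodes join as they become available
        nproc_per_node=1,
        rdzv_backend="c10d",
        rdzv_endpoint="head-node:29400",  # placeholder rendezvous endpoint
        max_restarts=3,
        run_id="elastic-demo",
    )
    elastic_launch(config, train)()
```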

d4l3k · Jun 15 '22 20:06

Thanks, I will try torchx-nightly. Good to know dist.ddp is not configured for elasticity out of the box.

ntlm1686 · Jun 16 '22 13:06

@d4l3k May I ask whether you have a plan to support elastic training on Ray? I am looking to contribute to this feature.

ntlm1686 · Jun 16 '22 19:06

We don't have any current plans in this area, but I'm happy to work with you to add it. If you're interested in contributing, it might be good to have a sync-up call/chat.

d4l3k · Jun 17 '22 00:06

@d4l3k Yes, I am interested, let's find a time to sync up. Is Thursday next week a good time for you?

ntlm1686 · Jun 22 '22 13:06

Yeah I can do Thursday -- feel free to throw something on my calendar at [email protected]

d4l3k · Jun 22 '22 21:06

@ljjsalt sent you an invite to the PT slack

d4l3k · Jun 22 '22 21:06