
multi-node training does not launch on supercomputer cluster

Michael-H777 opened this issue 2 years ago • 4 comments

System Info

accelerate 0.23.0 
pytorch 2.0.1

working on my school's supercomputer cluster:
Linux version 4.18.0-477.27.1.el8_8.x86_64 ([email protected]) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-18) (GCC)) #1 SMP Thu Aug 31 10:29:22 EDT 2023

Information

  • [ ] The official example scripts
  • [X] My own modified scripts

Tasks

  • [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported no_trainer script in the examples folder of the transformers repo (such as run_no_trainer_glue.py)
  • [X] My own task or dataset (give details below)

Reproduction

Each node has 2 A100 cards. I launch training with the command below:

accelerate launch --config_file gpu8.yaml main.py

gpu8.yaml:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_process_ip: 10.6.1.10 (example)
main_process_port: 43960
main_training_function: main
mixed_precision: bf16
num_machines: 4
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

The training does not start, and the log file is empty.

The codebase has been debugged and always runs fine with 1 or 2 cards on a single node. Is the YAML file wrong? And how does accelerate know where the rest of the machines are?

Expected behavior

Training should start, and each process should produce identifiable output in the log file.

Michael-H777 avatar Dec 17 '23 16:12 Michael-H777

Are you running the launch command on each system?

(and modifying the config file on each)
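
For example, the same command runs once on every machine, each pointing at a config that differs only in machine_rank (gpu8_r0.yaml / gpu8_r1.yaml are illustrative copies of gpu8.yaml):

# on the main machine (machine_rank: 0)
accelerate launch --config_file gpu8_r0.yaml main.py

# on the second machine (machine_rank: 1, same main_process_ip and port)
accelerate launch --config_file gpu8_r1.yaml main.py

# ...and likewise on machines 2 and 3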

muellerzr avatar Dec 17 '23 16:12 muellerzr

Hi,

I just won the prize for dumbest question asked in the 21st century. I've never trained with multiple nodes before, thanks for pointing that out haha.

I will use a script to create multiple configs, assigning each node the appropriate command and config file.

Michael

Michael-H777 avatar Dec 22 '23 20:12 Michael-H777

Hi, sorry to reopen this. I modified my procedure, but it's still not working. I am trying to use 4 nodes, each with 2 GPUs.

I use a submit.sh file to submit the job, specifying all the parameters needed. Within submit.sh, a Python script is executed to parse the assigned nodes, create all the required YAML files, and generate a burst_command.sh file with the content below:

srun --nodelist=p-gc-3007 accelerate launch --config_file 8gpus_r0.yaml main.py
srun --nodelist=p-gc-3016 accelerate launch --config_file 8gpus_r1.yaml main.py
srun --nodelist=p-gc-3008 accelerate launch --config_file 8gpus_r2.yaml main.py
srun --nodelist=p-gc-3009 accelerate launch --config_file 8gpus_r3.yaml main.py

submit.sh then runs burst_command.sh, launching all the workers on the different nodes.

This is the script I use to start the distributed training, and 8gpus_r0.yaml looks like this:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_process_ip: 10.6.80.11
main_process_port: 30000
main_training_function: main
mixed_precision: bf16
num_machines: 4
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

where machine_rank is changed in each file to reflect that node's rank.
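
For example, the only line that should differ across the four files (an illustrative check; output shown as comments):

grep machine_rank 8gpus_r*.yaml
# 8gpus_r0.yaml:machine_rank: 0
# 8gpus_r1.yaml:machine_rank: 1
# 8gpus_r2.yaml:machine_rank: 2
# 8gpus_r3.yaml:machine_rank: 3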

The rank-0 machine fails with this error:

[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:30000 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:30000 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    args.func(args)
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
    distrib_run.run(args)
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:30000 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:30000 (errno: 98 - Address already in use).

The rest of the ranks fail with this error:

Traceback (most recent call last):
Traceback (most recent call last):
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/bin/accelerate", line 8, in <module>
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/bin/accelerate", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
    sys.exit(main())
             ^^^^^^
    args.func(args)
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    args.func(args)
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/launch.py", line 977, in launch_command
    multi_gpu_launcher(args)
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/launch.py", line 636, in multi_gpu_launcher
    multi_gpu_launcher(args)
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/launch.py", line 636, in multi_gpu_launcher
    current_env = prepare_multi_gpu_env(args)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/utils/launch.py", line 131, in prepare_multi_gpu_env
    current_env = prepare_multi_gpu_env(args)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/utils/launch.py", line 131, in prepare_multi_gpu_env
    raise ConnectionError(
ConnectionError: Tried to launch distributed communication on port `30000`, but another process is utilizing it. Please specify a different port (such as using the `----main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.
    raise ConnectionError(
ConnectionError: Tried to launch distributed communication on port `30000`, but another process is utilizing it. Please specify a different port (such as using the `----main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.

This looks like port 30000 not being free, but I use a Python function to find a free port:

import socket
from contextlib import closing

def find_free_port(ip, start_port, end_port):
    # Scan the range and return the first port this machine can bind.
    for port in range(start_port, end_port + 1):
        with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
            try:
                s.bind((ip, port))
                # If successful, return this port.
                return port
            except socket.error:
                # This port is already in use, try the next one.
                continue
    raise RuntimeError(f"No free port found in the range {start_port}-{end_port}")

I suspected that the port is not freed quickly enough for accelerate, so I also tried a fixed port, 29500, without the function; that didn't work. I also tried adding sleep 10 at the beginning of burst_command.sh; that didn't work either.

Please let me know what else I should provide, and how to resolve this. Is this an accelerate issue, or a cluster issue?

Michael-H777 avatar Feb 17 '24 18:02 Michael-H777

@muellerzr can you give some suggestions when you get a chance?

Michael-H777 avatar Feb 26 '24 23:02 Michael-H777

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Mar 22 '24 15:03 github-actions[bot]

Hi @Michael-H777, I'm facing a similar issue: I'm also having trouble using accelerate for multi-node training. I'm running my code on my school's supercomputer cluster and don't have experience with multi-node training. Can you tell me how you assigned each node its command and config file?

RoozbehNahavandi avatar Oct 24 '24 16:10 RoozbehNahavandi

@RoozbehNahavandi

PARTIAL SOLUTION!!!

I was not able to get multi-node to work; I kept having issues with socket binding. The approach below assigns each node a specific command and .yml file, but the socket kept failing. I left the university before I could resolve this.

I added an additional script at the beginning of the launch script.

The additional script parses the assigned nodes, creates a separate .yml file for each node, then generates a secondary .sh file with a pre-determined name, where each node gets its own .yml in its respective command line.

=============
content of launch script: 
#slurm configs xxxx
#slurm configs xxxx

python script_to_parse_nodes_and_create_new_command.py 

bash burst_command.sh 
=============
content of burst_command.sh
srun --nodelist=p-gc-{node1} accelerate launch --config_file 8gpus_node1.yaml main.py
srun --nodelist=p-gc-{node2} accelerate launch --config_file 8gpus_node2.yaml main.py
srun --nodelist=p-gc-{node3} accelerate launch --config_file 8gpus_node3.yaml main.py
srun --nodelist=p-gc-{node4} accelerate launch --config_file 8gpus_node4.yaml main.py
=============
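
One caveat I never got to verify: each srun above blocks until its job step exits, so as written the later nodes only launch after the earlier ones finish. A backgrounded variant (a sketch; the --nodes/--ntasks flags pin one task per step) would be:

srun --nodes=1 --ntasks=1 --nodelist=p-gc-{node1} accelerate launch --config_file 8gpus_node1.yaml main.py &
srun --nodes=1 --ntasks=1 --nodelist=p-gc-{node2} accelerate launch --config_file 8gpus_node2.yaml main.py &
srun --nodes=1 --ntasks=1 --nodelist=p-gc-{node3} accelerate launch --config_file 8gpus_node3.yaml main.py &
srun --nodes=1 --ntasks=1 --nodelist=p-gc-{node4} accelerate launch --config_file 8gpus_node4.yaml main.py &
wait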

content of the script

import re
import socket
import subprocess
from contextlib import closing

import yaml

"""
$ echo $SLURM_JOB_NODELIST
p-gc-[2131-2132]

$ nslookup p-gc-2131
Server: 10.6.80.11
Address: 10.6.80.11#53

Name: p-gc-2131.2e.xxx.xxx.xxx
Address: 10.6.0.141
"""


def find_free_port(ip, start_port, end_port):
    for port in range(start_port, end_port + 1):
        with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
            try:
                s.bind((ip, port))
                # If successful, return this port.
                return port
            except socket.error:
                # This port is already in use, try the next one.
                continue
    raise RuntimeError(f"No free port found in the range {start_port}-{end_port}")


def main():
    with open("8gpus.yaml", "r") as file:
        yaml_config = yaml.safe_load(file)

    single_regex = re.compile(r"[^-](\d{4})[^-]")
    range_regex = re.compile(r"(\d{4}-\d{4})")
    # e.g. p-gc-[2131-2132]
    node_names = subprocess.run("echo $SLURM_JOB_NODELIST", stdout=subprocess.PIPE, shell=True, text=True).stdout
    print(f"output of [echo $SLURM_JOB_NODELIST]: {node_names}")
    single_nodes = [int(each) for each in single_regex.findall(node_names)]
    range_nodes = [each.split("-") for each in range_regex.findall(node_names)]
    node_names = single_nodes.copy()
    for start, end in range_nodes:
        node_names.extend(range(int(start), int(end) + 1))
    print(f"parsed node list: {node_names}")
    """
    output of [echo $SLURM_JOB_NODELIST]: p-gc-[3007-3009,3016]
    parsed node list: [3016, 3007, 3008, 3009]
    """

    hostname = subprocess.run("hostname", stdout=subprocess.PIPE, shell=True, text=True).stdout
    hostname = int(hostname.split("-")[-1].strip())
    print(f"hostname: {hostname}")

    node_names.remove(hostname)
    node_names.insert(0, hostname)

    head_node = node_names[0]
    head_ip = subprocess.run(f"nslookup p-gc-{head_node}", stdout=subprocess.PIPE, shell=True, text=True).stdout
    # take the last "Address:" line (the resolved node, e.g. 10.6.0.141 in the
    # sample output below), not the first line, which is the DNS server answering
    address_lines = [line for line in head_ip.strip().split("\n") if line.startswith("Address")]
    head_ip = address_lines[-1].split(":")[1].strip()
    print(f"{head_node} has ip: {head_ip}")
    """
    $ nslookup p-gc-2131
    Server: 10.6.80.11
    Address: 10.6.80.11#53

    Name: p-gc-2131.2e.xxx.xxx.xxx
    Address: 10.6.0.141
    """

    yaml_config["main_process_ip"] = head_ip
    yaml_config["main_process_port"] = 6000

    for rank in range(4):
        with open(f"8gpus{rank}.yaml", "w") as file:
            yaml_config["machine_rank"] = rank
            yaml.dump(yaml_config, file)

    bash_template = "srun --nodelist=p-gc-{} {}"
    command_template = "accelerate launch --config_file 8gpus{}.yaml main.py"

    commands = [
        # e.g. srun --nodelist=p-gc-2131 accelerate launch --config_file 8gpus0.yaml main.py
        bash_template.format(node, command_template.format(rank))
        for rank, node in enumerate(node_names)
    ]
    # pause between launching rank 0 and the remaining ranks
    commands.insert(1, "sleep 20")
    with open("burst_command.sh", "w") as fileout:
        fileout.write('echo "reached burst_command.sh launch command"\n')
        [fileout.write(each + "\n") for each in commands]


if __name__ == "__main__":
    main()

edit: depending on which manager your school's cluster uses, you need to go over its documentation and find the appropriate command and method to assign individual commands to each node.
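
On SLURM specifically, one thing worth trying (a sketch I did not get to test, along the lines of the multi-node SLURM examples in the accelerate repo; adjust the IP lookup for your network) is a single srun with the c10d rendezvous backend, which assigns node ranks by itself and removes the per-node .yml bookkeeping:

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=2

# rendezvous host: first node of the allocation
head_node=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
head_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

# one task per node; each task spawns 2 GPU workers
srun accelerate launch \
    --multi_gpu \
    --num_machines "$SLURM_NNODES" \
    --num_processes $((SLURM_NNODES * 2)) \
    --rdzv_backend c10d \
    --main_process_ip "$head_ip" \
    --main_process_port 29500 \
    --mixed_precision bf16 \
    main.py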

If you were able to resolve the socket issue, please, please comment on this thread and tell me how you did it. Thanks!

Michael-H777 avatar Oct 24 '24 17:10 Michael-H777