multi-node training does not launch on supercomputer cluster
System Info
accelerate 0.23.0
pytorch 2.0.1
Running on my school's supercomputer cluster:
Linux version 4.18.0-477.27.1.el8_8.x86_64 ([email protected]) (gcc version 8.5.0 20210514 (Red Hat 8.5.0-18) (GCC)) #1 SMP Thu Aug 31 10:29:22 EDT 2023
Information
- [ ] The official example scripts
- [X] My own modified scripts
Tasks
- [ ] One of the scripts in the examples/ folder of Accelerate or an officially supported
`no_trainer` script in the `examples` folder of the `transformers` repo (such as `run_no_trainer_glue.py`)
- [X] My own task or dataset (give details below)
Reproduction
Each node has 2 A100 cards. I use the command below to launch training:
accelerate launch --config_file gpu8.yaml main.py
gpu8.yaml:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_process_ip: 10.6.1.10 (example)
main_process_port: 43960
main_training_function: main
mixed_precision: bf16
num_machines: 4
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
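(For context: num_processes here is the total number of processes across all machines, i.e. 4 machines × 2 GPUs = 8, and gpu_ids: 0,1 refers to the local GPUs on each node.)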
The training does not start and the log file is empty.
The codebase has been debugged and always runs fine with 1 or 2 cards on a single node. Is the yaml file wrong? And how does accelerate know where the rest of the machines are?
Expected behavior
Training should start, and each process should produce identifiable output in the log file.
Are you running the launch command on each system?
(and modifying the config file on each)
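Roughly, something like this (a sketch, reusing your gpu8.yaml and overriding only the rank on the command line with the --machine_rank flag):

# on the machine whose address is main_process_ip (rank 0)
accelerate launch --config_file gpu8.yaml --machine_rank 0 main.py
# on each of the other three machines, the same command with its own rank
accelerate launch --config_file gpu8.yaml --machine_rank 1 main.py
accelerate launch --config_file gpu8.yaml --machine_rank 2 main.py
accelerate launch --config_file gpu8.yaml --machine_rank 3 main.py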
Hi,
I just won the prize for dumbest question asked in the 21st century. I've never trained with multiple nodes before, thanks for pointing that out haha.
I will use a script to create multiple configs and assign each node the appropriate command and config file.
Michael
Hi, sorry to reopen this. I modified my procedure, but it's still not working. I am trying to use 4 nodes, each with 2 GPUs.
I use a submit.sh file to submit the job, specifying all the parameters needed. Within submit.sh, a Python script is executed to parse the assigned nodes, create all the required yaml files, and generate another file, burst_command.sh, with the content below:
srun --nodelist=p-gc-3007 accelerate launch --config_file 8gpus_r0.yaml main.py
srun --nodelist=p-gc-3016 accelerate launch --config_file 8gpus_r1.yaml main.py
srun --nodelist=p-gc-3008 accelerate launch --config_file 8gpus_r2.yaml main.py
srun --nodelist=p-gc-3009 accelerate launch --config_file 8gpus_r3.yaml main.py
submit.sh then runs burst_command.sh, launching the workers on their respective nodes.
This is how I start the distributed training. 8gpus_r0.yaml looks like this:
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: MULTI_GPU
downcast_bf16: 'no'
gpu_ids: 0,1
machine_rank: 0
main_process_ip: 10.6.80.11
main_process_port: 30000
main_training_function: main
mixed_precision: bf16
num_machines: 4
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
The only difference between the four yaml files is machine_rank, which is set to 0, 1, 2 and 3 for the respective nodes.
The rank-0 machine fails with this error:
[W socket.cpp:426] [c10d] The server socket has failed to bind to [::]:30000 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:30000 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
Traceback (most recent call last):
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
self._initialize_workers(self._worker_group)
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
self._rendezvous(worker_group)
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/agent/server/api.py", line 546, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The server socket has failed to listen on any local network address. The server socket has failed to bind to [::]:30000 (errno: 98 - Address already in use). The server socket has failed to bind to 0.0.0.0:30000 (errno: 98 - Address already in use).
The rest of the ranks fail with this error (each non-zero rank prints the same traceback; they were interleaved in the log):
Traceback (most recent call last):
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/commands/launch.py", line 636, in multi_gpu_launcher
current_env = prepare_multi_gpu_env(args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/storage/home/ybh5084/.conda/envs/richtsmeier_training/lib/python3.11/site-packages/accelerate/utils/launch.py", line 131, in prepare_multi_gpu_env
raise ConnectionError(
ConnectionError: Tried to launch distributed communication on port `30000`, but another process is utilizing it. Please specify a different port (such as using the `----main_process_port` flag or specifying a different `main_process_port` in your config file) and rerun your script. To automatically use the next open port (on a single node), you can set this to `0`.
This looks like port 30000 not being free, but I use a Python function to find a free port before writing the configs:
import socket
from contextlib import closing

def find_free_port(ip, start_port, end_port):
    for port in range(start_port, end_port + 1):
        with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
            try:
                s.bind((ip, port))
                # If successful, return this port.
                return port
            except socket.error:
                # This port is already in use, try the next one.
                continue
    raise RuntimeError(f"No free port found in the range {start_port}-{end_port}")
I suspected that the port is not immediately freed for accelerate, so I also tried a fixed port (29500, without the function); that didn't work. I also tried adding sleep 10 at the beginning of burst_command.sh; that didn't work either.
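A quick way to check whether anything is already listening on the port on the rank-0 node right before launching (a rough sanity check, assuming ss is available on the cluster) is:

ss -ltn | grep :30000

If that prints a line, something really is listening on the port at that moment.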
Please let me know what else I should provide and how to resolve this. Is this an accelerate issue, or is it a cluster issue?
@muellerzr can you give some suggestions when you get a chance?
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi @Michael-H777, I'm facing a similar issue (I'm also having trouble using accelerate for multi-node training). I'm running my code on my school's supercomputer cluster and don't have experience with multi-node training. Can you tell me how you assigned each node its command and config file?
@RoozbehNahavandi
PARTIAL SOLUTION!!!
I was not able to get multi-node training to work; I kept running into socket binding issues. The approach below assigns each node its own command and .yml file, but the socket problem remained. I left the university before I could resolve it.
I added an additional script at the beginning of the launch script.
That script parses the assigned nodes, creates a separate .yml file for each node, and then generates a secondary .sh file with a predetermined name, where each node gets its own .yml on its respective command line.
=============
content of launch script:
#slurm configs xxxx
#slurm configs xxxx
python script_to_parse_nodes_and_create_new_command.py
bash burst_command.sh
=============
content of burst_command.sh
srun --nodelist=p-gc-{node1} accelerate launch --config_file 8gpus_node1.yaml main.py
srun --nodelist=p-gc-{node2} accelerate launch --config_file 8gpus_node2.yaml main.py
srun --nodelist=p-gc-{node3} accelerate launch --config_file 8gpus_node3.yaml main.py
srun --nodelist=p-gc-{node4} accelerate launch --config_file 8gpus_node4.yaml main.py
=============
content of script_to_parse_nodes_and_create_new_command.py
import re
import socket
import subprocess
from contextlib import closing

import yaml

"""
$ echo $SLURM_JOB_NODELIST
p-gc-[2131-2132]
$ nslookup p-gc-2131
Server: 10.6.80.11
Address: 10.6.80.11#53
Name: p-gc-2131.2e.xxx.xxx.xxx
Address: 10.6.0.141
"""


def find_free_port(ip, start_port, end_port):
    for port in range(start_port, end_port + 1):
        with closing(socket.socket(socket.AF_INET, socket.SOCK_STREAM)) as s:
            try:
                s.bind((ip, port))
                # If successful, return this port.
                return port
            except socket.error:
                # This port is already in use, try the next one.
                continue
    raise RuntimeError(f"No free port found in the range {start_port}-{end_port}")


def main():
    # base config; per-rank copies are written out below
    with open("8gpus.yaml", "r") as file:
        yaml_config = yaml.safe_load(file)

    single_regex = re.compile(r"[^-](\d{4})[^-]")
    range_regex = re.compile(r"(\d{4}-\d{4})")

    # e.g. p-gc-[2131-2132]
    node_names = subprocess.run("echo $SLURM_JOB_NODELIST", stdout=subprocess.PIPE, shell=True, text=True).stdout
    print(f"output of [echo $SLURM_JOB_NODELIST]: {node_names}")

    single_nodes = [int(each) for each in single_regex.findall(node_names)]
    range_nodes = [each.split("-") for each in range_regex.findall(node_names)]
    node_names = single_nodes.copy()
    for start, end in range_nodes:
        node_names.extend(range(int(start), int(end) + 1))
    print(f"parsed node list: {node_names}")
    """
    output of [echo $SLURM_JOB_NODELIST]: p-gc-[3007-3009,3016]
    parsed node list: [3016, 3007, 3008, 3009]
    """

    # make the node this script runs on the head node (rank 0)
    hostname = subprocess.run("hostname", stdout=subprocess.PIPE, shell=True, text=True).stdout
    hostname = int(hostname.split("-")[-1].strip())
    print(f"hostname: {hostname}")
    node_names.remove(hostname)
    node_names.insert(0, hostname)

    head_node = node_names[0]
    head_ip = subprocess.run(f"nslookup p-gc-{head_node}", stdout=subprocess.PIPE, shell=True, text=True).stdout
    # the node's own IP is on the last "Address:" line of the nslookup output;
    # the first "Server:"/"Address:" pair is the DNS server, not the node itself
    head_ip = [line for line in head_ip.splitlines() if line.startswith("Address")][-1].split(":")[1].strip()
    print(f"{head_node} has ip: {head_ip}")
    """
    $ nslookup p-gc-2131
    Server: 10.6.80.11
    Address: 10.6.80.11#53
    Name: p-gc-2131.2e.xxx.xxx.xxx
    Address: 10.6.0.141
    """

    # write one config per rank; only machine_rank differs between them
    yaml_config["main_process_ip"] = head_ip
    yaml_config["main_process_port"] = 6000
    for rank in range(4):
        with open(f"8gpus{rank}.yaml", "w") as file:
            yaml_config["machine_rank"] = rank
            yaml.dump(yaml_config, file)

    bash_template = "srun --nodelist=p-gc-{} {}"
    command_template = "accelerate launch --config_file 8gpus{}.yaml main.py"
    commands = [
        # e.g. srun --nodelist=p-gc-2131 accelerate launch --config_file 8gpus0.yaml main.py
        bash_template.format(node, command_template.format(rank))
        for rank, node in enumerate(node_names)
    ]
    # pause for 20 seconds after the head-node command before the remaining ranks
    commands.insert(1, "sleep 20")

    with open("burst_command.sh", "w") as fileout:
        fileout.write('echo "reached burst_command.sh launch command"\n')
        for each in commands:
            fileout.write(each + "\n")


if __name__ == "__main__":
    main()
Edit: depending on which scheduler your school's cluster uses, you will need to go through its documentation to find the appropriate way to assign each node its own command and config file.
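If your cluster also uses Slurm, a variant I was not able to try myself would be to skip the generated yaml files entirely and pass everything on the command line, letting srun start one launcher per node and using SLURM_NODEID as the machine rank. A rough sketch (the sbatch options and the MASTER_ADDR / MASTER_PORT variable names are just examples; the accelerate launch flags are standard options):

#!/bin/bash
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=2

# first hostname in the allocation acts as the rendezvous host
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500

# one task per node; each task starts accelerate launch with its own machine_rank
srun bash -c 'accelerate launch \
    --multi_gpu \
    --num_machines 4 \
    --num_processes 8 \
    --machine_rank $SLURM_NODEID \
    --main_process_ip $MASTER_ADDR \
    --main_process_port $MASTER_PORT \
    --mixed_precision bf16 \
    main.py'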
If you were able to resolve the socket issue, please comment on this thread and tell me how you did it. Thanks!