Distributed training raises an error
Hi,
I am trying to run ROMP in distributed mode, following this script. Since the repository has no folder called core, I replaced it with romp. However, when I run the script, it fails with an error saying there is no file called train.py. How can I avoid this error?
Thanks
Thanks for the bug report. Please replace it with romp.train, as in https://github.com/Arthur151/ROMP/blob/master/scripts/train_distributed.sh
@Arthur151 I did try romp.train, but I still get the error. Here is what I get:
*****************************************
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
Traceback (most recent call last):
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 256, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/z/home/mahzad-khosh/env/romp/bin/python', '-u', 'romp.train', '--local_rank=3', '--GPUS=0,1,2,3', '--configs_yml=configs/v1_hrnet_3dpw_ft.yml', '--distributed_training=1']' returned non-zero exit status 2.
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
What is the reason?
Oh, the command you are using is different from the one in my repo. Also, please make sure that you run the code from the ROMP folder.
CUDA_VISIBLE_DEVICES=${GPUS} nohup python -u -m torch.distributed.launch --nproc_per_node=4 romp.train --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1 > '../log/'${TAB}'_'${DATASET}'_g'${GPUS}.log 2>&1 &
Your command drops the -m flag, which tells Python to look the target up as a module instead of opening it as a file.
Here is another way to do this. If you don't want to use nohup, the command has this format:
CUDA_VISIBLE_DEVICES=${GPUS} python -u torch.distributed.launch --nproc_per_node=4 /path/to/romp/train.py --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1
The key is to use the absolute path to the train.py file.
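To illustrate the difference between the three ways of pointing Python at train.py discussed above (a dotted name without -m, a dotted name with -m, and an absolute file path), here is a minimal sketch. It does not use ROMP or PyTorch at all: demo_pkg and run_three_ways are hypothetical names, and the package is a throwaway stand-in created in a temporary directory. The first form mirrors the `python -u romp.train` call that appears in the traceback.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_three_ways():
    """Run a throwaway package entry point in three ways and return the results."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical stand-in for the romp package
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "train.py").write_text("print('training entry point reached')\n")

        def run(*args):
            # cwd=tmp plays the role of "run the code from the ROMP folder".
            return subprocess.run(
                [sys.executable, *args], cwd=tmp, capture_output=True, text=True
            )

        # 1) Dotted name without -m: Python looks for a file literally
        #    named 'demo_pkg.train' and fails.
        as_file = run("demo_pkg.train")
        # 2) Dotted name with -m: Python resolves demo_pkg/train.py as a module.
        as_module = run("-m", "demo_pkg.train")
        # 3) Absolute path to the script file.
        as_path = run(str(pkg / "train.py"))
        return as_file, as_module, as_path


if __name__ == "__main__":
    for label, result in zip(("file", "module", "path"), run_three_ways()):
        print(label, "exit code:", result.returncode)
        print((result.stdout or result.stderr).strip())
```

Only the first form should fail, with the same "can't open file" message as in the log above.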
When I use
CUDA_VISIBLE_DEVICES=${GPUS} python -u torch.distributed.launch --nproc_per_node=4 /path/to/romp/train.py --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1
I get the following error: python: can't open file 'torch.distributed.launch': [Errno 2] No such file or directory
When I run with this command:
CUDA_VISIBLE_DEVICES=${GPUS} nohup python -u -m torch.distributed.launch --nproc_per_node=4 romp.train --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1
I get this error:
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
Traceback (most recent call last):
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 261, in <module>
main()
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 256, in main
raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/z/home/mahzad-khosh/env/romp/bin/python', '-u', 'romp.train', '--local_rank=3', '--GPUS=0,1,2,3', '--configs_yml=configs/v1_hrnet_3dpw_ft.yml', '--distributed_training=1']' returned non-zero exit status 2.
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
It seems that torch.distributed.launch has been deprecated in newer versions of PyTorch.
The latest versions use torchrun instead.
I have tested that this works:
CUDA_VISIBLE_DEVICES=${GPUS} nohup torchrun --nproc_per_node=4 -m romp.train --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1 > '../log/'${TAB}'_'${DATASET}'_g'${GPUS}.log 2>&1 &
@Arthur151 OK, I replaced torch.distributed.launch with torchrun. When I run my code, I get the following error:
Fatal Python error: init_import_site: Failed to import the site module
Python runtime state: initialized
Error processing line 1 of /z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/google_auth-2.6.2-py3.10-nspkg.pth:
Traceback (most recent call last):
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 169, in addpackage
exec(line)
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'types'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 580, in <module>
main()
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 567, in main
known_paths = addsitepackages(known_paths)
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 350, in addsitepackages
addsitedir(sitedir, known_paths)
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 208, in addsitedir
addpackage(sitedir, name, known_paths)
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 179, in addpackage
import traceback
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/traceback.py", line 5, in <module>
import linecache
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/linecache.py", line 11, in <module>
import tokenize
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/tokenize.py", line 32, in <module>
import re
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/re.py", line 124, in <module>
import enum
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/enum.py", line 2, in <module>
from types import MappingProxyType, DynamicClassAttribute
ModuleNotFoundError: No module named 'types'
Error processing line 1 of /z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/google_auth-2.6.2-py3.10-nspkg.pth:
Fatal Python error: init_import_site: Failed to import the site module
Python runtime state: initialized
Traceback (most recent call last):
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 169, in addpackage
exec(line)
File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'types'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 580, in <module>
main()
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 567, in main
known_paths = addsitepackages(known_paths)
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 350, in addsitepackages
addsitedir(sitedir, known_paths)
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 208, in addsitedir
addpackage(sitedir, name, known_paths)
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 179, in addpackage
import traceback
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/traceback.py", line 5, in <module>
import linecache
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/linecache.py", line 11, in <module>
import tokenize
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/tokenize.py", line 32, in <module>
import re
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/re.py", line 124, in <module>
import enum
File "/z/home/mahzad-khosh/env/romp/lib/python3.8/enum.py", line 2, in <module>
from types import MappingProxyType, DynamicClassAttribute
ModuleNotFoundError: No module named 'types'
This is pretty weird. Don't you have this basic Python module? Please try running import types.
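One detail in the log above that may be worth checking: the failing .pth file carries a py3.10 tag but sits in a python3.8 site-packages directory, which could point to packages from two Python versions being mixed in one environment. This is only a diagnostic idea, not a confirmed cause. The sketch below lists .pth files whose version tag differs from the running interpreter; mismatched_pth_files is a hypothetical helper name, and the filename pattern is an assumption based on the single filename shown in the log.

```python
import re
import site
import sys
from pathlib import Path

# Assumed naming pattern, based on 'google_auth-2.6.2-py3.10-nspkg.pth'.
PY_TAG = re.compile(r"-py(\d+)\.(\d+)[-.]")


def mismatched_pth_files(dirs=None, version=None):
    """Return .pth files whose pyX.Y tag differs from the given Python version."""
    if dirs is None:
        dirs = site.getsitepackages()
    if version is None:
        version = tuple(sys.version_info[:2])
    found = []
    for d in dirs:
        for pth in sorted(Path(d).glob("*.pth")):
            match = PY_TAG.search(pth.name)
            if match and (int(match.group(1)), int(match.group(2))) != tuple(version):
                found.append(pth)
    return found


if __name__ == "__main__":
    import types  # the module the traceback reports as missing

    print("types imported from:", getattr(types, "__file__", "built-in"))
    for path in mismatched_pth_files():
        print("version-tag mismatch:", path)
```

If the script itself cannot start because the interpreter fails while loading site, running it with `python -S` (which skips the site import) may still work.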
@Arthur151 OK, I have CUDA 10.2, pytorch==1.10.0, and torchvision==0.11.1, and I am getting this error:
/z/home/mahzad-khosh/env/romp/bin/python: No module named torchrun.
My Python version is 3.8.13.
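The message "No module named torchrun" is what Python prints when something is run as `python -m torchrun`. As far as I know, torchrun is installed as a console script rather than an importable module, and its implementation lives in torch.distributed.run, so the launcher would be started either as `torchrun ...` directly or as `python -m torch.distributed.run ...`. Here is a minimal sketch, under that assumption, that checks which of the two forms is available in the current environment; find_launcher is a hypothetical helper name.

```python
import importlib.util
import shutil
import sys


def find_launcher():
    """Return an argv prefix for the elastic launcher, or None if none is found."""
    # Prefer the console script if it is on PATH.
    exe = shutil.which("torchrun")
    if exe:
        return [exe]
    # Otherwise fall back to the module form. find_spec imports the parent
    # packages, so it raises ImportError when torch is not installed.
    try:
        spec = importlib.util.find_spec("torch.distributed.run")
    except ImportError:
        spec = None
    if spec is not None:
        return [sys.executable, "-m", "torch.distributed.run"]
    return None


if __name__ == "__main__":
    launcher = find_launcher()
    if launcher is None:
        print("No torchrun script or torch.distributed.run module found")
    else:
        print("Launcher prefix:", " ".join(launcher))
```

The returned prefix would then be followed by the same arguments as in the torchrun command above, for example --nproc_per_node=4 -m romp.train and the ROMP options.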