
Distributed training raises an error

Open mkhoshle opened this issue 3 years ago • 8 comments

Hi,

I am trying to run ROMP in distributed mode, following this script. Since there is no folder called core in the repository, I replaced it with romp. However, when I run the code, it fails with an error saying there is no file called train.py. How can I avoid this error?

Thanks

mkhoshle avatar May 05 '22 22:05 mkhoshle

Thanks for the bug report. Please replace it with romp.train, as in https://github.com/Arthur151/ROMP/blob/master/scripts/train_distributed.sh

Arthur151 avatar May 06 '22 06:05 Arthur151

@Arthur151 I did try romp.train. Even with that I get the error. Here is what I get:

*****************************************
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 256, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/z/home/mahzad-khosh/env/romp/bin/python', '-u', 'romp.train', '--local_rank=3', '--GPUS=0,1,2,3', '--configs_yml=configs/v1_hrnet_3dpw_ft.yml', '--distributed_training=1']' returned non-zero exit status 2.
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory

What is the reason?

mkhoshle avatar May 06 '22 13:05 mkhoshle

Oh, the command you are using is different from the one in my repo. Also, please make sure that you run the code from the ROMP folder.

CUDA_VISIBLE_DEVICES=${GPUS} nohup python -u -m torch.distributed.launch --nproc_per_node=4 romp.train --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1 > '../log/'${TAB}'_'${DATASET}'_g'${GPUS}.log 2>&1 &

Your command drops the -m flag, which tells Python to resolve the name as a module rather than as a file path.

Here is another way to do this, if you don't want to use nohup:

CUDA_VISIBLE_DEVICES=${GPUS} python -u torch.distributed.launch --nproc_per_node=4 /path/to/romp/train.py --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1

The key is to use the absolute path to the train.py file.
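
The script-versus-module difference can be reproduced without PyTorch. The sketch below is only an illustration: demo_pkg is a hypothetical stand-in for the romp package, created in a temporary directory. It runs the same dotted name once as a file path and once with -m. Note that in the launcher case, what matters is how the worker processes are started: the traceback above shows them being run as `python -u romp.train`, that is, without -m in front of the training module.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_both_ways():
    """Run 'demo_pkg.train' as a script path and as a module, and return both results."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical package standing in for romp/ with a train.py inside.
        pkg = Path(tmp) / "demo_pkg"
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "train.py").write_text("print('training entry point reached')\n")

        # Without -m, Python treats "demo_pkg.train" as a file name and cannot open it.
        as_script = subprocess.run(
            [sys.executable, "demo_pkg.train"],
            cwd=tmp, capture_output=True, text=True,
        )
        # With -m, Python imports demo_pkg.train relative to the working directory.
        as_module = subprocess.run(
            [sys.executable, "-m", "demo_pkg.train"],
            cwd=tmp, capture_output=True, text=True,
        )
        return as_script, as_module


if __name__ == "__main__":
    script_result, module_result = run_both_ways()
    print("as script, exit code:", script_result.returncode)
    print("as module, exit code:", module_result.returncode)
```

The -m form only works when the package directory is importable from the working directory, which is why running from the ROMP folder matters.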

Arthur151 avatar May 06 '22 14:05 Arthur151

When I use

CUDA_VISIBLE_DEVICES=${GPUS} python -u torch.distributed.launch --nproc_per_node=4 /path/to/romp/train.py --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1

I get the following error: python: can't open file 'torch.distributed.launch': [Errno 2] No such file or directory

When I run with this command:

CUDA_VISIBLE_DEVICES=${GPUS} nohup python -u -m torch.distributed.launch --nproc_per_node=4 romp.train --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1

I get this error:

/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 261, in <module>
    main()
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/torch/distributed/launch.py", line 256, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/z/home/mahzad-khosh/env/romp/bin/python', '-u', 'romp.train', '--local_rank=3', '--GPUS=0,1,2,3', '--configs_yml=configs/v1_hrnet_3dpw_ft.yml', '--distributed_training=1']' returned non-zero exit status 2.
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory
/z/home/mahzad-khosh/env/romp/bin/python: can't open file 'romp.train': [Errno 2] No such file or directory

mkhoshle avatar May 06 '22 15:05 mkhoshle

It seems that torch.distributed.launch has been deprecated in newer versions of PyTorch, which use torchrun instead. I have tested that this works:

CUDA_VISIBLE_DEVICES=${GPUS} nohup torchrun --nproc_per_node=4 -m romp.train --GPUS=${GPUS} --configs_yml=${TRAIN_CONFIGS} --distributed_training=1 > '../log/'${TAB}'_'${DATASET}'_g'${GPUS}.log 2>&1 &
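
One thing to watch: torchrun is installed as a console script, not as an importable module, so `python -m torchrun` would not find it. As far as I know, the equivalent module form is `python -m torch.distributed.run`. Below is a minimal sketch of a helper that picks whichever form is available. The function name build_launch_command is made up for illustration and is not part of ROMP or PyTorch.

```python
import shutil
import sys


def build_launch_command(nproc_per_node, module, extra_args=()):
    """Build an argv list for launching `module` on `nproc_per_node` processes.

    Hypothetical helper: prefers the torchrun console script when it is on PATH,
    otherwise falls back to running torch.distributed.run as a module.
    """
    if shutil.which("torchrun"):
        launcher = ["torchrun"]
    else:
        # Assumed equivalent of the torchrun console script.
        launcher = [sys.executable, "-m", "torch.distributed.run"]
    # The -m here is a launcher option: it makes each worker run the target as a module.
    return launcher + [f"--nproc_per_node={nproc_per_node}", "-m", module, *extra_args]


if __name__ == "__main__":
    cmd = build_launch_command(4, "romp.train", ["--distributed_training=1"])
    print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run from the ROMP folder, with CUDA_VISIBLE_DEVICES set in the environment as in the command above.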

Arthur151 avatar May 07 '22 01:05 Arthur151

@Arthur151 OK, I replaced torch.distributed.launch with torchrun. When running my code, I get the following error:

Fatal Python error: init_import_size: Failed to import the site module
Python runtime state: initialized
Error processing line 1 of /z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/google_auth-2.6.2-py3.10-nspkg.pth:

Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 169, in addpackage
    exec(line)
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'types'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 580, in <module>
    main()
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 567, in main
    known_paths = addsitepackages(known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 350, in addsitepackages
    addsitedir(sitedir, known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 208, in addsitedir
    addpackage(sitedir, name, known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 179, in addpackage
    import traceback
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/traceback.py", line 5, in <module>
    import linecache
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/linecache.py", line 11, in <module>
    import tokenize
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/tokenize.py", line 32, in <module>
    import re
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/re.py", line 124, in <module>
    import enum
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/enum.py", line 2, in <module>
    from types import MappingProxyType, DynamicClassAttribute
ModuleNotFoundError: No module named 'types'
Error processing line 1 of /z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages/google_auth-2.6.2-py3.10-nspkg.pth:

Fatal Python error: init_import_size: Failed to import the site module
Python runtime state: initialized
Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 169, in addpackage
    exec(line)
  File "<string>", line 1, in <module>
ModuleNotFoundError: No module named 'types'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 580, in <module>
    main()
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 567, in main
    known_paths = addsitepackages(known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 350, in addsitepackages
    addsitedir(sitedir, known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 208, in addsitedir
    addpackage(sitedir, name, known_paths)
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/site.py", line 179, in addpackage
    import traceback
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/traceback.py", line 5, in <module>
    import linecache
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/linecache.py", line 11, in <module>
    import tokenize
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/tokenize.py", line 32, in <module>
    import re
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/re.py", line 124, in <module>
    import enum
  File "/z/home/mahzad-khosh/env/romp/lib/python3.8/enum.py", line 2, in <module>
    from types import MappingProxyType, DynamicClassAttribute
ModuleNotFoundError: No module named 'types'

mkhoshle avatar May 09 '22 14:05 mkhoshle

This is pretty weird. types is part of the Python standard library, so it should always be available. Please try running import types in that environment.
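
One detail in the log stands out: the file being processed is google_auth-2.6.2-py3.10-nspkg.pth, which is tagged py3.10 but sits in a python3.8 site-packages directory. That may indicate packages from another Python version were installed into this environment, though I can't confirm that this is the cause. Here is a rough sketch for listing such files. The helper name find_mismatched_pth is hypothetical, and since the broken interpreter cannot start, it would need to be run with a different, working Python.

```python
import re
import sys
from pathlib import Path

# Matches the version tag in names like "google_auth-2.6.2-py3.10-nspkg.pth".
PTH_VERSION = re.compile(r"-py(\d+\.\d+)-nspkg\.pth$")


def find_mismatched_pth(site_packages, expected=None):
    """Return names of *-nspkg.pth files tagged for a Python version other than `expected`."""
    if expected is None:
        # Default to the version of the interpreter running this check.
        expected = f"{sys.version_info.major}.{sys.version_info.minor}"
    mismatched = []
    for pth in sorted(Path(site_packages).glob("*-nspkg.pth")):
        match = PTH_VERSION.search(pth.name)
        if match and match.group(1) != expected:
            mismatched.append(pth.name)
    return mismatched
```

For the environment in this thread, the call would look like find_mismatched_pth("/z/home/mahzad-khosh/env/romp/lib/python3.8/site-packages", expected="3.8").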

Arthur151 avatar May 09 '22 15:05 Arthur151

@Arthur151 OK, I have CUDA 10.2, pytorch==1.10.0, and torchvision==0.11.1, and I am getting the error: /z/home/mahzad-khosh/env/romp/bin/python: No module named torchrun. My Python version is 3.8.13.

mkhoshle avatar May 09 '22 16:05 mkhoshle