Runtime error
subprocess.CalledProcessError: Command '['/home/gxu4090x2/.conda/envs/cat/bin/python3.11', '-m', 'torch.distributed.launch', '--nproc_per_node=1', '--master_port=26968', '/home/gxu4090x2/.conda/envs/cat/lib/python3.11/site-packages/mmdet/.mim/tools/train.py', 'configs/dior/catnet_r50_3x_dior.py', '--launcher', 'pytorch']' returned non-zero exit status 1. How can I resolve this issue?
Could you please provide more details about the error log and the environment?
Error log:
Using port 22203 for synchronization.
Training command is /home/gxu4090x2/.conda/envs/sod/bin/python3.11 -m torch.distributed.launch --nproc_per_node=1 --master_port=22203 /home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmdet/.mim/tools/train.py configs/dior/catnet_r50_3x_dior.py --launcher pytorch.
/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects --local-rank argument to be set, please
change it to read from os.environ['LOCAL_RANK'] instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
A module that was compiled using NumPy 1.x cannot be run in NumPy 2.1.1 as it may crash. To support both 1.x and 2.x versions of NumPy, modules must be compiled with NumPy 2.0. Some module may need to rebuild instead e.g. with 'pybind11>=2.12'.
If you are a user of the module, the easiest solution will be to downgrade to 'numpy<2' or try to upgrade the affected module. We expect that some modules will need time to support NumPy 2.
Traceback (most recent call last):
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmdet/.mim/tools/train.py", line 10, in
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/torch/distributed/optim/__init__.py", line 30, in
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmengine/config/config.py", line 182, in fromfile
import_modules_from_strings(**cfg_dict['custom_imports'])
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmengine/utils/misc.py", line 84, in import_modules_from_strings
raise ImportError(f'Failed to import {imp}')
ImportError: Failed to import models
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmdet/.mim/tools/train.py", line 133, in
main()
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmdet/.mim/tools/train.py", line 70, in main
cfg = Config.fromfile(args.config)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmengine/config/config.py", line 192, in fromfile
raise ImportError(err_msg) from e
ImportError: Failed to import custom modules from {'imports': ['models', 'datasets']}, the current sys.path is:
/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmdet/.mim/tools
/home/gxu4090x2/桌面/lqh/program/CATNet
/home/gxu4090x2/.conda/envs/sod/lib/python311.zip
/home/gxu4090x2/.conda/envs/sod/lib/python3.11
/home/gxu4090x2/.conda/envs/sod/lib/python3.11/lib-dynload
/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages
/tmp/tmpu97o2siy
You should set PYTHONPATH to make sys.path include the directory which contains your custom module
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 629591) of binary: /home/gxu4090x2/.conda/envs/sod/bin/python3.11
Traceback (most recent call last):
File "", line 198, in _run_module_as_main
File "", line 88, in _run_code
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/torch/distributed/launch.py", line 196, in
main()
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
File "/home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages/mmdet/.mim/tools/train.py FAILED
Failures: <NO_OTHER_FAILURES>
Root Cause (first observed failure):
[0]:
time : 2024-09-13_16:10:02
host : gxu4090x2-ubuntu
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 629591)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
File "/home/gxu4090x2/.conda/envs/sod/bin/mim", line 8, in
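The log above carries two separate hints: the NumPy 1.x/2.x warning, and the failed import of the custom modules models and datasets even though the project directory already appears in the printed sys.path. The following is a minimal sketch for checking both from the project root, using only the standard library. The helper names numpy_major and locate_custom_modules are hypothetical and are not part of mmengine or mmdet.

```python
import os
import sys
from importlib import metadata


def numpy_major(version=None):
    """Return NumPy's major version as an int, or None if NumPy is not installed."""
    if version is None:
        try:
            version = metadata.version("numpy")
        except metadata.PackageNotFoundError:
            return None
    return int(version.split(".")[0])


def locate_custom_modules(names, search_path=None):
    """Map each module name to the search-path entries that contain it.

    An entry matches if it holds a directory called <name> (a package)
    or a file called <name>.py (a plain module).
    """
    search_path = sys.path if search_path is None else search_path
    found = {}
    for name in names:
        hits = []
        for entry in search_path:
            base = entry or os.getcwd()  # '' on sys.path means the current directory
            candidate = os.path.join(base, name)
            if os.path.isdir(candidate) or os.path.isfile(candidate + ".py"):
                hits.append(base)
        found[name] = hits
    return found


if __name__ == "__main__":
    major = numpy_major()
    if major is not None and major >= 2:
        print("NumPy >= 2 detected; the warning in the log suggests trying 'numpy<2'.")
    # 'models' and 'datasets' are the names listed under custom_imports in the log.
    for name, hits in locate_custom_modules(["models", "datasets"]).items():
        print(name, "->", hits or "not found on sys.path")
```

If both names resolve to the project directory, the "Failed to import models" error is more likely caused by an exception raised while the module is being imported (for example the NumPy mismatch) than by a missing PYTHONPATH entry.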
This looks like a problem on the mmcv side. Please make sure mmcv is installed correctly.
After running pip show mmcv, the output is:
Name: mmcv
Version: 2.0.1
Summary: OpenMMLab Computer Vision Foundation
Home-page: https://github.com/open-mmlab/mmcv
Author: MMCV Contributors
Author-email: [email protected]
License:
Location: /home/gxu4090x2/.conda/envs/sod/lib/python3.11/site-packages
Requires: addict, mmengine, numpy, opencv-python, packaging, Pillow, pyyaml, yapf
Required-by:
How can I tell whether mmcv is correctly installed?
Your error log reports ModuleNotFoundError: No module named 'mmcv._ext', which means the CUDA extensions were not compiled successfully and only the Python part was installed. Please refer to mmcv's repo for details.
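Since pip show only reports package metadata, one way to check the installation is to import the compiled extension directly and look at the full traceback. The following is a minimal sketch under that assumption. The helper try_import is hypothetical, and the module name mmcv._ext is taken from the reply above.

```python
import importlib
import traceback


def try_import(module_name):
    """Try to import a module; return (ok, detail) instead of raising."""
    try:
        importlib.import_module(module_name)
    except Exception:  # usually ImportError, but binary mismatches can raise other errors
        return False, traceback.format_exc()
    return True, ""


if __name__ == "__main__":
    # Import the pure-Python package first, then the compiled extension.
    for name in ("mmcv", "mmcv._ext"):
        ok, detail = try_import(name)
        print(f"{name}: {'OK' if ok else 'FAILED'}")
        if not ok:
            print(detail)
```

If mmcv imports but mmcv._ext does not, only the Python part is present, which matches the diagnosis above. The printed traceback is also the detail worth including when opening an issue.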
What should I do, since mmcv seems to be installed correctly?
Please create an issue in mmcv's repo.