Dataloader worker killed with runtime error.
Hello,
While training the stage-two network, I'm seeing the following error.
Is anyone else seeing the same error?
Traceback (most recent call last):
File "./tools/train.py", line 256, in <module>
CHILD PROCESS FAILED WITH NO ERROR_FILE
CHILD PROCESS FAILED WITH NO ERROR_FILE
Child process 1099909 (local_rank 1) FAILED (exitcode 1)
Error msg: Process failed with exitcode 1
Without writing an error file to <N/A>.
While this DOES NOT affect the correctness of your application,
no trace information about the error will be available for inspection.
Consider decorating your top level entrypoint function with
torch.distributed.elastic.multiprocessing.errors.record. Example:
from torch.distributed.elastic.multiprocessing.errors import record
@record
def trainer_main(args):
    # do train
warnings.warn(_no_error_file_warning_msg(rank, failure))
Traceback (most recent call last):
File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/ubuntu/.conda/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 702, in
./tools/train.py FAILED
=======================================
Root Cause:
[0]:
time: 2023-07-07_12:12:31
rank: 1 (local_rank: 1)
exitcode: 1 (pid: 1099909)
error_file: <N/A>
msg: "Process failed with exitcode 1"
Other Failures:
<NO_OTHER_FAILURES>
Thanks for your attention. I'm training this on an AWS EC2 g5.12xlarge instance with 4 A10G GPUs.
Regards, Venkat
Hello, I have the same problem. Have you solved it?
(uniad) ➜ UniAD git:(dev) ./tools/uniad_dist_train.sh ./projects/configs/stage1_track_map/base_track_map.py 1
projects.mmdet3d_plugin
Traceback (most recent call last):
File "./tools/train.py", line 256, in <module>
main()
File "./tools/train.py", line 173, in main
cfg.dump(osp.join(cfg.work_dir, osp.basename(args.config)))
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/mmcv/utils/config.py", line 541, in dump
f.write(self.pretty_text)
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/mmcv/utils/config.py", line 496, in pretty_text
text, _ = FormatCode(text, style_config=yapf_style, verify=True)
TypeError: FormatCode() got an unexpected keyword argument 'verify'
/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py:178: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torch.distributed.run.
Note that --use_env is set by default in torch.distributed.run.
If your script expects `--local_rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 59905) of binary: /usr/miniconda3/envs/uniad/bin/python
Traceback (most recent call last):
File "/usr/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/miniconda3/envs/uniad/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/run.py", line 689, in run
elastic_launch(
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/usr/miniconda3/envs/uniad/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 244, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
***************************************
./tools/train.py FAILED
=======================================
Root Cause:
[0]:
time: 2023-10-27_14:46:08
rank: 0 (local_rank: 0)
exitcode: 1 (pid: 59905)
error_file: <N/A>
msg: "Process failed with exitcode 1"
=======================================
Other Failures:
<NO_OTHER_FAILURES>
***************************************
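Note that the FutureWarning in this log is separate from the TypeError that actually kills the process. For anyone who wants to act on the warning anyway, here is a minimal sketch of reading the rank from `LOCAL_RANK` while still accepting the old `--local_rank` flag. The helper name `parse_local_rank` is made up for illustration and is not part of UniAD or PyTorch.

```python
import argparse
import os


def parse_local_rank(argv=None, environ=None):
    """Return the local rank, preferring the LOCAL_RANK environment variable.

    Hypothetical helper: falls back to the legacy --local_rank flag
    (default 0) when LOCAL_RANK is not set.
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser()
    # Still accepted so scripts started by the deprecated launcher keep working.
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args ignores any other training flags on the command line.
    args, _unknown = parser.parse_known_args(argv)

    if "LOCAL_RANK" in environ:
        return int(environ["LOCAL_RANK"])
    return args.local_rank


if __name__ == "__main__":
    print("local rank:", parse_local_rank())
```

The environment variable wins over the flag here, which matches what the warning asks for; swap the order if your launcher still passes the flag explicitly.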
Have you solved this?
Hello,
I did solve this problem. May I ask at what point you are hitting this issue?
If I remember correctly, I was hitting it during the validation check, and setting the following environment variable fixed it:
NCCL_P2P_DISABLE=1
Thanks, Venkat
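For context, NCCL_P2P_DISABLE is an NCCL environment variable that turns off direct GPU-to-GPU (peer-to-peer) transfers. It is normally set on the command line in front of the launch script, e.g. `NCCL_P2P_DISABLE=1 ./tools/uniad_dist_train.sh <config> <num_gpus>`. If setting it from Python is more convenient, something like the sketch below could work, assuming it runs before the distributed process group is created. `disable_nccl_p2p` is a hypothetical helper, not something in the UniAD repo.

```python
import os


def disable_nccl_p2p(environ=None):
    """Set NCCL_P2P_DISABLE=1 unless the user has already chosen a value.

    Hypothetical helper: must be called before NCCL is initialised,
    because NCCL reads its environment variables at start-up.
    Returns the value that ends up in the environment.
    """
    environ = os.environ if environ is None else environ
    # setdefault keeps an explicit user setting (e.g. "0") untouched.
    environ.setdefault("NCCL_P2P_DISABLE", "1")
    return environ["NCCL_P2P_DISABLE"]


if __name__ == "__main__":
    print("NCCL_P2P_DISABLE =", disable_nccl_p2p())
```

Disabling P2P may cost some inter-GPU bandwidth, so it is probably worth confirming that it actually fixes the hang or crash before leaving it on permanently.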
@xiexu666 @daxiongpro
Hello,
Run the following commands to resolve this problem:
$ pip uninstall yapf
$ pip install yapf==0.40.1
Reference: https://blog.csdn.net/ZZZZ_Y_/article/details/133902230
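The TypeError earlier in the thread comes from mmcv calling FormatCode with verify=True, which newer yapf releases apparently no longer accept; pinning 0.40.1 brings the argument back. A quick way to check an environment before launching training is to inspect the function signature. `accepts_keyword` below is a hypothetical helper written for this sketch, and the check only runs if yapf is importable.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if `func` can be called with the keyword argument `name`."""
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    return any(
        # A **kwargs parameter swallows any keyword.
        p.kind is inspect.Parameter.VAR_KEYWORD
        or (p.name == name and p.kind in keyword_kinds)
        for p in inspect.signature(func).parameters.values()
    )


if __name__ == "__main__":
    try:
        # Same import path that mmcv/utils/config.py uses.
        from yapf.yapflib.yapf_api import FormatCode
    except ImportError:
        print("yapf is not installed in this environment")
    else:
        if accepts_keyword(FormatCode, "verify"):
            print("FormatCode accepts verify=; cfg.dump() should work")
        else:
            print("FormatCode rejects verify=; try pip install yapf==0.40.1")
```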