Project-MONAI/research-contributions

Why is the process loading the data killed?

Open fuyuchenIfyw opened this issue 3 years ago • 3 comments

Describe the bug
Hello, I ran into a problem with CacheDataset when following the training procedure in research-contributions/DiNTS/train_multi-gpu.py. I used the MSD Task03 Liver dataset; while CacheDataset was loading the data, the run was killed partway through with:

```
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 239172 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 239173 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 239174) of binary: /home/fyc/anaconda3/envs/cv/bin/python
Traceback (most recent call last):
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/fyc/anaconda3/envs/cv/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_multi-gpu.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-12-12_16:06:40
  host      : amax
  rank      : 2 (local_rank: 2)
  exitcode  : -9 (pid: 239174)
  error_file: <N/A>
  traceback : Signal 9 (SIGKILL) received by PID 239174
============================================================
```

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'research-contributions/DiNTS'
  2. Install dependencies
  3. Run 'bash run_train_multi-gpu.sh'. I didn't follow the README.md to run inside Docker; I used a conda env instead.

Expected behavior
When I set the cache_rate of CacheDataset to 0.0, training runs normally, but when I set cache_rate to 1.0 this error occurs.
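For context, here is a minimal sketch of where that setting lives. This is not the DiNTS script itself; the file names and transforms below are placeholders I made up for illustration.

```python
from monai.data import CacheDataset, DataLoader
from monai.transforms import Compose, LoadImaged, EnsureChannelFirstd

# Placeholder data list and transforms, standing in for what the real script builds.
train_files = [{"image": "liver_0.nii.gz", "label": "liver_0_seg.nii.gz"}]
transforms = Compose([
    LoadImaged(keys=["image", "label"]),
    EnsureChannelFirstd(keys=["image", "label"]),
])

# cache_rate=1.0 keeps every pre-processed sample in host RAM; with large 3D CT
# volumes this can exhaust system memory while the cache is being filled, which
# typically ends with the kernel killing the process via SIGKILL (exit code -9).
# cache_rate=0.0 caches nothing, which would explain why that setting trains.
train_ds = CacheDataset(data=train_files, transform=transforms,
                        cache_rate=1.0, num_workers=4)
loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=2)
```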


Environment:

  • OS: Ubuntu 18.04.5 LTS
  • Python version: 3.8
  • MONAI version: 1.0.1
  • CUDA/cuDNN version: 11.7
  • GPU models and configuration: 3× RTX 3090


fuyuchenIfyw · Dec 12 '22

Hi @fuyuchenIfyw, I have the same problem. Did you manage to figure it out?

Jamshidhsp · Mar 31 '23

I'm sorry to say I ultimately didn't solve the problem. I believe it may have been due to insufficient hardware resources. In the end, I gave up on using MONAI and instead opted to develop with the MIC-DKFZ/nnUNet framework on GitHub.
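For what it's worth, here is a rough sketch (not something I actually ran) of how one could sanity-check whether the fully cached dataset fits in RAM before launching. It assumes nibabel and psutil are installed, and that train_files is the same list of dicts passed to CacheDataset; the file names are placeholders.

```python
import nibabel as nib
import psutil

# Placeholder list; in practice this is the datalist built by the training script.
train_files = [{"image": "liver_0.nii.gz", "label": "liver_0_seg.nii.gz"}]

def estimate_cache_bytes(files, dtype_bytes=4):
    """Very rough lower bound: raw voxel count of each image/label pair as float32.

    Transforms that resample or pad can make the cached tensors larger than this.
    """
    total = 0
    for item in files:
        for key in ("image", "label"):
            n_voxels = 1
            for dim in nib.load(item[key]).shape:
                n_voxels *= dim
            total += n_voxels * dtype_bytes
    return total

needed = estimate_cache_bytes(train_files)
print(f"estimated cache size: {needed / 1e9:.2f} GB")
print(f"available RAM       : {psutil.virtual_memory().available / 1e9:.2f} GB")
# torch.distributed.launch starts one Python process per GPU, so unless the data
# list is partitioned per rank, each process may build its own full cache.
```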

fuyuchenIfyw · Apr 01 '23

I have the same problem.

While loading about 300 3D images, I got the following error:

```
monai.transforms.croppad.dictionary CropForegroundd.__init__:allow_smaller: Current default value of argument `allow_smaller=True` has been deprecated since version 1.2. It will be changed to `allow_smaller=False` in version 1.5.
Loading dataset:  80%|████████████████████▋  | 207/260 [27:33<09:36, 10.87s/it]
Killed
```

Any idea?
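Would lowering the cache footprint be the right fix here? A sketch of the two knobs I'm aware of on CacheDataset, assuming the same kind of data list and transforms as the run above (the file names and values below are only illustrative):

```python
from monai.data import CacheDataset
from monai.transforms import Compose, LoadImaged

# Placeholders standing in for the real 260-item data list and transform chain.
train_files = [{"image": f"img_{i}.nii.gz"} for i in range(260)]
transforms = Compose([LoadImaged(keys=["image"])])

# Cache only a fraction of the items in RAM ...
ds_fraction = CacheDataset(data=train_files, transform=transforms,
                           cache_rate=0.5, num_workers=4)

# ... or cap the cache at an absolute number of items.
ds_capped = CacheDataset(data=train_files, transform=transforms,
                         cache_num=100, num_workers=4)
```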

ancia290 · Apr 04 '24