
Exception: process 0 terminated with signal SIGKILL

Open • dylanqyuan opened this issue 5 years ago • 6 comments

Hi,

Dependencies: PyTorch 1.4.0, CUDA 10.2, PyTorch-Encoding master branch.

The following command is run on a single GPU (GeForce RTX 2080, 8 GB):

CUDA_VISIBLE_DEVICES=0 python train_dist.py --dataset PContext --model EncNet --aux --se-loss

The following error occurred:

Using poly LR scheduler with warm-up epochs of 0!
Starting Epoch: 0
Total Epoches: 80
Traceback (most recent call last):
  File "train_dist.py", line 319, in <module>
    main()
  File "train_dist.py", line 148, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/qyuan/anaconda3/envs/pytorch_encoding_interpreter/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/qyuan/anaconda3/envs/pytorch_encoding_interpreter/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL

I have read all the related issues, but the problem is still there. Could you please give me some suggestions? Thanks for your time!

dylanqyuan avatar Aug 31 '20 12:08 dylanqyuan
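For context, a minimal sketch (plain PyTorch, not PyTorch-Encoding code) of how torch.multiprocessing.spawn reports a worker that was killed from outside: the parent turns the child's termination signal into exactly this exception, so the SIGKILL itself comes from outside the script (often the kernel's out-of-memory killer on a memory-constrained machine).

```python
# Minimal sketch, not PyTorch-Encoding code: reproduce the shape of the error
# by spawning a worker that is killed with SIGKILL, the way an external
# process (e.g. the OOM killer) would kill it.
import os
import signal

import torch.multiprocessing as mp


def worker(rank):
    # Simulate the child process being killed externally.
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == "__main__":
    try:
        mp.spawn(worker, nprocs=1)
    except Exception as exc:
        # On PyTorch 1.4 this prints: "process 0 terminated with signal SIGKILL"
        print(exc)
```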

That looks like a pytorch issue. I haven't seen this before.

zhanghang1989 avatar Aug 31 '20 15:08 zhanghang1989

> dependency: pytorch 1.4.0, CUDA 10.2 ... Exception: process 0 terminated with signal SIGKILL

I had this problem too. Did you solve it?

anewusername77 avatar Aug 26 '21 13:08 anewusername77

Do you have more than 1 GPU? That may be an issue with doing distributed training; I haven't tried it with a single GPU.

zhanghang1989 avatar Aug 26 '21 15:08 zhanghang1989
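As a quick check (plain PyTorch, nothing project-specific), you can confirm how many GPUs the script will actually see; the nprocs=ngpus_per_node in the traceback above is typically derived from this count.

```python
# Quick check: how many GPUs are visible under the current
# CUDA_VISIBLE_DEVICES setting.
import torch

print(torch.cuda.is_available())   # False would rule out GPU training entirely
print(torch.cuda.device_count())   # 1 here, given CUDA_VISIBLE_DEVICES=0
```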

> ... Exception: process 0 terminated with signal SIGKILL

> I had this problem too. Did you solve it?

Have you solved it? I have a similar problem; I only use one GPU.

zura-false avatar Oct 07 '21 12:10 zura-false

> Have you solved it? I have a similar problem; I only use one GPU.

If only using 1 gpu, avoid using mp.spawn. Call main_worker(gpu, ngpus_per_node, args) directly. https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/experiments/segmentation/train_dist.py#L148

zhanghang1989 avatar Oct 07 '21 20:10 zhanghang1989
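A minimal sketch of that change, assuming the structure of the linked train_dist.py (main_worker, args, and ngpus_per_node come from that file; only the dispatch around line 148 is shown):

```python
# Sketch of the single-GPU dispatch, assuming main_worker, args and
# ngpus_per_node as defined in experiments/segmentation/train_dist.py.
import torch.multiprocessing as mp

if ngpus_per_node > 1:
    # Multi-GPU: one worker process per GPU (the original code path).
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
else:
    # Single GPU: run the worker directly in this process, no mp.spawn.
    main_worker(0, ngpus_per_node, args)
```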

> If only using 1 GPU, avoid using mp.spawn; call main_worker(gpu, ngpus_per_node, args) directly. https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/experiments/segmentation/train_dist.py#L148

Thank you very much, I'll try it.

zura-false avatar Oct 08 '21 14:10 zura-false