
Exception: process 0 terminated with signal SIGKILL

Open • dylanqyuan opened this issue 5 years ago • 6 comments

Hi,

Dependencies: PyTorch 1.4.0, CUDA 10.2, PyTorch-Encoding master branch.

The following command is run on a single GPU (GeForce RTX 2080, 8 GB):

CUDA_VISIBLE_DEVICES=0 python train_dist.py --dataset PContext --model EncNet --aux --se-loss

The following error occurred:

Using poly LR scheduler with warm-up epochs of 0!
Starting Epoch: 0
Total Epoches: 80
Traceback (most recent call last):
  File "train_dist.py", line 319, in <module>
    main()
  File "train_dist.py", line 148, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/qyuan/anaconda3/envs/pytorch_encoding_interpreter/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/qyuan/anaconda3/envs/pytorch_encoding_interpreter/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL

I have read all the related issues, but the problem is still there. Could you please give me some suggestions? Thanks for your time!

dylanqyuan avatar Aug 31 '20 12:08 dylanqyuan
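For context, a minimal sketch (plain PyTorch, not PyTorch-Encoding code) of how torch.multiprocessing.spawn reports a worker that was killed from outside: the parent turns the child's termination signal into exactly this exception, so the SIGKILL itself comes from outside the script (often the kernel's out-of-memory killer on a memory-constrained machine).

```python
# Minimal sketch, not PyTorch-Encoding code: reproduce the shape of the error
# by spawning a worker that is killed with SIGKILL, the way an external
# process (e.g. the OOM killer) would kill it.
import os
import signal

import torch.multiprocessing as mp


def worker(rank):
    # Simulate the child process being killed externally.
    os.kill(os.getpid(), signal.SIGKILL)


if __name__ == "__main__":
    try:
        mp.spawn(worker, nprocs=1)
    except Exception as exc:
        # On PyTorch 1.4 this prints: "process 0 terminated with signal SIGKILL"
        print(exc)
```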

That looks like a pytorch issue. I haven't seen this before.

zhanghang1989 avatar Aug 31 '20 15:08 zhanghang1989

> dependency: pytorch 1.4.0, CUDA 10.2 ... Exception: process 0 terminated with signal SIGKILL

I had this problem too. Did you solve it?

anewusername77 avatar Aug 26 '21 13:08 anewusername77

Do you have more than 1 GPU? That may be an issue with doing distributed training; I haven't tried it with a single GPU.

zhanghang1989 avatar Aug 26 '21 15:08 zhanghang1989
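As a quick check (plain PyTorch, nothing project-specific), you can confirm how many GPUs the script will actually see; the nprocs=ngpus_per_node in the traceback above is typically derived from this count.

```python
# Quick check: how many GPUs are visible under the current
# CUDA_VISIBLE_DEVICES setting.
import torch

print(torch.cuda.is_available())   # False would rule out GPU training entirely
print(torch.cuda.device_count())   # 1 here, given CUDA_VISIBLE_DEVICES=0
```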

> ... Exception: process 0 terminated with signal SIGKILL

> I had this problem too. Did you solve it?

Have you solved it? I have a similar problem; I only use one GPU.

zura-false avatar Oct 07 '21 12:10 zura-false

> Have you solved it? I have a similar problem; I only use one GPU.

If only using 1 gpu, avoid using mp.spawn. Call main_worker(gpu, ngpus_per_node, args) directly. https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/experiments/segmentation/train_dist.py#L148

zhanghang1989 avatar Oct 07 '21 20:10 zhanghang1989
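A minimal sketch of that change, assuming the structure of the linked train_dist.py (main_worker, args, and ngpus_per_node come from that file; only the dispatch around line 148 is shown):

```python
# Sketch of the single-GPU dispatch, assuming main_worker, args and
# ngpus_per_node as defined in experiments/segmentation/train_dist.py.
import torch.multiprocessing as mp

if ngpus_per_node > 1:
    # Multi-GPU: one worker process per GPU (the original code path).
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
else:
    # Single GPU: run the worker directly in this process, no mp.spawn.
    main_worker(0, ngpus_per_node, args)
```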

> If only using 1 GPU, avoid using mp.spawn; call main_worker(gpu, ngpus_per_node, args) directly. https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/experiments/segmentation/train_dist.py#L148

Thank you very much, I'll try it.

zura-false avatar Oct 08 '21 14:10 zura-false