Exception: process 0 terminated with signal SIGKILL
Hi,
Dependencies: PyTorch 1.4.0, CUDA 10.2, PyTorch-Encoding master branch.
The following command is run on a single GPU (GeForce RTX 2080, 8 GB):
CUDA_VISIBLE_DEVICES=0 python train_dist.py --dataset PContext --model EncNet --aux --se-loss
The following issue happened:
Using poly LR scheduler with warm-up epochs of 0!
Starting Epoch: 0
Total Epoches: 80
Traceback (most recent call last):
  File "train_dist.py", line 319, in <module>
    main()
  File "train_dist.py", line 148, in main
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
  File "/home/qyuan/anaconda3/envs/pytorch_encoding_interpreter/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/qyuan/anaconda3/envs/pytorch_encoding_interpreter/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 107, in join
    (error_index, name)
Exception: process 0 terminated with signal SIGKILL
I have read all the issues related to it, but the problem is still there. Could you please give me some suggestions? Thanks for your time!
That looks like a PyTorch issue. I haven't seen this before.
I had this problem too. Did you solve it?
Do you have more than one GPU? Running the distributed training on a single GPU may be the issue; I haven't tried it with one GPU.
Have you solved it? I have a similar problem. I only use one GPU.
If you are only using one GPU, avoid using mp.spawn; call main_worker(gpu, ngpus_per_node, args) directly.
https://github.com/zhanghang1989/PyTorch-Encoding/blob/master/experiments/segmentation/train_dist.py#L148
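For what it's worth, here is a minimal sketch of that change, assuming the main()/main_worker structure of the linked train_dist.py (parse_args is a hypothetical stand-in for the script's own option parsing; main_worker is the worker function already defined in the script):

    import torch
    import torch.multiprocessing as mp

    def main():
        args = parse_args()  # hypothetical stand-in for the script's own option parsing
        ngpus_per_node = torch.cuda.device_count()
        if ngpus_per_node > 1:
            # Multi-GPU: one worker process per GPU, as the original script does.
            # mp.spawn passes the process index as the first argument to main_worker.
            mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
        else:
            # Single GPU: run the worker in the current process instead of
            # spawning a child process (the one that was killed with SIGKILL).
            main_worker(0, ngpus_per_node, args)

The direct call keeps the same signature mp.spawn would use (gpu index 0, ngpus_per_node, args), so main_worker itself should not need changes.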
Thank you very much. I'll try it.