Multi-GPU training fails with the following error:

    Traceback (most recent call last):
      File "train.py", line 107, in <module>
        out = net(images)
      File "/home/walker2/anaconda3/envs/pytorch1.2/lib/python3.6/site-packages/torch/nn/modules/module.py", line 547, in __call__
        result = self.forward(*input, **kwargs)
      File "/home/walker2/anaconda3/envs/pytorch1.2/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
        return self.gather(outputs, self.output_device)
      File "/home/walker2/anaconda3/envs/pytorch1.2/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
        return gather(outputs, output_device, dim=self.dim)
      File "/home/walker2/anaconda3/envs/pytorch1.2/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
        res = gather_map(outputs)
      File "/home/walker2/anaconda3/envs/pytorch1.2/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
        return type(out)(map(gather_map, zip(*outputs)))
      File "/home/walker2/anaconda3/envs/pytorch1.2/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
        return Gather.apply(target_device, dim, *outputs)
      File "/home/walker2/anaconda3/envs/pytorch1.2/lib/python3.6/site-packages/torch/nn/parallel/_functions.py", line 54, in forward
        assert all(map(lambda i: i.is_cuda, inputs))
    AssertionError
If I add the following to train.py so that only one GPU is used, training runs normally:

    import os
    os.environ["CUDA_VISIBLE_DEVICES"] = "0"
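A note on this workaround: `CUDA_VISIBLE_DEVICES` is read when CUDA is initialized, so it needs to be set before the first CUDA call, which in practice means at the very top of the script. A minimal sketch follows; `restrict_visible_gpus` is a hypothetical helper name used only for illustration and is not part of PyTorch.

```python
import os


def restrict_visible_gpus(gpu_ids):
    """Hypothetical helper: expose only the given GPU indices to this process."""
    value = ",".join(str(i) for i in gpu_ids)
    os.environ["CUDA_VISIBLE_DEVICES"] = value
    return value


# Single-GPU workaround from the post: only GPU 0 stays visible.
restrict_visible_gpus([0])

# Import torch and build the model only after this point, so that CUDA
# is initialized with the restricted device list.
```

Passing `[0, 1]` instead would expose two GPUs, which the process then sees as devices 0 and 1.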
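I can't tell why your multi-GPU training fails... I have tried multi-GPU training myself and had no problems. Could it be a configuration issue?
I ran into exactly the same problem. Has it been solved?
Same here: I can only train on one GPU, and the other GPUs do not run in parallel.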
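SSD's forward returns the priors, which live on the CPU. They are not used anywhere in SSD's forward; they are only defined there and returned, and that is what causes the problem: when DataParallel gathers the outputs, it asserts that every output is a CUDA tensor (the `is_cuda` assertion at the bottom of the traceback).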
Move the definition of the priors into MultiBoxLoss and multi-GPU training works:

    import torch
    import torch.nn as nn
    from torch.autograd import Variable
    # PriorBox is the repository's prior-box generator; import it from
    # wherever it is defined in your checkout.

    class MultiBoxLoss(nn.Module):
        def __init__(self, num_classes, overlap_thresh, prior_for_matching,
                     bkg_label, neg_mining, neg_pos, neg_overlap, encode_target,
                     gpu_num, negatives_for_hard=100.0):
            super(MultiBoxLoss, self).__init__()
            self.gpu_num = gpu_num
            self.num_classes = num_classes
            self.threshold = overlap_thresh
            self.background_label = bkg_label
            self.encode_target = encode_target
            self.use_prior_for_matching = prior_for_matching
            self.do_neg_mining = neg_mining
            self.negpos_ratio = neg_pos
            self.neg_overlap = neg_overlap
            self.negatives_for_hard = negatives_for_hard
            self.variance = [0.1, 0.2]
            # Changed lines: build the priors here instead of returning
            # them from SSD.forward.
            with torch.no_grad():
                self.priors = Variable(PriorBox().forward())

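Why do the priors prevent multi-GPU training? Is it because they are on the CPU?
Probably. Multi-GPU training runs on the GPUs, but the priors are defined on the CPU, so they probably cannot be distributed across multiple GPUs.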
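For anyone debugging the same assertion, here is a rough sketch of a helper that walks a model's output and reports which entries are not on a GPU. `find_non_cuda_outputs` and `FakeTensor` are hypothetical names made up for this example; the helper only relies on the `is_cuda` attribute that `torch.Tensor` exposes, and the `(loc, conf, priors)` output order in the demo is an assumption.

```python
def find_non_cuda_outputs(outputs, path="out"):
    """Return the paths of leaves in a nested tuple/list/dict whose is_cuda is False."""
    if isinstance(outputs, (tuple, list)):
        found = []
        for i, item in enumerate(outputs):
            found += find_non_cuda_outputs(item, f"{path}[{i}]")
        return found
    if isinstance(outputs, dict):
        found = []
        for key, item in outputs.items():
            found += find_non_cuda_outputs(item, f"{path}[{key!r}]")
        return found
    # Leaves without an is_cuda attribute (ints, None, ...) are ignored.
    if getattr(outputs, "is_cuda", True) is False:
        return [path]
    return []


class FakeTensor:
    """Stand-in carrying only is_cuda, so the demo runs without a GPU."""

    def __init__(self, is_cuda):
        self.is_cuda = is_cuda


# Assumed output layout: (loc, conf, priors), with the priors left on the CPU.
demo = (FakeTensor(True), FakeTensor(True), FakeTensor(False))
print(find_non_cuda_outputs(demo))  # ['out[2]']
```

With real tensors, the same function could be called on the output of a single forward pass of the unwrapped model on one GPU; any path it reports is an output that the gather step in DataParallel would reject.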
0 0 So that's what it was. I need to find some time to fix it; I've been too busy.