RuntimeError: DataLoader worker (pid 33959) is killed by signal: Killed. / RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)
I encountered this problem during the training of stage 2:
```
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 936, in wait
    timeout = deadline - time.monotonic()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 33959) is killed by signal: Killed.

Traceback (most recent call last):
  File "/home/gpu4/UniAD/./tools/train.py", line 256, in
```
Key finding: tensors are being indexed with indices that live on a different device. PyTorch requires indices to be either on CPU or on the same device as the indexed tensor (the "(cpu)" in the message names the device of the indexed tensor); any other mismatch raises: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu).
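A minimal, self-contained repro of the device rule (assumes a CUDA-capable machine; this is illustrative code, not UniAD code):

```python
import torch

t_cpu = torch.arange(6).reshape(3, 2)          # indexed tensor lives on CPU
idx_gpu = torch.tensor([0, 2], device="cuda")  # index tensor lives on the GPU

try:
    _ = t_cpu[idx_gpu]  # GPU index into a CPU tensor -> device mismatch
except RuntimeError as e:
    print(e)  # indices should be either on cpu or on the same device as the indexed tensor (cpu)

# Fix: move the index to the indexed tensor's device before indexing.
print(t_cpu[idx_gpu.to(t_cpu.device)])
```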
In motion_head.py:
1. In compute_matched_gt_traj:

```python
matched_gt_bboxes_3d = gt_bboxes_3d[i][-1].tensor[matched_gt_idx[:-1]][valid_traj_masks[:-1]]
```

matched_gt_idx comes from all_matched_idxes and does not live on the same device as gt_bboxes_3d[i][-1].tensor, so this indexing raises the RuntimeError above. The simplest fix is to move matched_gt_idx to the indexed tensor's device as soon as the function is entered, as sketched below.
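A minimal sketch of fix 1 in context (names follow the snippet above; the real function in UniAD has more surrounding code):

```python
# At the start of compute_matched_gt_traj's per-sample handling:
matched_gt_idx = all_matched_idxes[i].to(gt_bboxes_3d[i][-1].tensor.device)
valid_traj_masks = matched_gt_idx >= 0
# Index and mask now share a device with the tensor they index:
matched_gt_bboxes_3d = gt_bboxes_3d[i][-1].tensor[matched_gt_idx[:-1]][valid_traj_masks[:-1]]
```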
2. Same function, same pattern, in the SDC branch:

```python
sdc_gt_fut_traj = matched_gt_fut_traj[-1:]
sdc_gt_fut_traj_mask = matched_gt_fut_traj_mask[-1:]
```

These two lines do not index gt_bboxes_3d themselves, but

```python
bboxes = track_bbox_results[i][0].tensor[valid_traj_masks]
```

does: valid_traj_masks is computed from matched_gt_idx, so it inherits that tensor's device, and masking .tensor with a mask on a different device triggers the same error again (see the sketch after this item).
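A sketch of the corresponding fix, assuming the names from the snippet above; if track_bbox_results[i][0].tensor does not share a device with gt_bboxes_3d, the mask needs its own move:

```python
# Align the mask with the tensor it is about to filter:
valid_traj_masks = valid_traj_masks.to(track_bbox_results[i][0].tensor.device)
bboxes = track_bbox_results[i][0].tensor[valid_traj_masks]
```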
3. In compute_loss_traj:

```python
valid_traj_masks = matched_gt_idx >= 0
```

Here matched_gt_idx is still the unmoved version, while the traj_scores / traj_preds it will later mask are on the GPU. Nothing indexes a GPU tensor with valid_traj_masks directly yet, but any later change or alternate code path could trip the same error. Fix: move valid_traj_masks (or matched_gt_idx before the comparison) to the GPU as well.
4. In filter_vehicle_query (nested inside forward_train):

```python
query_label = gt_labels_3d[0][-1][all_matched_idxes[0]]
```

all_matched_idxes[0] does not share a device with gt_labels_3d[0][-1]. Fix:

```python
query_label = gt_labels_3d[0][-1][all_matched_idxes[0].to(gt_labels_3d[0][-1].device)]
```
5. In the filter_vehicle_query nested inside forward_test:

```python
vehicle_mask |= labels == veh_id
```

labels comes from track_bbox_results and lives on the GPU, but vehicle_mask was initialized on the CPU:

```python
vehicle_mask = torch.zeros_like(labels)  # ended up on the CPU here
```

so the in-place |= mixes devices. A sketch of the fixed function follows.
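A sketch of the fixed forward_test variant; the loop over vehicle class ids (vehicle_ids) is a hypothetical reconstruction, not the verbatim UniAD code:

```python
import torch

def filter_vehicle_query(labels: torch.Tensor, vehicle_ids) -> torch.Tensor:
    # Create the mask on labels' device so the in-place |= never mixes devices.
    vehicle_mask = torch.zeros_like(labels, dtype=torch.bool, device=labels.device)
    for veh_id in vehicle_ids:  # hypothetical iteration over vehicle class ids
        vehicle_mask |= labels == veh_id  # both operands now share a device
    return vehicle_mask
```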
Summary of changes:

```python
# 1. compute_matched_gt_traj
matched_gt_idx = all_matched_idxes[i].to(gt_bboxes_3d[i][-1].tensor.device)
valid_traj_masks = matched_gt_idx >= 0

# 2. compute_loss_traj
matched_gt_idx = all_matched_idxes[i].to(traj_scores.device)
valid_traj_masks = matched_gt_idx >= 0

# 3. filter_vehicle_query inside forward_train
query_label = gt_labels_3d[0][-1][all_matched_idxes[0].to(gt_labels_3d[0][-1].device)]

# 4. filter_vehicle_query inside forward_test
vehicle_mask = torch.zeros_like(labels, device=labels.device)
```
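All four fixes are instances of one pattern; a tiny hypothetical helper (not part of UniAD) could centralize it:

```python
import torch

def align_index(index: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Return `index` on `target`'s device; `.to()` is a no-op if already aligned."""
    return index.to(target.device)

# e.g. matched_gt_idx = align_index(all_matched_idxes[i], gt_bboxes_3d[i][-1].tensor)
```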
Great summary. I just want to ask: why has such an obvious bug not been pointed out by anyone else yet?