UniAD

RuntimeError: DataLoader worker (pid 33959) is killed by signal: Killed. / RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

Open deffery opened this issue 4 months ago • 2 comments

I encountered this problem during the training of stage 2.

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1134, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 107, in get
    if not self._poll(timeout):
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 257, in poll
    return self._poll(timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 424, in _poll
    r = wait([self], timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 936, in wait
    timeout = deadline - time.monotonic()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/_utils/signal_handling.py", line 66, in handler
    _error_if_any_worker_fails()
RuntimeError: DataLoader worker (pid 33959) is killed by signal: Killed.

Traceback (most recent call last):
  File "/home/gpu4/UniAD/./tools/train.py", line 256, in <module>
    main()
  File "/home/gpu4/UniAD/./tools/train.py", line 245, in main
    custom_train_model(
  File "/home/gpu4/UniAD/projects/mmdet3d_plugin/uniad/apis/train.py", line 21, in custom_train_model
    custom_train_detector(
  File "/home/gpu4/UniAD/projects/mmdet3d_plugin/uniad/apis/mmdet_train.py", line 194, in custom_train_detector
    runner.run(data_loaders, cfg.workflow)
  File "/home/gpu4/mmcv/mmcv/runner/epoch_based_runner.py", line 136, in run
    epoch_runner(data_loaders[i], **kwargs)
  File "/home/gpu4/mmcv/mmcv/runner/epoch_based_runner.py", line 53, in train
    self.run_iter(data_batch, train_mode=True, **kwargs)
  File "/home/gpu4/mmcv/mmcv/runner/epoch_based_runner.py", line 31, in run_iter
    outputs = self.model.train_step(data_batch, self.optimizer,
  File "/home/gpu4/mmcv/mmcv/parallel/distributed.py", line 63, in train_step
    output = self.module.train_step(*inputs[0], **kwargs[0])
  File "/home/gpu4/miniconda3/envs/uniad2.0/lib/python3.9/site-packages/mmdet/models/detectors/base.py", line 248, in train_step
    losses = self(**data)
  File "/home/gpu4/miniconda3/envs/uniad2.0/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/gpu4/UniAD/projects/mmdet3d_plugin/uniad/detectors/uniad_e2e.py", line 81, in forward
    return self.forward_train(**kwargs)
  File "/home/gpu4/mmcv/mmcv/runner/fp16_utils.py", line 116, in new_func
    return old_func(*args, **kwargs)
  File "/home/gpu4/UniAD/projects/mmdet3d_plugin/uniad/detectors/uniad_e2e.py", line 187, in forward_train
    ret_dict_motion = self.motion_head.forward_train(bev_embed,
  File "/home/gpu4/UniAD/projects/mmdet3d_plugin/uniad/dense_heads/motion_head.py", line 137, in forward_train
    losses = self.loss(*loss_inputs)
  File "/home/gpu4/mmcv/mmcv/runner/fp16_utils.py", line 205, in new_func
    return old_func(*args, **kwargs)
  File "/home/gpu4/UniAD/projects/mmdet3d_plugin/uniad/dense_heads/motion_head.py", line 416, in loss
    gt_fut_traj_all, gt_fut_traj_mask_all = self.compute_matched_gt_traj(
  File "/home/gpu4/UniAD/projects/mmdet3d_plugin/uniad/dense_heads/motion_head.py", line 475, in compute_matched_gt_traj
    bboxes = track_bbox_results[i][0].tensor[valid_traj_masks]
RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu)

deffery · Sep 23 '25 06:09

The key finding: an index tensor on one device is being used to index a tensor on another device, which is exactly what the error complains about: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). The offending spots in motion_head.py (a minimal standalone repro of this failure mode follows the list below):

  • 1. In compute_matched_gt_traj: matched_gt_bboxes_3d = gt_bboxes_3d[i][-1].tensor[matched_gt_idx[:-1]][valid_traj_masks[:-1]]. Here matched_gt_idx is a CPU tensor (all_matched_idxes is kept on the CPU by default), while gt_bboxes_3d[i][-1].tensor is on the GPU. PyTorch will not index across devices like this and throws: RuntimeError: indices should be either on cpu or on the same device as the indexed tensor (cpu). Moving the index onto the matching device fixes it; the simplest approach is to move matched_gt_idx to the target device once, right after entering the function.
  • 2. Same function, same expression, SDC branch:
sdc_gt_fut_traj = matched_gt_fut_traj[-1:]
sdc_gt_fut_traj_mask = matched_gt_fut_traj_mask[-1:]

These two lines do not index gt_bboxes_3d themselves, but bboxes = track_bbox_results[i][0].tensor[valid_traj_masks] does, and valid_traj_masks is derived from the CPU-side matched_gt_idx, so valid_traj_masks is also on the CPU; using it to mask the GPU-side .tensor triggers the same error again.

  • 3. valid_traj_masks = matched_gt_idx >= 0: matched_gt_idx here is still the CPU copy, while the traj_scores / traj_preds that it later masks are on the GPU. At this point valid_traj_masks is not yet used to index a GPU tensor directly, but any later change or a different code path could hit the same error. Fix: move valid_traj_masks to the GPU as well.
  • 4. filter_vehicle_query (nested inside forward_train):
query_label = gt_labels_3d[0][-1][all_matched_idxes[0]]

all_matched_idxes[0] is on the CPU by default, while gt_labels_3d[0][-1] is on the GPU. Fix: query_label = gt_labels_3d[0][-1][all_matched_idxes[0].to(gt_labels_3d[0][-1].device)]

  • 5. filter_vehicle_query nested inside forward_test: vehicle_mask |= labels == veh_id. labels comes from track_bbox_results and is itself on the GPU, but vehicle_mask ends up initialized on the CPU: vehicle_mask = torch.zeros_like(labels)  # on the CPU here
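
To make the failure mode concrete, here is a minimal standalone sketch; it is not taken from UniAD, and the variable names only mirror the ones above. Note that the device shown in parentheses in the error message, "(cpu)", is the device of the indexed tensor, so the sketch puts the mask on CUDA and the boxes on the CPU to reproduce the exact message; whichever side is on which device, the remedy is the same, i.e. move the index/mask onto the device of the tensor being indexed.

import torch

# Minimal repro sketch (not UniAD code); needs a CUDA device to trigger the error.
if torch.cuda.is_available():
    boxes = torch.randn(5, 9)                                # indexed tensor on the CPU
    matched_gt_idx = torch.tensor([0, 2, -1, 4, -1], device="cuda")
    valid_traj_masks = matched_gt_idx >= 0                   # mask inherits the CUDA device

    try:
        _ = boxes[valid_traj_masks]                          # cross-device indexing
    except RuntimeError as err:
        print(err)  # indices should be either on cpu or on the same device as the indexed tensor (cpu)

    # The fix pattern used in all five places above: align the index/mask
    # with the device of the tensor being indexed before the lookup.
    valid_traj_masks = valid_traj_masks.to(boxes.device)
    print(boxes[valid_traj_masks].shape)                     # torch.Size([3, 9])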

Summary of fixes

# 1. compute_matched_gt_traj
matched_gt_idx = all_matched_idxes[i].to(gt_bboxes_3d[i][-1].tensor.device)
valid_traj_masks = matched_gt_idx >= 0

# 2. compute_loss_traj
matched_gt_idx = all_matched_idxes[i].to(traj_scores.device)
valid_traj_masks = matched_gt_idx >= 0

# 3. forward_train 里的 filter_vehicle_query
query_label = gt_labels_3d[0][-1][all_matched_idxes[0].to(gt_labels_3d[0][-1].device)]

# 4. forward_test 里的 filter_vehicle_query
vehicle_mask = torch.zeros_like(labels, device=labels.device)
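
If you would rather not scatter .to(...) calls around, the same fixes can be routed through a small helper. to_same_device below is a hypothetical utility, not part of UniAD, and the commented usage lines simply reuse the variable names from the summary above:

import torch

def to_same_device(index: torch.Tensor, ref: torch.Tensor) -> torch.Tensor:
    # Return `index` on ref's device so that `ref[index]` is always a same-device lookup.
    return index if index.device == ref.device else index.to(ref.device)

# Possible usage inside compute_matched_gt_traj (names from the summary above):
# matched_gt_idx = to_same_device(all_matched_idxes[i], gt_bboxes_3d[i][-1].tensor)
# valid_traj_masks = matched_gt_idx >= 0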

deffery · Sep 23 '25 06:09

Excellent summary. I just wonder why nobody else has pointed out such an obvious bug before now.

wzh506 · Sep 25 '25 10:09