'assert (boxes1[:, 2:] >= boxes1[:, :2]).all()' fails when training with AMP
Thanks for your great work!! When I enabled AMP training in detectron2, I ran into an issue with invalid boxes during training.
Changed
The only difference from the original config is the following:
SOLVER:
  STEPS: (210000, 250000)
  MAX_ITER: 270000
  AMP:
    ENABLED: true
Error
[05/24 20:54:12 d2.engine.hooks]: Total training time: 0:00:10 (0:00:00 on hooks)
[05/24 20:54:12 d2.utils.events]: iter: 0 lr: N/A max_mem: 5095M
Traceback (most recent call last):
File "train_net.py", line 134, in <module>
launch(
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/launch.py", line 55, in launch
mp.spawn(
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
main_func(*args)
File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/train_net.py", line 128, in main
return trainer.train()
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 431, in train
super().train(self.start_iter, self.max_iter)
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 138, in train
self.run_step()
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 441, in run_step
self._trainer.run_step()
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 332, in run_step
loss_dict = self.model(data)
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/detector.py", line 143, in forward
loss_dict = self.criterion(output, targets)
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 147, in forward
indices = self.matcher(outputs_without_aux, targets)
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/masato/anaconda3/envs/protor/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 266, in forward
cost_giou = -generalized_box_iou(out_bbox, tgt_bbox)
File "/home/masato/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/util/box_ops.py", line 51, in generalized_box_iou
assert (boxes1[:, 2:] >= boxes1[:, :2]).all()
AssertionError
Decreasing the learning rate doesn't help, and this error occurs only with mixed-precision training. Is there any suggestion on how to solve this problem?
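For reference, here is the quick check I plan to add right before that assertion (just a sketch; out_bbox / tgt_bbox follow the variable names in loss.py, and my guess that fp16 produces NaN or overflowed coordinates is unconfirmed):

import torch

def report_bad_boxes(name, boxes):
    """Print diagnostics for an (N, 4) tensor of boxes in (x1, y1, x2, y2) format."""
    nan_or_inf = ~torch.isfinite(boxes)
    if nan_or_inf.any():
        print(f"{name}: {nan_or_inf.any(dim=1).sum().item()} boxes with NaN/Inf "
              f"(dtype={boxes.dtype})")
    degenerate = (boxes[:, 2:] < boxes[:, :2]).any(dim=1)
    if degenerate.any():
        print(f"{name}: {degenerate.sum().item()} boxes with x2 < x1 or y2 < y1")
        print(boxes[degenerate])

# Intended usage, just before cost_giou = -generalized_box_iou(out_bbox, tgt_bbox):
# report_bad_boxes("out_bbox", out_bbox)
# report_bad_boxes("tgt_bbox", tgt_bbox)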
Thank you.
Hi~ Can you try removing the GIoU term, in both the matching cost and the loss, to see whether this error still occurs?
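For example, something along these lines (a toy sketch only; the real weights and variable names live in sparsercnn/loss.py and may differ):

import torch

# Toy illustration of removing the GIoU term from the combined matching cost.
# Shapes and weights below are made up for the example.
num_queries, num_targets = 100, 7
cost_class = torch.rand(num_queries, num_targets)
cost_bbox = torch.rand(num_queries, num_targets)
cost_giou = torch.rand(num_queries, num_targets)

w_class, w_bbox, w_giou = 2.0, 5.0, 2.0
# Original combination (DETR-style matcher):
#   C = w_class * cost_class + w_bbox * cost_bbox + w_giou * cost_giou
# For the test, drop the GIoU term (and likewise drop loss_giou from the loss dict):
C = w_class * cost_class + w_bbox * cost_bbox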
@PeizeSun Thank you for your suggestion. After commenting out the GIoU terms, I got a new error...
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/launch.py", line 94, in _distributed_worker
main_func(*args)
File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/train_net.py", line 128, in main
return trainer.train()
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 431, in train
super().train(self.start_iter, self.max_iter)
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 138, in train
self.run_step()
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 441, in run_step
self._trainer.run_step()
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 332, in run_step
loss_dict = self.model(data)
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/detector.py", line 143, in forward
loss_dict = self.criterion(output, targets)
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 147, in forward
indices = self.matcher(outputs_without_aux, targets)
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 274, in forward
indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
File "/home/fujitake/works/SparseR-CNN/projects/SparseRCNN/sparsercnn/loss.py", line 274, in <listcomp>
indices = [linear_sum_assignment(c[i]) for i, c in enumerate(C.split(sizes, -1))]
File "/home/fujitake/anaconda3/envs/protor/lib/python3.8/site-packages/scipy/optimize/_lsap.py", line 101, in linear_sum_assignment
a, b = _lsap_module.calculate_assignment(cost_matrix.T)
ValueError: matrix contains invalid numeric entries
Can you print out cost_matrix to see which entry is invalid?
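For example, with a small helper like this (a rough sketch; C and sizes follow the names in loss.py):

import torch

def report_invalid_entries(name, cost):
    """Locate NaN/Inf entries in a 2-D cost matrix before linear_sum_assignment."""
    bad = ~torch.isfinite(cost)
    if bad.any():
        rows, cols = bad.nonzero(as_tuple=True)
        print(f"{name}: {bad.sum().item()} invalid entries, first at "
              f"[{rows[0].item()}, {cols[0].item()}] = {cost[rows[0], cols[0]].item()}")

# Intended usage in loss.py, just before the assignment:
# for i, c in enumerate(C.split(sizes, -1)):
#     report_invalid_entries(f"cost matrix for image {i}", c[i])
#     indices.append(linear_sum_assignment(c[i]))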
I am getting the same issue when I run Sparse R-CNN with a learning rate of 0.02 on 8 GPUs. Did you find a solution to this problem? If you did, it would be a great help. @Swall0w