:skull: Bug
Hello,
Whenever I train nnDetection on my data, the network trains for a few epochs and then fails with the following error: "RuntimeError: CUDA out of memory. Tried to allocate 436.00 MiB (GPU 0; 10.76 GiB total capacity; 7.83 GiB already allocated; 345.44 MiB free; 9.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF" (full traceback below). I am using an RTX 2080 GPU and 12 threads for augmentation.
Some of my images contain nearly 200 instances. Could that be the cause? The error occurs while the code computes the IoU between two sets of boxes (of sizes N and M) as an N×M tensor, so with too many instances that tensor might simply become too large. How can I solve this problem?
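For a rough sense of scale, here is a back-of-the-envelope estimate of one such [N, M] intermediate (the anchor count below is a made-up placeholder, not a value taken from my actual plan):

```python
# Back-of-the-envelope estimate only; the anchor count is a placeholder, not a
# value read out of my run.
num_boxes = 200          # ground-truth instances in one patch (worst case)
num_anchors = 300_000    # assumed order of magnitude for a 3D patch

# The 3D IoU in the traceback materialises several float32 intermediates of
# shape [N, M] (x1, x2, y1, y2, z1, z2, inter), each costing roughly:
mib_per_matrix = num_boxes * num_anchors * 4 / 1024**2
print(f"{mib_per_matrix:.0f} MiB per [N, M] intermediate")  # ~229 MiB
```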
Traceback (most recent call last):
File "/home/ma2257/.conda/envs/nndet_v2/bin/nndet_train", line 33, in
sys.exit(load_entry_point('nndet', 'console_scripts', 'nndet_train')())
File "/home/ma2257/new_nnDetection/nnDetection/nndet/utils/check.py", line 58, in wrapper
return func(*args, **kwargs)
File "/home/ma2257/new_nnDetection/nnDetection/scripts/train.py", line 69, in train
_train(
File "/home/ma2257/new_nnDetection/nnDetection/scripts/train.py", line 289, in _train
trainer.fit(module, datamodule=datamodule)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
self._run(model)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
self._dispatch()
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
self.accelerator.start_training(self)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
self.training_type_plugin.start_training(trainer)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
self._results = trainer.run_stage()
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
return self._run_train()
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
self.fit_loop.run()
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
epoch_output = self.epoch_loop.run(train_dataloader)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 101, in run
super().run(batch, batch_idx, dataloader_idx)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
self.advance(*args, **kwargs)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 148, in advance
result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 202, in _run_optimization
self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 396, in _optimizer_step
model_ref.optimizer_step(
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1618, in optimizer_step
optimizer.step(closure=optimizer_closure)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 292, in optimizer_step
make_optimizer_step = self.precision_plugin.pre_optimizer_step(
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 59, in pre_optimizer_step
result = lambda_closure()
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 236, in _training_step_and_backward_closure
result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 537, in training_step_and_backward
result = self._training_step(split_batch, batch_idx, opt_idx, hiddens)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 307, in _training_step
training_step_output = self.trainer.accelerator.training_step(step_kwargs)
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 193, in training_step
return self.training_type_plugin.training_step(*step_kwargs.values())
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 172, in training_step
return self.model.training_step(*args, **kwargs)
File "/home/ma2257/new_nnDetection/nnDetection/nndet/ptmodule/retinaunet/base.py", line 146, in training_step
losses, _ = self.model.train_step(
File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/retina.py", line 125, in train_step
labels, matched_gt_boxes = self.assign_targets_to_anchors(
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/retina.py", line 249, in assign_targets_to_anchors
match_quality_matrix, matched_idxs = self.proposal_matcher(
File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/boxes/matcher.py", line 93, in call
return self.compute_matches(
File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/boxes/matcher.py", line 302, in compute_matches
match_quality_matrix = self.similarity_fn(boxes, anchors) # [num_boxes x anchors]
File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
return func(*args, **kwargs)
File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/boxes/ops.py", line 102, in box_iou
return box_iou_union_3d(boxes1.float(), boxes2.float(), eps=eps)[0]
File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/boxes/ops.py", line 157, in box_iou_union_3d
inter = ((x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0) * (z2 - z1).clamp(min=0)) + eps # [N, M]
RuntimeError: CUDA out of memory. Tried to allocate 436.00 MiB (GPU 0; 10.76 GiB total capacity; 7.83 GiB already allocated; 345.44 MiB free; 9.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
bypassing sigterm
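(As an aside, the allocator option that the error message points to can be set before training starts, but as far as I understand it only mitigates fragmentation and would not shrink the [N, M] tensor itself:)

```python
# Sets the option suggested in the error message; it changes how PyTorch's
# caching allocator splits blocks (fragmentation), not how much memory the
# IoU matrix itself needs.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:512"  # set before the first CUDA allocation

import torch  # import / initialise CUDA only after the variable is set
```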
Best regards,
Joe
Dear @JoeNajm,
thank you for the problem report. Indeed, a very high number of objects can be difficult to handle right now: the planning stage already includes an approximation to determine the maximum number of objects in a patch seen during training. Since we can only approximate this number, it may underestimate the memory consumption when there is an extremely large number of objects. The easiest fix is to increase the offset here: https://github.com/MIC-DKFZ/nnDetection/blob/c45a49dc7044cd061f2a0e9efc0b0e331485f3d1/nndet/planning/estimator.py#L68
This increases the safety buffer during the estimation and thus yields a smaller patch size (and potentially an adapted architecture). After changing the line, it is necessary to rerun the planning stage (`nndet_prep XXX -o prep=nothing prep.plan=True`).
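For illustration, this is the effect of the offset in spirit (names and numbers below are illustrative only, not the actual code in estimator.py; the linked line is authoritative):

```python
# Illustrative sketch only: how a larger additive safety offset makes the
# planner reject borderline patch sizes. Names and numbers are hypothetical,
# not the actual estimator.py implementation.

def fits_on_gpu(estimated_mb: float, gpu_mb: float, offset_mb: float) -> bool:
    """Accept a candidate patch size only if the measured estimate plus the
    safety buffer still fits into GPU memory."""
    return estimated_mb + offset_mb <= gpu_mb

# On an ~11 GB card, a 9.8 GB estimate passes with a small buffer but is
# rejected with a larger one, forcing planning to pick a smaller patch size.
print(fits_on_gpu(9800, 11019, offset_mb=768))    # True
print(fits_on_gpu(9800, 11019, offset_mb=1500))   # False
```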
Best,
Michael
Dear @mibaumgartner,
Thank you for your response. I have tried increasing the offset (from 768 to 1100), but I still get the error, although less frequently. What is the maximum offset I can choose?
(I am using an NVIDIA GeForce RTX 2080 with 11019 MiB of memory.)
Best,
Joe