
[Bug] RuntimeError: CUDA out of memory


:skull: Bug

Hello, whenever I train nnDetection on my data, the network first trains for a few epochs and then I get the following error: "RuntimeError: CUDA out of memory. Tried to allocate 436.00 MiB (GPU 0; 10.76 GiB total capacity; 7.83 GiB already allocated; 345.44 MiB free; 9.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF" (full error message below). I am using an RTX 2080 GPU and 12 threads for augmentation. Some of my images contain nearly 200 instances; could that be the issue? The error occurs while the code computes the IoU between two sets of boxes (of sizes N and M) as an N×M tensor, so if there are too many instances, that tensor might become too large (see the rough estimate sketched after the traceback). How can I solve this problem?

Traceback (most recent call last):
  File "/home/ma2257/.conda/envs/nndet_v2/bin/nndet_train", line 33, in <module>
    sys.exit(load_entry_point('nndet', 'console_scripts', 'nndet_train')())
  File "/home/ma2257/new_nnDetection/nnDetection/nndet/utils/check.py", line 58, in wrapper
    return func(*args, **kwargs)
  File "/home/ma2257/new_nnDetection/nnDetection/scripts/train.py", line 69, in train
    _train(
  File "/home/ma2257/new_nnDetection/nnDetection/scripts/train.py", line 289, in _train
    trainer.fit(module, datamodule=datamodule)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 553, in fit
    self._run(model)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 918, in _run
    self._dispatch()
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 986, in _dispatch
    self.accelerator.start_training(self)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 92, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 161, in start_training
    self._results = trainer.run_stage()
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 996, in run_stage
    return self._run_train()
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1045, in _run_train
    self.fit_loop.run()
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/fit_loop.py", line 200, in advance
    epoch_output = self.epoch_loop.run(train_dataloader)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 130, in advance
    batch_output = self.batch_loop.run(batch, self.iteration_count, self._dataloader_idx)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 101, in run
    super().run(batch, batch_idx, dataloader_idx)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/base.py", line 111, in run
    self.advance(*args, **kwargs)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 148, in advance
    result = self._run_optimization(batch_idx, split_batch, opt_idx, optimizer)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 202, in _run_optimization
    self._optimizer_step(optimizer, opt_idx, batch_idx, closure)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 396, in _optimizer_step
    model_ref.optimizer_step(
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/core/lightning.py", line 1618, in optimizer_step
    optimizer.step(closure=optimizer_closure)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 209, in step
    self.__optimizer_step(*args, closure=closure, profiler_name=profiler_name, **kwargs)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/core/optimizer.py", line 129, in __optimizer_step
    trainer.accelerator.optimizer_step(optimizer, self._optimizer_idx, lambda_closure=closure, **kwargs)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 292, in optimizer_step
    make_optimizer_step = self.precision_plugin.pre_optimizer_step(
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/plugins/precision/native_amp.py", line 59, in pre_optimizer_step
    result = lambda_closure()
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 236, in _training_step_and_backward_closure
    result = self.training_step_and_backward(split_batch, batch_idx, opt_idx, optimizer, hiddens)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 537, in training_step_and_backward
    result = self._training_step(split_batch, batch_idx, opt_idx, hiddens)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/loops/batch/training_batch_loop.py", line 307, in _training_step
    training_step_output = self.trainer.accelerator.training_step(step_kwargs)
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 193, in training_step
    return self.training_type_plugin.training_step(*step_kwargs.values())
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 172, in training_step
    return self.model.training_step(*args, **kwargs)
  File "/home/ma2257/new_nnDetection/nnDetection/nndet/ptmodule/retinaunet/base.py", line 146, in training_step
    losses, _ = self.model.train_step(
  File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/retina.py", line 125, in train_step
    labels, matched_gt_boxes = self.assign_targets_to_anchors(
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/retina.py", line 249, in assign_targets_to_anchors
    match_quality_matrix, matched_idxs = self.proposal_matcher(
  File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/boxes/matcher.py", line 93, in __call__
    return self.compute_matches(
  File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/boxes/matcher.py", line 302, in compute_matches
    match_quality_matrix = self.similarity_fn(boxes, anchors)  # [num_boxes x anchors]
  File "/home/ma2257/.conda/envs/nndet_v2/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 12, in decorate_autocast
    return func(*args, **kwargs)
  File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/boxes/ops.py", line 102, in box_iou
    return box_iou_union_3d(boxes1.float(), boxes2.float(), eps=eps)[0]
  File "/home/ma2257/new_nnDetection/nnDetection/nndet/core/boxes/ops.py", line 157, in box_iou_union_3d
    inter = ((x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0) * (z2 - z1).clamp(min=0)) + eps  # [N, M]
RuntimeError: CUDA out of memory. Tried to allocate 436.00 MiB (GPU 0; 10.76 GiB total capacity; 7.83 GiB already allocated; 345.44 MiB free; 9.38 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

bypassing sigterm (message repeated many times)
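For reference, a rough back-of-the-envelope sketch of why this matrix can get so large (added for illustration; the anchor count below is hypothetical, not taken from this setup). The traceback suggests box_iou_union_3d builds several [N, M]-shaped float32 temporaries (the intersection, union, and broadcasting buffers), so memory grows with the number of ground-truth boxes times the number of anchors:

def iou_matrix_mib(num_boxes: int, num_anchors: int,
                   bytes_per_elem: int = 4, num_intermediates: int = 4) -> float:
    """Approximate peak memory (MiB) of the pairwise [N, M] IoU computation.

    `num_intermediates` is a guess at how many [N, M]-sized temporaries are
    alive at the same time (intersection, union, the IoU itself, broadcast buffers).
    """
    return num_boxes * num_anchors * bytes_per_elem * num_intermediates / 2 ** 20


# Numbers in the ballpark of this report: ~200 instances in a patch and a
# hypothetical 150k anchors (the real count depends on patch size and anchor setup).
print(f"{iou_matrix_mib(200, 150_000):.0f} MiB")  # ~458 MiB, the same order as the failed 436 MiB allocation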

Best regards, Joe

JoeNajm avatar Aug 10 '22 16:08 JoeNajm

Dear @JoeNajm,

thank you for the problem report. Indeed, a very high number of objects can be difficult to handle right now: the planning stage already includes an approximation of the maximum number of objects per patch that will be seen during training. Since this number can only be approximated, the memory consumption may be underestimated when there is an extremely large number of objects. The easiest fix is to increase the offset here: https://github.com/MIC-DKFZ/nnDetection/blob/c45a49dc7044cd061f2a0e9efc0b0e331485f3d1/nndet/planning/estimator.py#L68

This will increase the safety buffer during the estimation and thus yield a smaller patch size (and potentially adapted architectures). After changing the line, it is necessary to rerun the planning stage (nndet_prep XXX -o prep=nothing prep.plan=True).
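For illustration only, a minimal sketch of the role this offset plays, assuming the linked line adds a fixed safety buffer to the estimated memory consumption; the function and variable names below are hypothetical, not the actual nnDetection code:

# Purely illustrative sketch: the planner's memory estimator adds a fixed safety
# offset on top of the estimated consumption before deciding whether a candidate
# patch size fits on the GPU, so a larger offset leads to a more conservative
# (smaller) patch size.

FIXED_OFFSET_MB = 768  # prior value mentioned later in this thread; raise it, e.g. to 1100 or more

def fits_on_gpu(estimated_mb: float, gpu_budget_mb: float,
                offset_mb: float = FIXED_OFFSET_MB) -> bool:
    """Conservative check: estimated consumption plus safety buffer must fit."""
    return estimated_mb + offset_mb <= gpu_budget_mb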

Best, Michael

mibaumgartner avatar Aug 11 '22 07:08 mibaumgartner

Dear @mibaumgartner,

Thank you for your response. I've been trying to increase the offset (I went from 768 to 1100), but I still get the error, although less frequently. I was wondering: what is the maximum offset I can choose? (I am using an NVIDIA GeForce RTX 2080 with 11019 MiB of memory.)

Best, Joe

JoeNajm avatar Aug 18 '22 11:08 JoeNajm

Hi @JoeNajm,

Did you find a solution for this?

sovanlal avatar Feb 01 '23 16:02 sovanlal

This issue is stale because it has been open for 30 days with no activity.

github-actions[bot] avatar Jan 02 '24 11:01 github-actions[bot]

This issue was closed because it has been inactive for 14 days since being marked as stale.

github-actions[bot] avatar Jan 17 '24 00:01 github-actions[bot]