
Auto3Dseg cuda OOM during Ensembling

Open udiram opened this issue 2 years ago • 14 comments

Describe the bug: All models have finished training, but CUDA runs out of memory during the ensembling step.

Steps to reproduce the behavior: Run AutoRunner on the AMOS22 dataset.

Manually clearing the CUDA cache, restarting the kernel, and restarting the instance all lead back to this error.

Expected behavior: Training proceeds without error.

MONAI version: 1.2.0
Numpy version: 1.25.2
Pytorch version: 2.0.1+cu117
MONAI flags: HAS_EXT = False, USE_COMPILED = False, USE_META_DICT = False
MONAI rev id: c33f1ba588ee00229a309000e888f9817b4f1934
MONAI file: /home/exouser/.local/lib/python3.10/site-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: 0.4.11
ITK version: 5.3.0
Nibabel version: 5.1.0
scikit-image version: 0.21.0
Pillow version: 9.0.1
Tensorboard version: 2.14.0
gdown version: 4.7.1
TorchVision version: 0.15.2+cu117
tqdm version: 4.66.1
lmdb version: 1.4.1
psutil version: 5.9.0
pandas version: 2.0.3
einops version: 0.6.1
transformers version: 4.21.3
mlflow version: 2.6.0
pynrrd version: 1.0.0

Environment:
OS: Ubuntu 22.04
Python 3.10.12
Driver Version: 525.85.05
CUDA Version: 12.0
GRID A100X-40C - 125GB RAM


I'm happy to provide any other logs to help. This is the second time I've run into this issue, and it persists after a full kernel restart and clearing RAM.

udiram • Sep 04 '23 15:09
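
For readers who want to check the "resetting the CUDA cache" step from the report, here is a minimal sketch using standard PyTorch calls. The helper name `cuda_memory_report` is made up for illustration and is not part of MONAI or Auto3DSeg. Note that `torch.cuda.empty_cache()` only releases cached blocks that are no longer in use, so it would not be expected to lower the peak memory that the ensembling step itself requests.

```python
import gc


def cuda_memory_report(device: int = 0):
    """Return (allocated, reserved) bytes for a CUDA device, or None if CUDA is unavailable.

    Hypothetical helper for illustration; not part of MONAI.
    """
    try:
        import torch  # imported lazily so the sketch also loads on machines without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    gc.collect()              # drop unreachable Python objects that may still hold tensors
    torch.cuda.empty_cache()  # hand cached, unused blocks back to the driver
    return torch.cuda.memory_allocated(device), torch.cuda.memory_reserved(device)


report = cuda_memory_report()
print(report)
```

If the allocated value is already close to zero before ensembling starts, stale memory is probably not the cause, and the OOM more likely reflects the size of the ensembling workload.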

Hi @dongyang0122, could you please share some comments here? Thanks in advance!

KumoLiu • Sep 05 '23 02:09

Hi @KumoLiu, just following up on this. Are there any other similar issues I could reference to troubleshoot? Thanks!

udiram • Sep 07 '23 18:09

Hi @udiram, here are some similar issues you could refer to:
https://github.com/Project-MONAI/tutorials/discussions/1089
https://github.com/Project-MONAI/tutorials/discussions/975

Thanks!

KumoLiu • Sep 08 '23 02:09

Hi Kumo, #1089 worked for me to get training going. As I mentioned in that issue, the same fixes applied here (i.e. setting the spacing in swinunetr to 1.5, 1.5, 1.5), so thanks for this!

I am, however, still running into the ensembling issue, which doesn't seem to be addressed in #975 specifically. The good thing about the crash happening so late is that the inferences from the test images are indeed saved. What I am missing is a model.pth that I can run on some ground truth images, as I had hoped to do. Do you know if there is a way to extract this, similar to a model trained with the https://github.com/Project-MONAI/tutorials/blob/main/3d_segmentation/swin_unetr_btcv_segmentation_3d.ipynb routine? Once I have that model file, I shouldn't necessarily need to go through the rest of the auto3dseg pipeline.

All of this with the caveat that it would be great if Auto3dseg did this automatically without GPU issues!

udiram • Sep 08 '23 15:09

Hi @udiram, I looked at the source code and found that the model is saved under "bundle_root/models".
https://github.com/Project-MONAI/research-contributions/blob/0cd69f2a64b727ab8103d30512b39a7eb6a09ed3/auto3dseg/algorithm_templates/segresnet/scripts/segmenter.py#L1136
https://github.com/Project-MONAI/research-contributions/blob/0cd69f2a64b727ab8103d30512b39a7eb6a09ed3/auto3dseg/algorithm_templates/segresnet/configs/hyper_parameters.yaml#L2

Thanks!

KumoLiu • Sep 11 '23 04:09

Thanks @KumoLiu! I'll give it a go!

udiram • Sep 12 '23 15:09

Hi @KumoLiu, is there anywhere I can see which model performed best during training, so that I can run inference using that model? I notice that every fold for every model has an associated .pt file, but I'm not seeing a global best model/fold.

Thanks

udiram • Sep 12 '23 23:09

Hi @udiram, I think "model.pt" is the best model for each fold. A final model is also saved. You may need to ensemble to get the final result.
https://github.com/Project-MONAI/research-contributions/blob/0cd69f2a64b727ab8103d30512b39a7eb6a09ed3/auto3dseg/algorithm_templates/segresnet/scripts/segmenter.py#L1288-L1295
https://github.com/Project-MONAI/research-contributions/blob/0cd69f2a64b727ab8103d30512b39a7eb6a09ed3/auto3dseg/algorithm_templates/segresnet/scripts/segmenter.py#L1136-L1137

KumoLiu • Sep 13 '23 03:09
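
To see which per-fold checkpoints a run has produced, one option is to search the output directory for files named "model.pt". This is only a sketch: the function name `find_checkpoints` and the `work_dir` argument are hypothetical, and the exact folder layout under the work directory is not assumed, which is why the search is recursive.

```python
from pathlib import Path


def find_checkpoints(work_dir, filename="model.pt"):
    """Recursively collect checkpoint files called `filename` under `work_dir`.

    Hypothetical helper; the directory layout of an Auto3DSeg run is not assumed.
    """
    return sorted(p for p in Path(work_dir).rglob(filename) if p.is_file())
```

Calling `find_checkpoints("path/to/work_dir")` returns a sorted list of paths, one per checkpoint found, which can then be matched to algorithms and folds by their parent folder names.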

Hi @KumoLiu, thanks for the info. I guess I'm a bit stuck until this ensembling issue is figured out. Is there anything else, debugging- or log-wise, that you or @dongyang0122 need in order to figure it out?

Thanks!

udiram • Sep 13 '23 13:09

Hi @udiram, for how to ensemble, you can refer to:
https://github.com/Project-MONAI/tutorials/blob/main/modules/cross_validation_models_ensemble.ipynb
https://github.com/Project-MONAI/MONAI/blob/281cb0119c01eaa8e6c841880b91f92f45e8d7f7/monai/apps/auto3dseg/ensemble_builder.py#L404

Thanks!

KumoLiu • Sep 18 '23 02:09

Hi @KumoLiu,

Thanks for the resources. Does this integrate into the Auto3dseg pipeline in any way? Is there any way to point the ensembler at the files generated by auto3dseg?

Thanks

udiram • Sep 18 '23 02:09

Hi @udiram, yes, it has already been integrated into the AutoRunner. https://github.com/Project-MONAI/MONAI/blob/281cb0119c01eaa8e6c841880b91f92f45e8d7f7/monai/apps/auto3dseg/auto_runner.py#L815

You can also override it by:

from monai.apps.auto3dseg import AutoRunner
runner = AutoRunner(input=input)  # input: path to the task YAML file, or a config dict
runner.set_ensemble_method(ensemble_method_name="AlgoEnsembleBestByFold")

Just FYI: https://github.com/Project-MONAI/tutorials/tree/main/auto3dseg/notebooks

Thanks!

KumoLiu • Sep 18 '23 06:09
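
As a rough illustration of what a "best by fold" selection means, the sketch below keeps the highest-scoring model in each fold. It is plain Python written for this thread, not MONAI's implementation of AlgoEnsembleBestByFold, and the scores are made up.

```python
def best_by_fold(results):
    """Pick the top-scoring algorithm per fold.

    `results` is an iterable of (algo_name, fold, score) tuples.
    Returns {fold: (algo_name, score)}. Illustrative only, not MONAI code.
    """
    best = {}
    for name, fold, score in results:
        if fold not in best or score > best[fold][1]:
            best[fold] = (name, score)
    return best


# Made-up validation scores, for illustration only.
results = [
    ("segresnet", 0, 0.81),
    ("swinunetr", 0, 0.79),
    ("segresnet", 1, 0.77),
    ("swinunetr", 1, 0.80),
]
selected = best_by_fold(results)
print(selected)
```

With this selection, the ensemble would contain one model per fold rather than every trained model, which also answers the earlier question about why there is no single global best checkpoint.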

Sure, I'll give the override a try. Do you have any ideas on how to run ensembling with less GPU usage, similar to the fix during validation for #1089?

Thanks!

udiram • Sep 18 '23 15:09
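
The thread does not confirm an Auto3DSeg setting for this, so the following is only a general technique, not a known fix: averaging predictions one at a time, optionally moving each to host memory first, keeps just one prediction plus an accumulator alive instead of a full stack. The names `running_mean` and `to_host` are hypothetical.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def running_mean(preds: Iterable[T], to_host: Callable[[T], T] = lambda p: p) -> T:
    """Average predictions incrementally instead of stacking them all in memory.

    `to_host` is applied to each prediction first; for torch tensors it could be
    `lambda p: p.cpu()` (an assumption, not an Auto3DSeg option).
    """
    total = None
    count = 0
    for pred in preds:
        pred = to_host(pred)
        total = pred if total is None else total + pred
        count += 1
    if count == 0:
        raise ValueError("no predictions to average")
    return total / count
```

Passing a generator that runs one model at a time means each prediction can be released before the next one is computed, at the cost of slower averaging on the CPU.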

Hi @KumoLiu, just following up on this issue!

udiram • Sep 27 '23 14:09