Resume with torchrun: The CPU memory consumption keeps increasing when using the train code of image classification with "--resume" in references.

Open JeremyCJM opened this issue 3 years ago • 0 comments

🐛 Describe the bug

I am training ViT-B/16 using the reference code provided by PyTorch: https://github.com/pytorch/vision/blob/a61e6ef6ff5af041661ecc70b1a7e3dacb2240b6/references/classification/train.py.

However, when I resume training with this code in distributed mode (using torchrun), CPU memory consumption increases epoch by epoch, and the program is eventually terminated several epochs after resuming. The training command is:

    torchrun --nproc_per_node=8 train.py \
        --model vit_b_16 --epochs 300 --batch-size 128 --opt adamw --lr 0.003 --wd 0.3 \
        --lr-scheduler cosineannealinglr --lr-warmup-method linear --lr-warmup-epochs 30 \
        --lr-warmup-decay 0.033 --amp --label-smoothing 0.11 --mixup-alpha 0.2 --auto-augment ra \
        --clip-grad-norm 1 --ra-sampler --cutmix-alpha 1.0 --model-ema \
        --resume path_to_my_checkpoint

Could you help solve this bug?
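To narrow down where the growth happens, a small helper like the one below could be called at the end of each epoch in the training loop. This is only a minimal sketch and not part of the reference script: `rss_mib` and `log_epoch_memory` are hypothetical names, and the helper assumes Linux (it reads `/proc/self/statm`) and relies on the `RANK` environment variable that torchrun sets. It measures only the calling process, so memory held by DataLoader worker processes would not show up here.

```python
import os
import resource


def rss_mib() -> float:
    """Return the resident set size of the current process in MiB."""
    try:
        # Second field of /proc/self/statm is the resident size in pages (Linux only).
        with open("/proc/self/statm") as f:
            resident_pages = int(f.read().split()[1])
        return resident_pages * os.sysconf("SC_PAGE_SIZE") / 2**20
    except (OSError, ValueError, IndexError):
        # Fallback: peak (not current) RSS. ru_maxrss is in KiB on Linux;
        # the unit differs on other platforms.
        return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024


def log_epoch_memory(epoch: int, history: list) -> float:
    """Print the RSS of this rank and its change since the previous call."""
    rank = os.environ.get("RANK", "0")  # set by torchrun; "0" if run standalone
    current = rss_mib()
    delta = current - history[-1] if history else 0.0
    history.append(current)
    print(f"[rank {rank}] epoch {epoch}: RSS {current:.1f} MiB ({delta:+.1f} MiB)")
    return current
```

For example, creating `history = []` before the epoch loop and calling `log_epoch_memory(epoch, history)` after each epoch, once with `--resume` and once without, would show whether the per-epoch increase appears only in the resumed run.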

Versions

PyTorch version: 1.12.1+cu113
Is debug build: False
CUDA used to build PyTorch: 11.3
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: version 3.24.0-rc2
Libc version: glibc-2.31

Python version: 3.9.7 (default, Sep 16 2021, 13:09:58) [GCC 7.5.0] (64-bit runtime)
Python platform: Linux-5.4.0-121-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: NVIDIA GeForce RTX 3090
GPU 1: NVIDIA GeForce RTX 3090
GPU 2: NVIDIA GeForce RTX 3090
GPU 3: NVIDIA GeForce RTX 3090
GPU 4: NVIDIA GeForce RTX 3090
GPU 5: NVIDIA GeForce RTX 3090
GPU 6: NVIDIA GeForce RTX 3090
GPU 7: NVIDIA GeForce RTX 3090

Nvidia driver version: 515.48.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.19.5
[pip3] numpydoc==1.1.0
[pip3] pytorch3d==0.6.1
[pip3] torch==1.12.1+cu113
[pip3] torchaudio==0.12.1+cu113
[pip3] torchvision==0.13.1+cu113
[conda] blas 1.0 mkl
[conda] cudatoolkit 11.3.1 h2bc3f7f_2
[conda] mkl 2021.4.0 h06a4308_640
[conda] mkl-service 2.4.0 py39h7f8727e_0
[conda] mkl_fft 1.3.1 py39hd3c417c_0
[conda] mkl_random 1.2.2 py39h51133e4_0
[conda] numpy 1.19.5 pypi_0 pypi
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] pytorch3d 0.6.1 pypi_0 pypi
[conda] torch 1.12.1+cu113 pypi_0 pypi
[conda] torchaudio 0.12.1+cu113 pypi_0 pypi
[conda] torchvision 0.13.1+cu113 pypi_0 pypi

Note that for torchvision, I replaced the installed package directly with the version from GitHub as of August 10.

Aug 15 '22 03:08 JeremyCJM