Gordon Wang (Guanghui Wang)
I'm using 8xV100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error during training.
After making the following changes, it worked on a single GPU: 1. Add `retain_graph=True` to the call `torch.autograd.backward(outputs_with_grad, args_with_grad)` at line 157 of /opt/conda/envs/animate/lib/python3.10/site-packages/torch/utils/checkpoint.py. 2. Set `use_8bit_adam` and `gradient_checkpointing` to True...
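For anyone unfamiliar with the second change, here is a minimal, self-contained sketch of what gradient checkpointing does (this is generic PyTorch, not the repo's actual training code): activations inside the checkpointed block are discarded in the forward pass and recomputed during backward, trading compute for peak memory.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy block standing in for a heavy sub-network (names are illustrative).
class Block(nn.Module):
    def __init__(self, dim=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        return self.net(x)

block = Block()
x = torch.randn(4, 16, requires_grad=True)

# Run the block under checkpointing: intermediate activations are not
# stored; they are recomputed when backward() reaches this block.
y = checkpoint(block, x, use_reentrant=False)
loss = y.sum()
loss.backward()

print(x.grad is not None)  # gradients still flow through the block
```

With `use_reentrant=False` (the newer checkpointing implementation), the `retain_graph` workaround mentioned above is typically unnecessary; the patch to checkpoint.py only applies to the older reentrant path.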
> @guanghui0607, can you try testing it with 8xA100, each with 40GB? I think that would work. Thanks!

I haven't got an A100 yet; I'll try once I get...
Managed to train the model on 8xV100 GPUs by just setting `use_8bit_adam` to true and `train_width`/`train_height` to 256 in the default config file.
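As a rough illustration, the change above might look like this in the training config. The section names here are assumptions; only the field names (`use_8bit_adam`, `train_width`, `train_height`) come from this thread, and everything else in the file stays unchanged:

```yaml
# Hypothetical excerpt of the default training config.
solver:
  use_8bit_adam: true   # bitsandbytes 8-bit Adam shrinks optimizer state

data:
  train_width: 256      # lower training resolution to fit in 32GB V100s
  train_height: 256
```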