Gordon Wang (Guanghui Wang)

Results 6 comments of Gordon Wang (Guanghui Wang)

I'm using 8xV100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still got an OOM error during training.

After making the following changes, it worked on a single GPU:

1. Add `retain_graph=True` to the call `torch.autograd.backward(outputs_with_grad, args_with_grad)` at line 157 of `/opt/conda/envs/animate/lib/python3.10/site-packages/torch/utils/checkpoint.py`.
2. Set `use_8bit_adam` and `gradient_checkpointing` to True...
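For context, `retain_graph=True` tells autograd to keep the computation graph alive after `backward()`, so the same graph segment can be traversed again without the "Trying to backward through the graph a second time" error. A minimal standalone sketch of that effect (this is not the checkpoint.py patch itself, just an illustration of the flag):

```python
import torch

# retain_graph=True keeps the autograd graph alive after backward(),
# which is what the checkpoint.py edit above relies on when the same
# graph segment must be traversed more than once.
x = torch.ones(3, requires_grad=True)
y = (x * 2).sum()

y.backward(retain_graph=True)  # first backward; graph is retained
y.backward()                   # second backward would error without retain_graph
print(x.grad)                  # grads accumulate: 2 + 2 = 4 per element
```

Note the tradeoff: retaining the graph holds on to intermediate activations, so it can increase peak memory; here it was needed for the checkpointed backward to run at all.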

> > I'm using 8xV100(32GB), even I set 1 to batch size and 8 to height/width, 1 to sample_margin, still got the OOM error in training.
>
> Can you...

> @guanghui0607, Can you try testing it with A100x8, each having 40GB? I think this is gonna work. Thanks!

I haven't got an A100 yet; I will try once I get...

> > > > I'm using 8xV100(32GB), even I set 1 to batch size and 8 to height/width, 1 to sample_margin, still got the OOM error in training.
> >
> > ...

Managed to train the model on 8xV100 GPUs by just changing `use_8bit_adam` to true and `train_width`/`train_height` to 256 in the default config file.
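For reference, the corresponding edits in the default config would look roughly like this; the exact file layout depends on the repo, and only the key names `use_8bit_adam`, `train_width`, and `train_height` come from the comments above:

```yaml
# Hypothetical excerpt of the default training config
use_8bit_adam: true   # bitsandbytes 8-bit Adam; shrinks optimizer state memory
train_width: 256      # reduced from the default resolution to fit 32GB V100s
train_height: 256
```

Lowering the training resolution cuts activation memory quadratically, and the 8-bit optimizer roughly quarters the Adam state, which together is what made the 32GB cards sufficient here.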