
Required minimum specs to start training?


Hi @lixunsong, I am using a single GPU with 16GB VRAM for training and am encountering a CUDA out-of-memory error. Can you tell me the minimum specifications required to start training? I have already reduced the batch_size and the video dimensions as well.

[screenshot of the CUDA out-of-memory error]

AMujtaba57 avatar Jan 19 '24 12:01 AMujtaba57

I met the same problem. I am using 4 GPUs with 24GB VRAM each for training, and it also reports out-of-memory errors.

zhang-yige avatar Jan 22 '24 01:01 zhang-yige

I see the same thing with 8 x 24 GB GPUs; I don't know which settings to adjust to reduce memory usage.

seijiang avatar Jan 22 '24 12:01 seijiang

We currently cannot give a specific number for the minimum VRAM needed to train a model of the expected quality. To reduce VRAM usage, we recommend first decreasing the batch size and then lowering the resolution. For stage 2, you can also shorten the video clips (e.g. from 24 to 16 frames).
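For reference, a minimal sketch of such overrides, assuming the training configs are OmegaConf YAML files; the key names (train_bs, train_width, train_height, n_sample_frames) and the config path are assumptions based on this thread, not the repo's verified schema.

```python
# Hedged sketch: lower the memory-heavy settings before launching training.
# All key names and the config path below are assumptions, not verified
# against the repo's actual config files.
from omegaconf import OmegaConf

overrides = OmegaConf.create(
    {
        "data": {
            "train_bs": 1,          # reduce the batch size first
            "train_width": 256,     # then lower the resolution
            "train_height": 256,
            "n_sample_frames": 16,  # stage 2 only: shorter clips (24 -> 16)
        }
    }
)

base = OmegaConf.load("configs/train/stage2.yaml")  # assumed path
OmegaConf.save(OmegaConf.merge(base, overrides), "configs/train/stage2_lowmem.yaml")
```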

lixunsong avatar Jan 22 '24 14:01 lixunsong

I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error during training.

guanghui0607 avatar Jan 24 '24 08:01 guanghui0607

@guanghui0607, can you try testing it with 8x A100, each with 40GB? I think that would work.

AMujtaba57 avatar Jan 24 '24 08:01 AMujtaba57

I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error during training.

Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.

lixunsong avatar Jan 24 '24 08:01 lixunsong

After making the following changes, it worked on a single GPU.

  1. Add retain_graph=True to the torch.autograd.backward(outputs_with_grad, args_with_grad) call at line 157 of /opt/conda/envs/animate/lib/python3.10/site-packages/torch/utils/checkpoint.py.
  2. Set use_8bit_adam and gradient_checkpointing to True (see the sketch after this list).
  3. pip install bitsandbytes
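For steps 2 and 3, this is roughly what the two flags amount to in a diffusers-style training script: a hedged sketch, not this repo's actual code; the model id is only a placeholder.

```python
# Hedged sketch of what use_8bit_adam + gradient_checkpointing enable in a
# diffusers-style training script (illustrative names, not this repo's code).
import bitsandbytes as bnb
from diffusers import UNet2DConditionModel

# Placeholder base model; the repo's actual denoising UNet is initialized
# differently.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

unet.enable_gradient_checkpointing()   # gradient_checkpointing: True

optimizer = bnb.optim.AdamW8bit(       # use_8bit_adam: True (needs bitsandbytes)
    unet.parameters(), lr=1e-5, weight_decay=1e-2
)
```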

guanghui0607 avatar Jan 26 '24 06:01 guanghui0607

I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error during training.

Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.

I couldn't run inference normally, but after making the above changes it worked even with height and width set to 512.

guanghui0607 avatar Jan 26 '24 06:01 guanghui0607

@guanghui0607, can you try testing it with 8x A100, each with 40GB? I think that would work.

Thanks! I haven't got an A100 yet; I will try once I get one.

guanghui0607 avatar Jan 26 '24 07:01 guanghui0607

I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error during training.

Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.

I couldn't run inference normally, but after making the above changes it worked even with height and width set to 512.

Can you train on 8x V100 now?

renrenzsbbb avatar Jan 30 '24 12:01 renrenzsbbb

I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error during training.

Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.

I couldn't run inference normally, but after making the above changes it worked even with height and width set to 512.

Can you train on 8x V100 now?

No, I can only train it on a single V100, but it stopped without any error after completing 20% of the training steps.

guanghui0607 avatar Jan 31 '24 03:01 guanghui0607

I managed to train the model on 8x V100 GPUs by just changing use_8bit_adam to true and train_width/train_height to 256 in the default config file.

guanghui0607 avatar Jan 31 '24 06:01 guanghui0607

I managed to train the model on 8x V100 GPUs by just changing use_8bit_adam to true and train_width/train_height to 256 in the default config file.

Thanks for your advice. Do you get good results with this setting for the first stage?

renrenzsbbb avatar Jan 31 '24 12:01 renrenzsbbb

I managed to train the model on 8x V100 GPUs by just changing use_8bit_adam to true and train_width/train_height to 256 in the default config file.

I find that changing use_8bit_adam to true greatly reduces memory usage; I can train at 768x768 with batch size 1, or at 512x512 with batch size 2.
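For intuition on why 8-bit Adam helps this much, here is a back-of-the-envelope estimate of optimizer-state memory; the parameter count is only an assumption (roughly an SD-1.5-sized UNet), not this repo's exact trainable count.

```python
# Rough optimizer-state memory estimate; n_params is an assumption and the
# 8-bit figure ignores bitsandbytes' small quantization constants.
n_params = 860e6                     # about the size of an SD-1.5 UNet

adamw_fp32 = n_params * 8 / 2**30    # AdamW: two fp32 states = 8 bytes/param
adamw_8bit = n_params * 2 / 2**30    # AdamW8bit: two 1-byte quantized states

print(f"AdamW fp32 states:  ~{adamw_fp32:.1f} GiB")  # ~6.4 GiB
print(f"AdamW 8-bit states: ~{adamw_8bit:.1f} GiB")  # ~1.6 GiB
```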

renrenzsbbb avatar Jan 31 '24 12:01 renrenzsbbb

After making the following changes, it worked on a single GPU.

  1. Add retain_graph=True to the torch.autograd.backward(outputs_with_grad, args_with_grad) call at line 157 of /opt/conda/envs/animate/lib/python3.10/site-packages/torch/utils/checkpoint.py.
  2. Set use_8bit_adam and gradient_checkpointing to True.
  3. pip install bitsandbytes

Yes, it works, using a V100 (32G).

duanjiding avatar May 08 '24 14:05 duanjiding