
Required minimum specs to start training?


Hi @lixunsong, I am using a single GPU with 16GB VRAM for training and am encountering a CUDA out-of-memory error. Can you tell me the minimum specifications required to start training? I have already reduced the batch_size and the video dimensions as well.

[screenshot of the CUDA out-of-memory error]

AMujtaba57 avatar Jan 19 '24 12:01 AMujtaba57

I met the same problem. I am using 4 GPUs with 24GB VRAM each for training, and it also reports out-of-memory errors.

zhang-yige avatar Jan 22 '24 01:01 zhang-yige

I see the same thing with 8 x 24 GB GPUs; I don't know which settings to adjust to reduce memory usage.

seijiang avatar Jan 22 '24 12:01 seijiang

We currently cannot give a specific number for the minimum VRAM needed to train a model of the expected quality. To reduce VRAM usage, we recommend first decreasing the batch size and then lowering the resolution. For stage 2, you can also shorten the video clips (e.g. from 24 to 16 frames).
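For reference, a minimal sketch of such overrides, assuming the training configs are OmegaConf YAML files; the key names (train_bs, train_width, train_height, n_sample_frames) and the config path are assumptions based on this thread, not the repo's verified schema.

```python
# Hedged sketch: lower the memory-heavy settings before launching training.
# All key names and the config path below are assumptions, not verified
# against the repo's actual config files.
from omegaconf import OmegaConf

overrides = OmegaConf.create(
    {
        "data": {
            "train_bs": 1,          # reduce the batch size first
            "train_width": 256,     # then lower the resolution
            "train_height": 256,
            "n_sample_frames": 16,  # stage 2 only: shorter clips (24 -> 16)
        }
    }
)

base = OmegaConf.load("configs/train/stage2.yaml")  # assumed path
OmegaConf.save(OmegaConf.merge(base, overrides), "configs/train/stage2_lowmem.yaml")
```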

lixunsong avatar Jan 22 '24 14:01 lixunsong

I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error during training.

guanghui0607 avatar Jan 24 '24 08:01 guanghui0607

@guanghui0607, can you try testing it with 8x A100, each with 40GB? I think that would work.

AMujtaba57 avatar Jan 24 '24 08:01 AMujtaba57

I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error during training.

Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.

lixunsong avatar Jan 24 '24 08:01 lixunsong

After making the following changes, it worked on a single GPU.

  1. Add retain_graph=True to the torch.autograd.backward(outputs_with_grad, args_with_grad) call at line 157 of /opt/conda/envs/animate/lib/python3.10/site-packages/torch/utils/checkpoint.py.
  2. Set use_8bit_adam and gradient_checkpointing to True (see the sketch after this list).
  3. pip install bitsandbytes
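For steps 2 and 3, this is roughly what the two flags amount to in a diffusers-style training script: a hedged sketch, not this repo's actual code; the model id is only a placeholder.

```python
# Hedged sketch of what use_8bit_adam + gradient_checkpointing enable in a
# diffusers-style training script (illustrative names, not this repo's code).
import bitsandbytes as bnb
from diffusers import UNet2DConditionModel

# Placeholder base model; the repo's actual denoising UNet is initialized
# differently.
unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

unet.enable_gradient_checkpointing()   # gradient_checkpointing: True

optimizer = bnb.optim.AdamW8bit(       # use_8bit_adam: True (needs bitsandbytes)
    unet.parameters(), lr=1e-5, weight_decay=1e-2
)
```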

guanghui0607 avatar Jan 26 '24 06:01 guanghui0607

I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error during training.

Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.

I couldn't run inference normally, but after making the above changes it worked even with height and width set to 512.

guanghui0607 avatar Jan 26 '24 06:01 guanghui0607

@guanghui0607, can you try testing it with 8x A100, each with 40GB? I think that would work.

Thanks! I haven't got an A100 yet; I will try once I get one.

guanghui0607 avatar Jan 26 '24 07:01 guanghui0607

I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error during training.

Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.

I couldn't run inference normally, but after making the above changes it worked even with height and width set to 512.

Can you train on 8x V100 now?

renrenzsbbb avatar Jan 30 '24 12:01 renrenzsbbb

I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error during training.

Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.

I couldn't run inference normally, but after making the above changes it worked even with height and width set to 512.

Can you train on 8x V100 now?

No, I can only train it on a single V100, but it stopped without any error after completing 20% of the training steps.

guanghui0607 avatar Jan 31 '24 03:01 guanghui0607

I managed to train the model on 8x V100 GPUs by just changing use_8bit_adam to true and train_width/train_height to 256 in the default config file.

guanghui0607 avatar Jan 31 '24 06:01 guanghui0607

I managed to train the model on 8x V100 GPUs by just changing use_8bit_adam to true and train_width/train_height to 256 in the default config file.

Thanks for your advice. Do you get good results with this setting for the first stage?

renrenzsbbb avatar Jan 31 '24 12:01 renrenzsbbb

I managed to train the model on 8x V100 GPUs by just changing use_8bit_adam to true and train_width/train_height to 256 in the default config file.

I find that changing use_8bit_adam to true greatly reduces memory usage; I can train at 768x768 with batch size 1, or at 512x512 with batch size 2.
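For intuition on why 8-bit Adam helps this much, here is a back-of-the-envelope estimate of optimizer-state memory; the parameter count is only an assumption (roughly an SD-1.5-sized UNet), not this repo's exact trainable count.

```python
# Rough optimizer-state memory estimate; n_params is an assumption and the
# 8-bit figure ignores bitsandbytes' small quantization constants.
n_params = 860e6                     # about the size of an SD-1.5 UNet

adamw_fp32 = n_params * 8 / 2**30    # AdamW: two fp32 states = 8 bytes/param
adamw_8bit = n_params * 2 / 2**30    # AdamW8bit: two 1-byte quantized states

print(f"AdamW fp32 states:  ~{adamw_fp32:.1f} GiB")  # ~6.4 GiB
print(f"AdamW 8-bit states: ~{adamw_8bit:.1f} GiB")  # ~1.6 GiB
```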

renrenzsbbb avatar Jan 31 '24 12:01 renrenzsbbb

After making the following changes, it worked on a single GPU.

  1. Add retain_graph=True to the torch.autograd.backward(outputs_with_grad, args_with_grad) call at line 157 of /opt/conda/envs/animate/lib/python3.10/site-packages/torch/utils/checkpoint.py.
  2. Set use_8bit_adam and gradient_checkpointing to True.
  3. pip install bitsandbytes

Yes, it works, using a V100 (32G).

duanjiding avatar May 08 '24 14:05 duanjiding