Required minimum specs to start training?
Hi @lixunsong, I am using a single GPU with 16GB VRAM for training, and I am encountering a CUDA out-of-memory error. Can you tell me the minimum specifications required to start training? I have already reduced the batch_size and the video's dimensions as well.
I ran into the same issue. I am using 4 GPUs with 24GB VRAM each for training, and it also reports errors.
I am seeing the same situation with 8x 24GB GPUs. I don't know what I can adjust to reduce memory usage.
We currently cannot give a specific number for the minimum amount of VRAM needed to train a model that performs as expected. If you want to reduce VRAM usage, we recommend starting by decreasing the batch size and then lowering the resolution. For stage 2, you can also reduce the video length (e.g., from 24 to 16 frames).
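As a rough sketch of what those suggestions might look like as config overrides (the key names train_bs, train_width, train_height, and n_sample_frames are assumptions for illustration, not confirmed names from the repo's config schema):

```python
# Hypothetical low-VRAM overrides; the key names are assumptions, so check your own config file.
low_vram_overrides = {
    "train_bs": 1,          # reduce the batch size first
    "train_width": 512,     # then lower the training resolution
    "train_height": 512,
    "n_sample_frames": 16,  # stage 2 only: shorter clips (e.g. 24 -> 16 frames)
}
```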
I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error in training.
@guanghui0607, can you try testing it with 8x A100, each with 40GB? I think that would work.
> I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error in training.

Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.
After making the following changes, it worked on a single GPU (see the sketch after this list):
- Add retain_graph=True to the torch.autograd.backward(outputs_with_grad, args_with_grad) call in /opt/conda/envs/animate/lib/python3.10/site-packages/torch/utils/checkpoint.py (around line 157)
- Set use_8bit_adam and gradient_checkpointing to True
- pip install bitsandbytes
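For the second and third items, this is roughly what the setup looks like; a minimal sketch assuming a diffusers-style UNet and a use_8bit_adam flag, not the repo's exact training code:

```python
import torch
import bitsandbytes as bnb  # provides the 8-bit AdamW optimizer (pip install bitsandbytes)

def build_low_memory_optimizer(unet, use_8bit_adam: bool = True, lr: float = 1e-5):
    # Gradient checkpointing trades extra compute for lower activation memory:
    # checkpointed blocks are re-run during the backward pass instead of being cached.
    unet.enable_gradient_checkpointing()  # available on diffusers UNet models

    # 8-bit Adam stores the optimizer states in 8 bits instead of 32 bits,
    # which is where most of the savings reported in this thread come from.
    optimizer_cls = bnb.optim.AdamW8bit if use_8bit_adam else torch.optim.AdamW
    return optimizer_cls(unet.parameters(), lr=lr)
```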
> I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error in training.

> Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.

I couldn't run inference normally, but after making the above changes it worked even with height and width set to 512.
> @guanghui0607, can you try testing it with 8x A100, each with 40GB? I think that would work.

Thanks! I haven't got an A100 yet; I will try once I get one.
> I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error in training.

> Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.

> I couldn't run inference normally, but after making the above changes it worked even with height and width set to 512.

Can you train on 8x V100 now?
> I'm using 8x V100 (32GB). Even with batch size set to 1, height/width set to 8, and sample_margin set to 1, I still get an OOM error in training.

> Can you run inference normally? It's abnormal to encounter an OOM error when reducing the height and width to 8.

> I couldn't run inference normally, but after making the above changes it worked even with height and width set to 512. Can you train on 8x V100 now?

No, I can only train it on a single V100, but it stopped without any error after completing 20% of the training steps.
I managed to train the model on 8x V100 GPUs just by changing use_8bit_adam to true and train_width/train_height to 256 in the default config file.
> I managed to train the model on 8x V100 GPUs just by changing use_8bit_adam to true and train_width/train_height to 256 in the default config file.

Thanks for your advice. Did you get good results with that setting for the first stage?
> I managed to train the model on 8x V100 GPUs just by changing use_8bit_adam to true and train_width/train_height to 256 in the default config file.

I find that changing use_8bit_adam to true greatly reduces memory usage, and I can train at 768x768 with batch size 1 or at 512x512 with batch size 2.
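To see why use_8bit_adam helps so much, here is a back-of-envelope comparison of optimizer-state memory; the ~700M trainable-parameter count is an assumption for illustration, not the actual size of this model:

```python
# AdamW keeps two moment tensors per parameter: 4 bytes each in fp32, ~1 byte each in 8-bit.
n_params = 700e6  # assumed trainable parameter count, for illustration only

fp32_states_gib = n_params * 2 * 4 / 2**30
int8_states_gib = n_params * 2 * 1 / 2**30

print(f"fp32 AdamW states:  ~{fp32_states_gib:.1f} GiB")   # ~5.2 GiB
print(f"8-bit AdamW states: ~{int8_states_gib:.1f} GiB")   # ~1.3 GiB
```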
> After making the following changes, it worked on a single GPU:
> - Add retain_graph=True to the torch.autograd.backward(outputs_with_grad, args_with_grad) call in /opt/conda/envs/animate/lib/python3.10/site-packages/torch/utils/checkpoint.py (around line 157)
> - Set use_8bit_adam and gradient_checkpointing to True
> - pip install bitsandbytes

Yes, it works on a V100 (32GB).