
OUT OF MEMORY

BlackPuuuudding opened this issue

Why does my training run out of memory when the batch size is set to 4, and also run out of memory with a batch size of 2 during multi-GPU training, while the paper manages to set it to 8? I'm using the same device as the one mentioned in the paper, a 4090, and the ckpts are SD 1.4 and interact-diffusion-v1-1.pth. Thank you!

BlackPuuuudding avatar Apr 26 '24 07:04 BlackPuuuudding

We use this command for training:

```
CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node=2 main.py --yaml_file configs/hoi_hico_text.yaml --ckpt <existing_gligen_checkpoint> --name test --batch_size=4 --gradient_accumulation_step 2 --total_iters 500000 --amp true --disable_inference_in_training true --official_ckpt_name <existing SD v1.4/v1.5 checkpoint>
```

We use AMP, and the batch size is set to 4 per GPU.
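For reference, the per-GPU batch size in the command is not the whole story: with 2 GPUs and gradient accumulation, the effective batch size is larger. A quick illustration (the numbers simply mirror the flags in the command above; nothing here is taken from the repo's code):

```python
# Effective batch size implied by the training command:
#   torchrun --nproc_per_node=2  -> 2 GPUs
#   --batch_size=4               -> 4 samples per GPU per step
#   --gradient_accumulation_step 2 -> gradients summed over 2 steps
n_gpus = 2
per_gpu_batch_size = 4
grad_accum_steps = 2

effective_batch_size = n_gpus * per_gpu_batch_size * grad_accum_steps
print(effective_batch_size)  # 16
```

So each optimizer update sees 16 samples, while only 4 samples per GPU are resident in memory at a time; AMP (`--amp true`) further reduces activation memory by running most ops in half precision.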

Training details are in the README.

jiuntian avatar Apr 26 '24 07:04 jiuntian