I use `image_finetune.yaml` to fine-tune the model, and I found that the first step of each epoch gets stuck for about 5 minutes.
I changed the `train_step` parameter in `image_finetune.yaml` to 2000 steps, which trains for multiple epochs, but the machine still stalls for five minutes at the start of each epoch.
This seems to be related to the `num_workers` setting of the dataloader: after I changed `num_workers` to 0, the stall disappeared, but each step became longer because of the data loading time. Has anyone encountered this problem? I used 8 A800s to fine-tune the model.
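One common cause of a long stall at the start of every epoch is that the DataLoader tears down and respawns its worker processes between epochs. A possible mitigation (assuming PyTorch >= 1.7, and a dummy dataset standing in for the real one) is `persistent_workers=True`, which keeps workers alive across epochs so the spawn cost is paid only once:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy dataset standing in for the real image/video dataset.
dataset = TensorDataset(torch.randn(64, 3, 8, 8), torch.zeros(64))

# persistent_workers=True keeps worker processes alive between epochs,
# so the per-epoch worker spawn/teardown cost is paid only once.
loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=2,            # must be > 0 for persistent_workers
    persistent_workers=True,
    pin_memory=True,
)

for epoch in range(2):
    for batch, _ in loader:  # workers are reused on the second epoch
        pass
```

This does not remove the one-time startup cost of the first epoch, but it should avoid paying it again at every epoch boundary.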
Hello, according to the paper, the image fine-tuning stage should train a domain adapter, but the config is set up to train the complete UNet. Have you noticed this discrepancy?
@HelloWorldBeginner Hi, did you figure out why the program gets stuck? In my experiment, the program always hangs. I tried modifying `num_workers`, but it didn't help.
Hi @HelloWorldBeginner ,
I have tried setting `num_workers=0`, but I reached the opposite conclusion: training with a higher value of `num_workers` (previously 32) speeds up data loading. I am not sure where the problem lies or how to solve it. Do you have any update on this issue?
Best regards,
The issue only occurs during multi-GPU training. You need to modify the training shell script to support multiple GPUs.
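For reference, a multi-GPU launch script might look like the following (a sketch assuming the repo's `train.py` and `configs/training.yaml`; adjust the GPU count and paths to your setup):

```shell
#!/bin/bash
# Launch single-node training across all 8 GPUs with torchrun.
# --nnodes / --nproc_per_node set the node count and GPUs per node.
torchrun --nnodes=1 --nproc_per_node=8 \
    train.py --config configs/training.yaml
```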
Hi @HelloWorldBeginner ,
I have tried multi-GPU training, but the code still gets stuck at 0%. Here is my command line: `torchrun --nnodes=1 --nproc_per_node=2 train.py --config configs/training.yaml`.
Besides, I found that the code gets stuck in the `video_reader()` function, where `video_reader` is `VideoReader` from decord. I wonder whether using decord is what makes data loading slow. Do you have any idea about this?
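One pattern that sometimes helps when decord hangs inside multi-worker DataLoaders is to construct the `VideoReader` lazily inside `__getitem__`, so the handle is created in the worker process rather than created in (and forked from) the main process. A hedged sketch; `LazyVideoDataset` and `video_paths` are illustrative names, not from the repo:

```python
from torch.utils.data import Dataset

class LazyVideoDataset(Dataset):
    """Opens each video inside the worker process instead of __init__,
    so no decord handle exists in the main process at fork time."""

    def __init__(self, video_paths):
        self.video_paths = list(video_paths)

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        # Import and construct lazily, inside the worker process.
        from decord import VideoReader, cpu
        vr = VideoReader(self.video_paths[idx], ctx=cpu(0))
        frame = vr[0].asnumpy()  # e.g. read the first frame
        return frame
```

This is only a sketch of the lazy-initialization idea; it does not rule out decord itself being the bottleneck, which you could check by timing `VideoReader` construction and frame reads separately.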
Best