
Training on the provided config not converging

TL-QZ opened this issue 1 year ago · 1 comment

Dear Authors of GraphMotion,

Congrats on your great work, and thank you for open-sourcing the codebase. However, following the provided training instructions, I am not able to get the validation performance to converge, despite the training loss decreasing. I have attached screenshots of the TensorBoard validation and training curves, specifically from training the diffusion component with the provided VAE checkpoints.

Could you please give some hints, or is there anything that needs to be changed in the config file?

The only difference is that I am currently using 2 GPUs instead of 4, which results in a total batch size of 256, but I assume this alone should not prevent the model from converging at all?
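For reference, here is the effective batch size I am computing. The per-GPU batch of 128 is my reading of the provided config (an assumption), and the gradient-accumulation line just shows one way to recover the authors' 4-GPU total on 2 GPUs:

```python
# Effective batch size sanity check.
# Assumes a per-GPU batch of 128, which is what my 2-GPU run implies.
per_gpu_batch = 128      # samples per GPU per optimizer step
num_gpus = 2             # my setup (the authors used 4)
grad_accum_steps = 1     # gradient accumulation steps

effective_batch = per_gpu_batch * num_gpus * grad_accum_steps
print(effective_batch)   # 256 on 2 GPUs; the 4-GPU setup gives 512

# Accumulating gradients over 2 steps would recover the original total:
recovered = per_gpu_batch * num_gpus * 2
print(recovered)         # 512
```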

Thanks in advance for your help!

[Screenshots (Jan 12, 2025): TensorBoard training and validation curves]

TL-QZ avatar Jan 13 '25 02:01 TL-QZ

I found that the batch size has a significant impact on the model's performance, so I do not recommend changing it. The reason is simple: our model has three stages of the diffusion process, and we train all three stages simultaneously. To achieve good performance, the convergence rates of these three stages need to be similar.

If your computer doesn’t have enough GPU memory to train all three stages at once, you can try tuning one stage at a time. I will release an updated version of the code in a week that supports training one stage at a time. In my experiments, training one stage at a time leads to better performance, as it helps solve the issue of inconsistent convergence rates between the stages.
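Until that update lands, one way to approximate "one stage at a time" is to freeze the denoisers of the other stages. This is only an illustrative sketch; the class and attribute names (`HierarchicalDenoiser`, `stage_denoisers`) are hypothetical and will differ from GraphMotion's actual module layout:

```python
# Hypothetical sketch: train one diffusion stage at a time by freezing the
# other stages' denoisers. Names here are illustrative, not GraphMotion's API.
import torch.nn as nn

class HierarchicalDenoiser(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # One denoiser per semantic level of the hierarchy (three stages).
        self.stage_denoisers = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(3)
        )

def freeze_all_but(model, active_stage):
    """Enable gradients only for the chosen stage's denoiser."""
    for i, stage in enumerate(model.stage_denoisers):
        for p in stage.parameters():
            p.requires_grad = (i == active_stage)

model = HierarchicalDenoiser()
freeze_all_but(model, active_stage=1)

# Only stage 1's parameters remain trainable; pass just these to the optimizer.
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

Running the stages sequentially this way also lets each stage converge at its own rate, which is the issue simultaneous training runs into.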

jpthu17 avatar Jan 14 '25 12:01 jpthu17