CogVideo Training details on image to video.

System Info

First of all, thank you for your excellent work!

Information

[X] The official example scripts
[ ] My own modified scripts

Reproduction

The paper's description of image-to-video training is not very detailed. I have two questions that I hope you can answer:

Starting from the text-to-video model, how much training data is needed to turn it into a text-and-image-to-video model?
The paper says, "To enhance the model's robustness, we add large noise to the image condition during training." How much noise should be added to the image condition?
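
To make the second question concrete, here is a minimal sketch of what "adding noise to the image condition" could look like. It is only an illustration: the function name, the log-normal sampling of the noise level, and the default values of `log_sigma_mean` and `log_sigma_std` are assumptions, not values confirmed by the paper or by this thread.

```python
import numpy as np


def add_noise_to_image_condition(image, log_sigma_mean=-3.0, log_sigma_std=0.5, rng=None):
    """Return (noised_image, sigma) for a conditioning batch shaped (B, C, H, W).

    The noise schedule here is hypothetical: one noise level per sample,
    drawn from a log-normal distribution. The real training recipe may differ.
    """
    rng = np.random.default_rng() if rng is None else rng
    batch = image.shape[0]

    # Sample one standard deviation per sample (assumed log-normal).
    sigma = np.exp(rng.normal(log_sigma_mean, log_sigma_std, size=batch))

    # Reshape to (B, 1, 1, 1) so each sigma broadcasts over its own image.
    sigma_b = sigma.reshape(batch, *([1] * (image.ndim - 1)))

    # Add Gaussian noise scaled by the per-sample sigma.
    noise = rng.standard_normal(image.shape)
    return image + sigma_b * noise, sigma


# Usage: noise a dummy batch of two 3-channel 8x8 conditioning images.
rng = np.random.default_rng(0)
image = np.zeros((2, 3, 8, 8))
noised, sigma = add_noise_to_image_condition(image, rng=rng)
```

The open question is which distribution and scale to use for `sigma`; the sketch only fixes the mechanics.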

Expected behavior

Related issues: https://github.com/THUDM/CogVideo/issues/88

Sep 02 '24 09:09 wangqiang9

It was previously mentioned that training CogVideoX-5b took approximately 100k H100 GPU hours. Given this, how difficult would it be to train an i2v model starting from the t2v model? Could it take only 1/100 of that time, or even less?

Sep 03 '24 06:09 lishiyuwhu

We used approximately 1/10 of the GPU hours for i2v fine-tuning, but similar performance can be achieved with fewer GPU hours.
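
As rough arithmetic using only the figures quoted in this thread (about 100k H100 GPU hours for the base model, about 1/10 of that for i2v fine-tuning), the budget can be estimated as below. The cluster size of 64 GPUs is a hypothetical example, and the helper name is illustrative.

```python
def gpu_hours_to_days(gpu_hours, num_gpus):
    """Convert a GPU-hour budget to wall-clock days, assuming perfect scaling."""
    return gpu_hours / num_gpus / 24


# Figures quoted in this thread (both approximate).
t2v_gpu_hours = 100_000            # base t2v training
i2v_gpu_hours = t2v_gpu_hours / 10  # i2v fine-tuning, about 1/10

# Hypothetical cluster size, for illustration only.
num_gpus = 64
i2v_days = gpu_hours_to_days(i2v_gpu_hours, num_gpus)

print(f"i2v fine-tuning: ~{i2v_gpu_hours:,.0f} GPU hours, "
      f"~{i2v_days:.1f} days on {num_gpus} GPUs")
```

That works out to roughly 10k H100 GPU hours, or about 6.5 days on 64 GPUs under ideal scaling.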

Sep 10 '24 07:09 yzy-thu