CogVideo
Theoretically, what limits the number of frames the model can generate at inference?
The DiT's parameter matrices can operate on latents of any shape, can't they? If we increase the frame dimension, say from 13 to 27, only the attention map gets bigger; the model can still do the matrix computation with the pre-trained parameters and generate a longer video, right?
So what limits the number of frames?
The max training length
Thanks for your reply. So the answer is that the model can compute on a larger input tensor, but the quality of the generated video will drop a lot because the model hasn't been trained at that length, right?
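The shape argument in this thread can be illustrated with a small sketch. It is not CogVideo's code: it assumes a single attention head, one token per latent frame, a hypothetical embedding size `DIM = 8`, and random matrices standing in for pre-trained weights. It only shows that the same projection matrices accept 13 or 27 tokens and that the attention map grows accordingly; it says nothing about output quality, which depends on what lengths (and positional encodings) the model saw during training.

```python
import math
import random

DIM = 8  # hypothetical embedding size, chosen only for this illustration


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head self-attention; the weights are DIM x DIM whatever the token count."""
    q = matmul(tokens, w_q)
    k = matmul(tokens, w_k)
    v = matmul(tokens, w_v)
    k_t = [list(col) for col in zip(*k)]  # transpose K
    scale = 1.0 / math.sqrt(DIM)
    scores = matmul(q, k_t)  # shape: num_tokens x num_tokens
    attn = [softmax([s * scale for s in row]) for row in scores]
    return matmul(attn, v), attn


def rand_matrix(rows, cols):
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
# Stand-ins for pre-trained parameters: created once, reused for every length.
w_q, w_k, w_v = (rand_matrix(DIM, DIM) for _ in range(3))


def attention_shapes(num_frames):
    """Run attention on num_frames tokens; return (output shape, attention map shape)."""
    tokens = rand_matrix(num_frames, DIM)  # assumption: one token per latent frame
    out, attn = self_attention(tokens, w_q, w_k, w_v)
    return (len(out), len(out[0])), (len(attn), len(attn[0]))


for n in (13, 27):
    out_shape, attn_shape = attention_shapes(n)
    print(f"frames={n}: output {out_shape}, attention map {attn_shape}")
```

The weights never change shape, so nothing in the arithmetic rejects the longer input; the limit discussed above comes from training, not from the matrix dimensions.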