Theoretically, what limits the number of frames the model can run inference on?

Open HWT-WalterHu opened this issue 1 year ago • 2 comments

The parameter matrices of a DiT can operate on latents of any shape, can't they? If we increase the frame dimension, say from 13 to 27, only the attention map gets bigger. The matrix computations with the pre-trained parameters still go through, so can't the model generate a longer video?
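To make the shape argument concrete, here is a minimal sketch in plain Python, not this repository's code. It assumes a single attention head with fixed projection weights, and `DIM`, `TOKENS_PER_FRAME`, and all function names are hypothetical values chosen for illustration. It only shows that the same weights accept 13 or 27 frames and that the attention map grows with the token count; it says nothing about output quality.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head attention. The weights are (DIM x DIM), independent of token count."""
    q, k, v = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = 1.0 / math.sqrt(len(w_q[0]))
    scores = [[s * scale for s in row] for row in matmul(q, transpose(k))]
    attn = [softmax(row) for row in scores]  # (num_tokens x num_tokens)
    return matmul(attn, v), attn


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
DIM = 8               # hypothetical hidden size
TOKENS_PER_FRAME = 4  # hypothetical number of spatial tokens per latent frame

# Stand-ins for pre-trained parameters: created once, reused for every input length.
w_q, w_k, w_v = (random_matrix(DIM, DIM, rng) for _ in range(3))


def run(num_frames):
    """Return (output shape, attention-map shape) for a clip of num_frames latent frames."""
    tokens = random_matrix(num_frames * TOKENS_PER_FRAME, DIM, rng)
    out, attn = self_attention(tokens, w_q, w_k, w_v)
    return (len(out), len(out[0])), (len(attn), len(attn[0]))


for frames in (13, 27):
    out_shape, attn_shape = run(frames)
    print(f"{frames} frames: output {out_shape}, attention map {attn_shape}")
```

With these assumed sizes, the attention map goes from 52 x 52 to 108 x 108 while the weight matrices stay 8 x 8, so the cost grows roughly quadratically with the number of frames.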

So what limits the number of frames?

Jan 13 '25 07:01 HWT-WalterHu

The max training length

Jan 14 '25 05:01 yzy-thu

The max training length

Thanks for your reply. So the answer is that the model can compute on a larger input tensor, but the quality of the generated video will drop a lot because the model hasn't been trained at that length, right?

Jan 14 '25 07:01 HWT-WalterHu