Samit
"we consider the first frame of a video to be an image..." I see, so the first frame is always encoded from the 1st frame repeated k-1 times. But for upsampling, the...
> Sorry for that. We merged that change to fix this bug, thanks. BTW, since the computation logic has changed, the model may require re-training.
+1 Looking forward to the open-sourcing of the text2video model
I see. So the attention map complexity will be (H*W*T)^2. Is that feasible for long-video training? Are there any generation results from the training code? (Loss curve in diffusion model...
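To make the scaling concern concrete, here is a quick back-of-envelope sketch of how the materialized attention-map size grows with full 3D (spatio-temporal) attention. The function name and the example shapes are illustrative, not from the repo:

```python
# Rough cost model for full 3D attention over an H x W x T latent grid,
# assuming the (H*W*T) x (H*W*T) attention map is materialized.
def attention_map_elements(h: int, w: int, t: int) -> int:
    """Entries in the attention map per head per layer (batch excluded)."""
    n = h * w * t  # total number of tokens attended over jointly
    return n * n

# Example: a 32x32 latent over 16 frames -> (32*32*16)^2 entries.
print(attention_map_elements(32, 32, 16))
# Doubling T quadruples this, which is why long-video training is the worry.
print(attention_map_elements(32, 32, 32))
```

Factorized spatial/temporal attention would instead cost roughly (H*W)^2 * T + T^2 * H*W, which is the usual workaround when full 3D attention is too expensive.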
Please supplement the README with accuracy and performance comparisons against ViT.
Please report the results for the CRNN server version and upload the checkpoint and MindIR file.
Thanks. Checkpoint saving: save a ckpt at the end of every epoch, with an optional last_k or top_k retention policy.
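A minimal sketch of the two retention policies mentioned above (last_k keeps the k most recent checkpoints, top_k keeps the k best by a validation metric). The `CheckpointManager` class and its API are illustrative, not from this repo:

```python
# Sketch of per-epoch checkpoint retention with last_k / top_k policies.
class CheckpointManager:
    def __init__(self, policy: str = "last_k", k: int = 3):
        assert policy in ("last_k", "top_k")
        self.policy = policy
        self.k = k
        self.records = []  # list of (metric, path), in save order

    def save(self, epoch: int, metric: float) -> list:
        """Register a ckpt saved after `epoch`; return paths to delete."""
        path = f"ckpt_epoch_{epoch}.ckpt"
        self.records.append((metric, path))
        dropped = []
        if self.policy == "last_k":
            # Evict the oldest checkpoints beyond the last k.
            while len(self.records) > self.k:
                dropped.append(self.records.pop(0)[1])
        else:
            # top_k: evict the checkpoint with the worst metric.
            if len(self.records) > self.k:
                worst = min(self.records, key=lambda r: r[0])
                self.records.remove(worst)
                dropped.append(worst[1])
        return dropped
```

Usage: `CheckpointManager("top_k", k=2)` keeps the two checkpoints with the highest metric and reports the evicted file so the training loop can delete it.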
> Thanks for this contribution! As we discussed offline, we'll be carefully reviewing this PR/design and think about how to enable end-to-end support for models like this with vLLM! looking...
Should we consider supporting E/P/D (encode/prefill/decode) disaggregation for large-scale multimodal model serving? It would be a beneficial feature for large-batch or encode-compute-heavy MLLM deployment scenarios. https://github.com/vllm-project/vllm/pull/25233
I think we can first support TP and CP for diffusion models by re-using the parallelism interfaces in vLLM. Then we can verify whether CP interfaces like `sequence_parallel_chunk`...
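For illustration, the core behavior a CP chunking interface has to get right is splitting the token sequence evenly across CP ranks, padding when the length is not divisible by the world size. The helper below is a hypothetical sketch of that behavior, not the actual `sequence_parallel_chunk` implementation:

```python
# Hypothetical sketch of context-parallel sequence chunking: each CP rank
# owns one contiguous, equal-sized slice of the (padded) token sequence.
def cp_chunk(tokens: list, rank: int, world_size: int, pad=0) -> list:
    """Return the contiguous chunk of `tokens` owned by `rank`."""
    remainder = len(tokens) % world_size
    if remainder:
        # Pad so every rank gets a chunk of identical length.
        tokens = tokens + [pad] * (world_size - remainder)
    chunk = len(tokens) // world_size
    return tokens[rank * chunk : (rank + 1) * chunk]
```

Whether such an interface carries over cleanly from autoregressive LLM inputs to diffusion-model latents (where tokens form an H*W*T grid rather than a 1D sequence) is exactly the kind of thing that would need verifying.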