DidiD1
I guess the author uses this package to compare against the performance of GAN-based methods in the code, so if you just want to run AnoDDPM, you can delete all...
the code I use:

```bash
python evaluate.py \
    --videos_path ./VBench/11_29_3s \
    --dimension "motion_smoothness" \
    --mode "custom_input"
```

and the output looks like:

```json
{
    "subject_consistency": [
        0.0,
        [
            {
                "video_path": xxx, ...
```
> same problem here #87

I tracked the bug and found something strange: the image_features from CLIP are all 0, which leads to the result of compute_background_consistency...
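For anyone hitting the same all-zero features, a minimal check may help narrow the cause before digging into VBench itself (the helper name here is illustrative, not part of the repo); a fully zero embedding usually points at decode failures (black frames) or a checkpoint that failed to load, rather than a genuine score of 0:

```python
import numpy as np

def check_zero_features(feats: np.ndarray, name: str = "image_features") -> bool:
    """Return True if every element of the feature array is (near) zero.

    An all-zero CLIP embedding typically means the input frames decoded as
    black images (bad video path / decoder failure) or the model weights
    did not load, not that the videos are genuinely dissimilar.
    """
    all_zero = bool(np.allclose(feats, 0.0))
    if all_zero:
        print(f"WARNING: {name} is all zero; check frame decoding and checkpoint loading")
    return all_zero

# A zero tensor trips the check; a normal embedding does not.
print(check_zero_features(np.zeros((8, 512))))   # True
print(check_zero_features(np.ones((8, 512))))    # False
```

Dropping a call like this right after the CLIP forward pass should tell you whether the zeros originate in the features or later in compute_background_consistency.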
This phenomenon was mentioned in the SD3 paper; maybe that is why they proposed the 'mode sampling with heavy tails' timestep-sampling method. However, it's strange that in their experimental results 'log-norm' is much better the...
Thanks a lot. And for my question 3: "when we use logit_normal, it is based on the RF setting, so the weight of the loss should be t/(1-t), but the code doesn't compute...
> > currently we're using sigmoid sampling for timesteps, which seems fine, but no one has really ablated whether it leaves fine details out
>
> Actually, sigmoid and lognorm...
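To make the discussion above concrete, here is a minimal sketch of logit-normal (sigmoid-of-Gaussian) timestep sampling together with the t/(1-t) rectified-flow loss weight mentioned in the earlier comment. Function names are illustrative and not taken from any repo discussed here:

```python
import torch

def sample_t_logit_normal(batch_size: int, mean: float = 0.0, std: float = 1.0) -> torch.Tensor:
    """Logit-normal timestep sampling: t = sigmoid(u) with u ~ N(mean, std).

    This concentrates samples around t = 0.5 and thins out both endpoints,
    which is the 'log-norm' schedule discussed for SD3-style training.
    """
    u = torch.randn(batch_size) * std + mean
    return torch.sigmoid(u)  # t lies strictly in (0, 1)

def rf_loss_weight(t: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Per-sample weight t/(1-t) for the rectified-flow loss, clamped for stability."""
    return t / (1.0 - t).clamp_min(eps)

t = sample_t_logit_normal(4)
w = rf_loss_weight(t)
assert ((t > 0) & (t < 1)).all() and (w > 0).all()
```

Whether a given codebase applies this weight explicitly or folds it into its loss parameterization is exactly the question raised above, so treat the weight function as an assumption to verify against the actual code.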
> What is this cache for? Does it mean I can call encode multiple times (splitting on the n_frame dimension) to lower the maximum GPU memory requirement while getting the...
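On the chunked-encode idea: a sketch of splitting along the frame axis looks like the helper below. Note the caveat in the docstring, which is the whole point of such a cache: chunking is only exact when the encoder has no cross-chunk temporal mixing, or handles that mixing through the cache. The function names are placeholders, not the repo's API:

```python
import torch

def encode_in_chunks(encode_fn, video: torch.Tensor, chunk: int = 8) -> torch.Tensor:
    """Encode a (B, C, T, H, W) video in chunks along the frame (T) axis.

    This matches a single encode_fn(video) call only if the encoder treats
    frames (or chunks) independently, or carries state across chunks via a
    cache; otherwise results differ at chunk boundaries.
    """
    parts = [encode_fn(video[:, :, s : s + chunk]) for s in range(0, video.shape[2], chunk)]
    return torch.cat(parts, dim=2)

# Toy frame-independent "encoder": chunked and full encodes agree exactly.
toy_encode = lambda x: x * 2.0
v = torch.randn(1, 3, 20, 4, 4)
assert torch.equal(encode_in_chunks(toy_encode, v), toy_encode(v))
```

Peak memory then scales with the chunk length rather than the full clip length, which is why this pattern is common for video VAEs.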
> As the title says: during inference, can num_frames only be set to 49, or can a shorter frame count be chosen? The results I generated with a shorter frame count have some issues.

It seems t2v can change the frame count, but i2v cannot. I guess it may be related to i2v's learnable_pos_embed?
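If that guess about learnable_pos_embed is right, the failure mode would be that a learned positional-embedding table has a fixed length tied to the training frame count (49 here), so a shorter clip indexes it inconsistently. A common workaround, sketched below with purely hypothetical names and a toy frame-major layout, is slicing the table down to the requested frame count:

```python
import torch

# Toy learned table: 49 frames x 16 tokens per frame, embedding dim 128.
pos_embed = torch.randn(49 * 16, 128)

def pos_embed_for_frames(pe: torch.Tensor, tokens_per_frame: int, num_frames: int) -> torch.Tensor:
    """Slice a frame-major learnable positional embedding down to num_frames.

    Only valid if the table really is laid out frame-major; i2v models that
    interleave image-condition and video positions may need extra care.
    """
    return pe[: tokens_per_frame * num_frames]

print(pos_embed_for_frames(pos_embed, 16, 13).shape)  # torch.Size([208, 128])
```

Whether the actual i2v checkpoint tolerates this depends on how its positional embedding is laid out, so this is a debugging direction rather than a confirmed fix.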
I also ran into this issue; it shows that the encoded embedding of the video is all 0, which is strange.
3D full attention vs. 2D+1D can be understood as different patchification schemes: 3D full attention patchifies all three dimensions at once, with one patch being 2*2*1 (h*w*t), whereas the separate (2D+1D) attention keeps the temporal and spatial dimensions distinct and attends over each of them separately.
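The 3D-full-attention side of that comparison can be sketched as follows: every 2*2*1 (h*w*t) patch becomes one token, and time and space are merged into a single sequence axis that full attention then runs over. The function is a toy illustration, not any particular model's code:

```python
import torch

def patchify_3d(video: torch.Tensor, ph: int = 2, pw: int = 2, pt: int = 1) -> torch.Tensor:
    """Flatten a (B, C, T, H, W) video into one token sequence.

    3D full attention: each (pt x ph x pw) patch is a token, and all tokens
    attend to each other, mixing time and space in one axis. A 2D+1D scheme
    would instead keep T' as its own axis and alternate spatial/temporal
    attention over it.
    """
    b, c, t, h, w = video.shape
    x = video.reshape(b, c, t // pt, pt, h // ph, ph, w // pw, pw)
    x = x.permute(0, 2, 4, 6, 1, 3, 5, 7)  # (B, T', H', W', C, pt, ph, pw)
    return x.reshape(b, (t // pt) * (h // ph) * (w // pw), c * pt * ph * pw)

v = torch.randn(1, 3, 4, 8, 8)
print(patchify_3d(v).shape)  # torch.Size([1, 64, 12]): 4*4*4 tokens of dim 3*1*2*2
```

The token count grows as T'*H'*W', which is why 3D full attention is expensive for long clips, while 2D+1D keeps each attention pass over a much shorter axis.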