pengyige123 comments

Results: 5 comments of pengyige123

[Bug] The low-noise model weights trained in WAN2.2 are then used for further low-noise VSA training, and the loss becomes Nan.

> Could you share a bit more about what you are trying to train? Are you trying to train a VSA + Wan2.2 MoE model? Yes, I used WAN2.2 + VSA...

[Bug] The low-noise model weights trained in WAN2.2 are then used for further low-noise VSA training, and the loss becomes Nan.

> Hopper GPU ok, I'll try out ThunderKittens on the H800 first. However, there's a strange phenomenon here: the high-noise model trains very well. The high-noise and low-noise models are the...

[Bug] The low-noise model weights trained in WAN2.2 are then used for further low-noise VSA training, and the loss becomes Nan.

> do you have a branch with your scripts? I think there may be a bug in our VSA triton kernel bwd, but I'm not sure if this is the...

[Bug] The low-noise model weights trained in WAN2.2 are then used for further low-noise VSA training, and the loss becomes Nan.

> Just for reference, the bug mentioned by [@SolitaryThinker](https://github.com/SolitaryThinker) was just fixed in the PR ([#879](https://github.com/hao-ai-lab/FastVideo/pull/879)). The new version of VSA Triton kernel might also be worth trying. I'll try...

[Bug] Why is VSA slower than Flash_attention when I run the training script examples/training/finetune/Wan2.1-VSA/Wan-Syn-Data/T2V-14B-VSA.slurm?

> If you’re using sparsity decay, then at the beginning of training sparsity is zero, so the model computes full attention, which is typically slower than a FlashAttention implementation. Thank...
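The sparsity-decay point in the reply above can be sketched as a stepwise ramp. This is a minimal illustration, not FastVideo's actual implementation: the function name `sparsity_at_step` and the default values for `target`, `rate`, and `interval` are assumptions made up for the example. It only shows why the first steps run at zero sparsity, i.e. full attention, before any speedup from sparse attention can appear.

```python
def sparsity_at_step(step: int,
                     target: float = 0.9,
                     rate: float = 0.03,
                     interval: int = 30) -> float:
    """Hypothetical stepwise sparsity ramp (all defaults are assumed values).

    Sparsity starts at 0.0 and grows by `rate` every `interval` steps
    until it is capped at `target`.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    # Number of completed intervals decides how far the ramp has advanced.
    return min(target, (step // interval) * rate)


if __name__ == "__main__":
    # Early steps have sparsity 0.0, so attention is effectively dense there.
    for step in (0, 29, 30, 300, 3000):
        print(step, round(sparsity_at_step(step), 2))
```

Under these assumed settings, benchmarking only the first few dozen steps would measure dense attention through the sparse code path, which matches the slowdown described in the reply; timing after the ramp reaches the target sparsity gives a fairer comparison.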