Peng Xiao


If you’re using sparsity decay, sparsity is zero at the beginning of training, so the kernel effectively computes full (dense) attention, which is typically slower than a dedicated FlashAttention implementation.
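As a rough illustration of why the early steps run at full-attention cost, here is a minimal sketch of a linear sparsity-decay schedule; the function name, decay length, and target sparsity are assumptions for illustration, not VSA's actual schedule.

```python
def sparsity_at_step(step: int, decay_steps: int, target_sparsity: float) -> float:
    """Hypothetical schedule: ramp sparsity linearly from 0 to `target_sparsity`.

    Early in training this returns ~0, so a block-sparse attention kernel keeps
    (almost) every block and effectively does dense attention -- at that point a
    dense kernel such as FlashAttention is usually faster.
    """
    if step >= decay_steps:
        return target_sparsity
    return target_sparsity * step / decay_steps

# Example: with a 2000-step ramp toward 90% sparsity (assumed numbers),
# the first few hundred steps run nearly dense.
for step in (0, 500, 1000, 2000):
    print(step, sparsity_at_step(step, decay_steps=2000, target_sparsity=0.9))
```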

> For the 1.3B model, we use 61 frames for training. There appears to be a potential gradient bug in the VSA CUDA kernel with the 1.3B + 77×448×832 configuration....
