Peng Xiao


If you’re using sparsity decay, sparsity is zero at the beginning of training, so the kernel effectively computes full (dense) attention, which is typically slower than a dedicated FlashAttention implementation.
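As a rough illustration of why the early steps run at full-attention cost, here is a minimal sketch of a linear sparsity-decay schedule; the function name, decay length, and target sparsity are assumptions for illustration, not VSA's actual schedule.

```python
def sparsity_at_step(step: int, decay_steps: int, target_sparsity: float) -> float:
    """Hypothetical schedule: ramp sparsity linearly from 0 to `target_sparsity`.

    Early in training this returns ~0, so a block-sparse attention kernel keeps
    (almost) every block and effectively does dense attention -- at that point a
    dense kernel such as FlashAttention is usually faster.
    """
    if step >= decay_steps:
        return target_sparsity
    return target_sparsity * step / decay_steps

# Example: with a 2000-step ramp toward 90% sparsity (assumed numbers),
# the first few hundred steps run nearly dense.
for step in (0, 500, 1000, 2000):
    print(step, sparsity_at_step(step, decay_steps=2000, target_sparsity=0.9))
```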

> For the 1.3B model, we use 61 frames for training. There appears to be a potential gradient bug in the VSA CUDA kernel with the 1.3B + 77×448×832 configuration....
