Jayce0625
> Hi,
>
> 1. When alpha is updated, are the elements that change the ones with large alpha gradients or small ones?

Lines 10-12 of Algorithm 1 in the paper give the exact procedure. In line 10, `TopKMask` takes the K positions with the largest accumulated gradients and generates a 0-1 mask Mt. Mt = 0 marks positions to keep, and Mt = 1 marks positions to prune; that is, positions with large accumulated gradients are the ones pruned.

> 2. Why is the importance of parameters whose grad is larger than the threshold small enough that they can decay to 0?

First, note that Progressive Pruning involves two different masks. We do not want values marked for pruning to jump from their original value directly to 0 (which is what multiplying the binarized Mt into the model would do); instead we want them to decay to 0 gradually. So we smooth the binarized Mt into a continuous real-valued mask ζt, and ζt is the mask actually multiplied into the model.

All the masks...
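The two-mask procedure described above can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's implementation: the `topk_mask` step follows the description of line 10 of Algorithm 1, while the linear decay schedule in `smooth_mask` is a placeholder assumption (the actual relaxation of Mt into ζt may use a different schedule).

```python
def topk_mask(accum_grad, k):
    """0-1 mask over accumulated gradients: 1 at the k positions with the
    largest |gradient| (to be pruned), 0 elsewhere (to be kept)."""
    order = sorted(range(len(accum_grad)),
                   key=lambda i: abs(accum_grad[i]), reverse=True)
    pruned = set(order[:k])
    return [1.0 if i in pruned else 0.0 for i in range(len(accum_grad))]

def smooth_mask(binary_mask, t, total_steps):
    """Relax the binary M_t into a continuous zeta_t: kept positions
    (M=0) stay at 1, pruned positions (M=1) decay gradually toward 0
    instead of jumping to 0. A linear schedule is used for illustration."""
    decay = 1.0 - t / total_steps  # 1 -> 0 as training proceeds
    return [(1.0 - m) + m * decay for m in binary_mask]

grads = [0.1, 3.0, 0.5, 2.0]
m = topk_mask(grads, k=2)                       # prunes positions of 3.0 and 2.0
zeta = smooth_mask(m, t=5, total_steps=10)      # pruned entries halfway to 0
```

Multiplying `zeta` (rather than `m`) into the model weights is what makes the pruning progressive: halfway through training, pruned weights are scaled by 0.5 rather than zeroed outright.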
You can try installing it with `pip install flash-attn --no-build-isolation`, or install it by compiling the source code (`python setup.py install`); you can also directly download the whl package compiled...
> I am currently facing the same issue. I am using NVIDIA's PyTorch docker: nvcr.io/nvidia/pytorch:23.08-py3. Inside it, flash-attn version 2.0.4 is already installed. Both nvcc and torch are based on...