hiyijian
I had the same problem. We need a way to exclude the SW layer from O2, just like BN, but I have not found a proper way.
Thanks. Do you think the sparsity will be affected if the BN layers on the main branch are not penalized by the L1 norm? If yes, how? Thanks
Thanks
How about the final ROC performance on FDDB, please? Is it the same as the original one?
Yes. Does the plugin only support RoCE for now?
@paravmellanox is there any update now? Thanks
@addcloud I am not an expert in networking at all. I was stuck on enabling SRIOV for quite a long time. The reason for failing to enable it...
There is no network initialization in this repo. This is probably why we get totally different results with CUDA 10.2 and CUDA 9.2.
@danthe3rd I also need alibi support. For now, I pass ```bias = LowerTriangularMaskWithTensorBias(alibi_bias)``` to ```xops.memory_efficient_attention(..., attn_bias=bias)```. The forward pass alone is OK, but it fails in the backward pass in training mode. Is...
@borisfom Maybe another mismatch: wgrad_norm in your code is computed from "g + beta * w" (that is, after regularization), which is not exactly the same as the paper's "g".
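
To make the mismatch concrete, here is a minimal sketch in plain Python. It is not the code under discussion: the helper names `l2_norm` and `grad_norms` and all numeric values are made up for illustration. It only shows that the norm of "g + beta * w" generally differs from the norm of "g".

```python
import math


def l2_norm(vec):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def grad_norms(g, w, beta):
    """Return (norm of raw gradient g, norm of regularized gradient g + beta * w)."""
    regularized = [gi + beta * wi for gi, wi in zip(g, w)]
    return l2_norm(g), l2_norm(regularized)


# Hypothetical values, chosen only for illustration.
g = [3.0, 4.0]     # raw gradient, the paper's "g"
w = [10.0, 0.0]    # weights
beta = 0.1         # weight-decay coefficient

raw_norm, reg_norm = grad_norms(g, w, beta)
print(raw_norm, reg_norm)
```

With these values the two norms differ, so any ratio built on wgrad_norm would change depending on which one is used.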