Why is the gradient scaling factor multiplied before quantization?
https://github.com/Tiiiger/QPyTorch/blob/ed0d8b17680254799f2f3960e9e7f848b8bb9db4/qtorch/optim/optim_low.py#L81
In OptimLP, the gradient scaling factor is multiplied before quantization. However, gradient scaling is meant to prevent possible underflow of the low-precision quantized gradient values, and I don't think the current implementation can prevent that underflow.
Maybe the correct implementation is to multiply by the scaling factor after quantization, e.g.:
p.grad.data = self.grad_quant(p.grad.data) * self.grad_scaling
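To make the concern concrete, here is a minimal sketch of the arithmetic being described. It does not use QPyTorch's quantization kernels; instead it uses a toy round-to-nearest fixed-point quantizer, and the specific numbers (word length 8, fractional length 4, a loss scale of 1000, a raw gradient of 0.25) are assumptions chosen purely for illustration.

```python
import torch

def toy_fixed_point_quantize(x, wl=8, fl=4):
    """Toy round-to-nearest fixed-point quantizer (not QPyTorch's kernel).

    wl = total word length in bits, fl = fractional bits,
    so the smallest representable step is 2**-fl.
    """
    step = 2.0 ** -fl
    bound = 2.0 ** (wl - fl - 1) - step  # clamping range of the format
    return torch.clamp(torch.round(x / step) * step, -bound, bound)

# Suppose the loss was scaled by 1000, so p.grad holds 0.25 while the
# "true" gradient is 0.00025, and grad_scaling undoes the loss scale.
grad = torch.tensor([0.25])
grad_scaling = 1.0 / 1000.0

# Scale first, then quantize: 0.00025 is far below the 2**-4 step,
# so it rounds to zero (underflow).
print(toy_fixed_point_quantize(grad * grad_scaling))  # tensor([0.])

# Quantize first, then scale: 0.25 is representable, and the scaled
# result (about 0.00025) survives in the higher-precision buffer.
print(toy_fixed_point_quantize(grad) * grad_scaling)  # nonzero, ~0.00025
```

Under these assumed numbers the two orderings give different results, which is the behavior I'm asking about.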