Why is the gradient scaling factor multiplied before quantization?
https://github.com/Tiiiger/QPyTorch/blob/ed0d8b17680254799f2f3960e9e7f848b8bb9db4/qtorch/optim/optim_low.py#L81
In OptimLP, the gradient scaling factor is multiplied before quantization. However, gradient scaling is meant to prevent possible underflow of the low-precision quantized gradient values, and I don't think the current implementation can prevent that underflow.
Maybe the correct implementation is to multiply by the scaling factor after quantization, e.g.:
p.grad.data = self.grad_quant(p.grad.data) * self.grad_scaling
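To make the concern concrete, here is a minimal sketch of the arithmetic being described. It does not use QPyTorch's quantization kernels; instead it uses a toy round-to-nearest fixed-point quantizer, and the specific numbers (word length 8, fractional length 4, a loss scale of 1000, a raw gradient of 0.25) are assumptions chosen purely for illustration.

```python
import torch

def toy_fixed_point_quantize(x, wl=8, fl=4):
    """Toy round-to-nearest fixed-point quantizer (not QPyTorch's kernel).

    wl = total word length in bits, fl = fractional bits,
    so the smallest representable step is 2**-fl.
    """
    step = 2.0 ** -fl
    bound = 2.0 ** (wl - fl - 1) - step  # clamping range of the format
    return torch.clamp(torch.round(x / step) * step, -bound, bound)

# Suppose the loss was scaled by 1000, so p.grad holds 0.25 while the
# "true" gradient is 0.00025, and grad_scaling undoes the loss scale.
grad = torch.tensor([0.25])
grad_scaling = 1.0 / 1000.0

# Scale first, then quantize: 0.00025 is far below the 2**-4 step,
# so it rounds to zero (underflow).
print(toy_fixed_point_quantize(grad * grad_scaling))  # tensor([0.])

# Quantize first, then scale: 0.25 is representable, and the scaled
# result (about 0.00025) survives in the higher-precision buffer.
print(toy_fixed_point_quantize(grad) * grad_scaling)  # nonzero, ~0.00025
```

Under these assumed numbers the two orderings give different results, which is the behavior I'm asking about.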