Question: Error when substituting the quantized matrix multiplication operator.
In AWQ inference, the quantized weight matrix is dequantized to fp16 and then multiplied by the input matrix x in the linear layer.
But when I directly replace the dequantized fp16 matrix with the original weight matrix (llama2-7b-hf), the inference error becomes very large. According to the calculation formula, W·X = (DQ(Q(W·s)) · s^-1) · X, where DQ fuses the scale s^-1. In that case, the dequantized matrix should be equivalent to the original matrix W.
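The round trip in the formula above can be sketched numerically. This is a hypothetical illustration with a generic asymmetric min/max group quantizer, not AutoAWQ's actual kernel; the names `w_bit` and `group_size` mirror the snippet below, and the per-channel scale `s` stands in for the AWQ scaling factor:

```python
import numpy as np

# Sketch of W ~= DQ(Q(W*s)) * s^-1 with a generic 4-bit group quantizer.
# Assumed/illustrative: the quantization scheme, shapes, and scale s.
rng = np.random.default_rng(0)
w_bit, group_size = 4, 8
W = rng.standard_normal((4, 32)).astype(np.float32)
s = np.abs(rng.standard_normal(32)).astype(np.float32) + 0.5  # per-channel scale

Ws = W * s                                    # scale the weights up by s
groups = Ws.reshape(4, -1, group_size)        # quantize per group
lo = groups.min(-1, keepdims=True)
hi = groups.max(-1, keepdims=True)
scale = (hi - lo) / (2**w_bit - 1)            # asymmetric min/max step size
q = np.clip(np.round((groups - lo) / scale), 0, 2**w_bit - 1)
dq = (q * scale + lo).reshape(4, 32)          # DQ(Q(W*s))
W_rec = dq / s                                # fold s^-1 back in

err = np.abs(W_rec - W).max()
print(err)  # small but nonzero: W comes back only approximately
```

So even when the scales cancel exactly, the recovered matrix matches W only up to the 4-bit rounding error, not bit-for-bit.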
From:
out = dequantize_gemm(qweight, qzeros, scales, w_bit, group_size)
out = torch.matmul(x, out)
to:
out = weight.T
out = torch.matmul(x, out)
where weight.T is the transpose of the original fp16 weight matrix (llama2-7b-hf).
The wikitext2 perplexity goes from 5.619 to 1324.6.
Hi @grysgreat, this seems to be expected. You cannot recover the original fp16 with a transpose since you have lost a bunch of information when you quantize -> dequantize -> transpose.
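A minimal illustration of why the information cannot be recovered: low-bit quantization is a many-to-one map, so distinct fp16 weights can collapse to the same 4-bit code. The scale and zero point here are made-up group parameters, not values from an actual checkpoint:

```python
import numpy as np

# Hypothetical 4-bit quantizer with illustrative group parameters.
scale, zero = 0.1, 8
q = lambda w: np.clip(np.round(w / scale) + zero, 0, 15)   # quantize
dq = lambda code: (code - zero) * scale                    # dequantize

w_a, w_b = 0.33, 0.28             # two different original weights
assert q(w_a) == q(w_b) == 11     # both collapse to the same 4-bit code
print(dq(q(w_a)))                 # ~0.3: neither original value comes back
```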
Thanks for your answer, but I'm still curious which part of the AutoAWQ algorithm prevents directly replacing the weight in the linear layer with the original weight (downloaded from hf). In AutoGPTQ, such a replacement gives the correct result -- the same accuracy as fp16.