Dequantizing int8 models to fp16
I have loaded an LLM with Hugging Face using load_in_8bit=True, and I noticed that the entries in the state_dict are structured something like:
- model.layers.18.self_attn.k_proj.weight
- model.layers.18.self_attn.k_proj.SCB
- model.layers.18.self_attn.k_proj.weight_format
The SCB and weight_format entries are present only in the quantized model. I think SCB refers to a scale (and bias?) that can help us recreate the original tensor, and weight_format is a string with the value "row". The Hugging Face integration guide mentions a .CB field in addition to the .SCB field, but I could not find it in the state_dict. Not sure if the codebase has changed since that was written?
Anyway, I am not sure about the exact method to dequantize the tensor to get back the original, but I tried the following:
(weight_SCB.unsqueeze(1) * weight) / 127
This gives a tensor that is close to the original model's weight (what I get without load_in_8bit=True), but not identical.
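If I quantize and dequantize a toy matrix myself with per-row absmax scaling, which is my assumption of what SCB and weight_format == "row" mean (not confirmed from the bitsandbytes source), I see the same behavior: close but not bit-exact, because rounding to int8 is lossy.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 8)).astype(np.float16)  # stand-in for an fp16 weight

# Per-row absmax quantization (assumed meaning of weight_format == "row"):
SCB = np.abs(W.astype(np.float32)).max(axis=1)  # per-row scale, shape (4,)
W_int8 = np.round(W.astype(np.float32) / SCB[:, None] * 127).astype(np.int8)

# Dequantization, same formula as above: scale * int8 / 127.
W_deq = (SCB[:, None] * W_int8.astype(np.float32)) / 127

# Close to the original, but the rounding step means it cannot be
# bit-exact; the per-element error is bounded by SCB / 254.
print(np.abs(W_deq - W.astype(np.float32)).max())
```

So if SCB really is the per-row absmax, an exact reconstruction would not be possible in general, only an approximation up to half a quantization step.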
I am not sure whether this is the correct approach for dequantization. It would be great if someone could point me to code or documentation on how to recreate the exact original tensor from the quantized weights.
As a follow-up question, I know that for some models there are outlier values that are kept unquantized even though the rest of the tensor is quantized. However, I could not find this information in the state_dict. How can we find and handle these values during the dequantization process?
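For context, my current reading of the LLM.int8() paper is that the outlier decomposition happens at matmul time based on the activation magnitudes (feature dimensions whose values exceed a threshold, 6.0 by default, are run in fp16), which might be why nothing outlier-related appears in the state_dict. Here is a toy sketch of that mixed int8/fp16 idea as I understand it; the column split and threshold are my assumptions, not read from the bitsandbytes source.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((2, 8)).astype(np.float32)  # toy activations
W = rng.standard_normal((8, 4)).astype(np.float32)  # toy weight
X[0, 3], X[1, 3] = 15.0, -12.0                      # make feature 3 an outlier dim

THRESHOLD = 6.0  # default outlier threshold from the LLM.int8() paper
outlier_cols = np.abs(X).max(axis=0) > THRESHOLD    # feature dims with outliers

# Outlier feature dims are multiplied in full precision;
# the remaining dims go through int8 with absmax scaling.
X_out, W_out = X[:, outlier_cols], W[outlier_cols, :]
X_reg, W_reg = X[:, ~outlier_cols], W[~outlier_cols, :]

def quantize(a, axis):
    scale = np.abs(a).max(axis=axis, keepdims=True)
    return np.round(a / scale * 127).astype(np.int8), scale

Xq, sx = quantize(X_reg, axis=1)  # row-wise for activations
Wq, sw = quantize(W_reg, axis=0)  # column-wise for weights

int8_part = (Xq.astype(np.int32) @ Wq.astype(np.int32)) * sx * sw / (127 * 127)
Y = int8_part + X_out @ W_out     # mixed-precision result, close to X @ W
print(np.abs(Y - X @ W).max())    # small residual quantization error
```

If that understanding is right, the outliers would only matter for reproducing the matmul outputs, not for dequantizing the stored weight tensor itself, but I would appreciate confirmation.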