GLM
Why does the model occupy less GPU memory after quantization, yet run slower at inference?
I am using the vector-wise symmetric quantization method.
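For reference, vector-wise symmetric quantization assigns one scale per weight row (or column), chosen so the largest absolute value in that vector maps to the INT8 limit of 127. The sketch below is a minimal NumPy illustration of this idea, not the exact implementation used here; the function names are my own.

```python
import numpy as np

def quantize_vector_wise(weight: np.ndarray):
    """Symmetric per-row (vector-wise) INT8 quantization.

    Each row gets its own scale, so an outlier in one row does not
    degrade the precision of the others.
    """
    # Scale so the largest |value| in each row maps to 127;
    # guard against all-zero rows to avoid division by zero.
    scales = np.maximum(np.abs(weight).max(axis=1, keepdims=True) / 127.0, 1e-8)
    q = np.clip(np.round(weight / scales), -127, 127).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scales

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_vector_wise(w)
w_hat = dequantize(q, s)
# Round-trip error is bounded by half a quantization step per row.
```

Because the scheme is symmetric (no zero-point), dequantization is a single multiply, which keeps the kernels simple.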
I don't know which quantization method you are using. One possibility is that only the weights are quantized while the computation is still performed in FP16, so each forward pass pays an extra dequantization cost. Another possible reason is that your hardware does not support INT8 acceleration.
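To make the weight-only case concrete: if the INT8 weights are dequantized back to floating point before every matmul, memory drops but the GEMM itself gains nothing, and the per-call dequantization adds overhead. A hedged NumPy sketch of that pattern (illustrative only, not the actual GLM kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8)).astype(np.float32)

# Store weights in INT8 with one symmetric scale per row: a memory win.
scales = np.abs(w).max(axis=1, keepdims=True) / 127.0
q_w = np.round(w / scales).astype(np.int8)

def matmul_weight_only(x, q_weight, s):
    # Dequantize right before the matmul: the GEMM still runs in
    # floating point, so there is no INT8 speedup, and this extra
    # conversion runs on every forward pass.
    return x @ (q_weight.astype(np.float32) * s).T

x = rng.standard_normal((2, 8)).astype(np.float32)
y = matmul_weight_only(x, q_w, scales)
print(y.shape)  # (2, 8)
```

A real speedup requires either fusing the dequantization into the matmul kernel or running the GEMM itself in INT8, which in turn needs hardware INT8 support (e.g. tensor cores on recent NVIDIA GPUs).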