Why does the model occupy less GPU memory after quantization, but the inference speed is slower?

Open Ant0082 opened this issue 2 years ago • 1 comments

I am using the vector-wise symmetric quantization method.
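For readers unfamiliar with the term, here is a minimal NumPy sketch of what vector-wise symmetric quantization usually means: each row of the weight matrix gets its own scale, and there is no zero point. The function names and the choice of one scale per row are assumptions made for illustration, not the code used in this issue.

```python
import numpy as np

def quantize_rowwise_symmetric(w, num_bits=8):
    """Quantize each row of w with its own symmetric scale (no zero point)."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for INT8
    # One scale per row, derived from that row's largest magnitude.
    absmax = np.abs(w).max(axis=1, keepdims=True)
    # Guard against all-zero rows to avoid dividing by zero.
    scale = np.where(absmax == 0, 1.0, absmax / qmax).astype(np.float32)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map INT8 codes back to approximate float32 values."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, scale = quantize_rowwise_symmetric(w)
w_hat = dequantize(q, scale)

print("bytes before:", w.nbytes, "bytes after:", q.nbytes + scale.nbytes)
print("max abs reconstruction error:", float(np.abs(w - w_hat).max()))
```

The INT8 codes take a quarter of the float32 storage (half of FP16), which is where the memory saving comes from.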

Ant0082 avatar Feb 21 '23 08:02 Ant0082

I don't know which quantization implementation you are using. It may be that only the weights are quantized and the computation is still performed in FP16, in which case the weights have to be dequantized at inference time and that adds overhead. Another possible reason is that the hardware you are using doesn't support INT8 acceleration.
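To make the first possibility concrete, here is a rough NumPy sketch of a weight-only quantized linear layer. It is an illustration under assumptions (per-row symmetric INT8 weights, hypothetical function names), not the kernels of any particular library. The weights are stored as INT8, but every forward pass first rebuilds FP16 weights and then runs the same FP16 matmul as the unquantized layer, so memory goes down while the amount of work goes up.

```python
import numpy as np

def quantize_weight(w_fp16):
    """Per-row symmetric INT8 quantization of an FP16 weight matrix."""
    w = w_fp16.astype(np.float32)
    absmax = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(absmax == 0, 1.0, absmax / 127.0)
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)

def linear_weight_only(x_fp16, q, scale):
    """Forward pass when only the weights are quantized."""
    # Extra step that a plain FP16 layer does not have:
    # rebuild FP16 weights from the INT8 codes.
    w_fp16 = q.astype(np.float16) * scale
    # The matmul itself still runs in FP16, not INT8.
    return x_fp16 @ w_fp16.T

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float16)  # (out, in)
x = rng.standard_normal((8, 128)).astype(np.float16)   # (batch, in)

q, scale = quantize_weight(w)
y = linear_weight_only(x, q, scale)

stored_bytes = q.nbytes + scale.nbytes
print("FP16 weight bytes:", w.nbytes, "quantized bytes:", stored_bytes)
print("output dtype:", y.dtype, "shape:", y.shape)
```

A speedup would only be expected if the matmul itself ran in INT8 on hardware with INT8 support, which this sketch does not do.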

duzx16 avatar Feb 28 '23 03:02 duzx16