Why does the model use less GPU memory after quantization, but inference is slower?

Open · Ant0082 opened this issue 2 years ago • 1 comment

I am using the vector-wise symmetric quantization method.

Feb 21 '23 08:02 Ant0082

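I don't know which quantization implementation you are using. One possibility is that only the weights are quantized and the computation is still done in FP16, in which case the weights have to be dequantized before every matrix multiplication, and that adds work. Another possible reason is that the hardware you are using doesn't support INT8 acceleration.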
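
To illustrate the weight-only case, here is a minimal NumPy sketch of vector-wise symmetric INT8 quantization. It is not the code from this repository: the function names (`quantize_rows`, `dequantize_rows`, `linear_weight_only`), the per-row scale layout, and the tensor shapes are all assumptions made for illustration. It shows that the stored weights shrink to half the size of FP16, while each forward pass still has to rebuild FP16 weights before the matrix multiplication.

```python
import numpy as np

def quantize_rows(w):
    """Vector-wise symmetric INT8 quantization: one scale per weight row (assumed layout)."""
    w32 = w.astype(np.float32)
    absmax = np.abs(w32).max(axis=1, keepdims=True)
    # Guard against all-zero rows so the division below is always defined.
    scale = np.where(absmax == 0, 1.0, absmax / 127.0).astype(np.float32)
    q = np.clip(np.rint(w32 / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_rows(q, scale):
    """Rebuild approximate FP16 weights from INT8 values and per-row scales."""
    return (q.astype(np.float32) * scale).astype(np.float16)

def linear_weight_only(x, q, scale):
    """Weight-only quantized linear layer: the matmul itself still runs in FP16."""
    w = dequantize_rows(q, scale)  # extra work on every call
    return x @ w.T

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 128)).astype(np.float16)  # hypothetical weight matrix
x = rng.standard_normal((4, 128)).astype(np.float16)   # hypothetical activations

q, scale = quantize_rows(w)
y = linear_weight_only(x, q, scale)

print("FP16 weight bytes:", w.nbytes)
print("INT8 weight bytes:", q.nbytes)
```

In this sketch the memory saving comes from storing `q` instead of `w`, but `linear_weight_only` does strictly more work than a plain FP16 layer. A speedup would need kernels that multiply in INT8 directly, on hardware that supports it.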

Feb 28 '23 03:02 duzx16