
Need help with both model weight and activation quantization, given only a float32 mlmodel

Open AndreaChiChengdu opened this issue 1 year ago • 1 comment

From the forum thread "https://developer.apple.com/forums/thread/740518 how do we use the computational power of A17 Pro Neural Engine?"

I learned that if I want to run inference with my mlmodel at high performance on my iPad Pro (M4 SoC, whose Neural Engine offers 38 TOPS at int8), I have to use the Core ML Torch API to apply training-time quantization to both weights and activations with the int8 datatype.

My question is: I only have an fp32 mlmodel, without the original Torch code or model. What can I do? Also, with weight-only int8 quantization, will the M4 ANE compute in fp16 or int8? Thanks for your help~

AndreaChiChengdu avatar May 27 '24 08:05 AndreaChiChengdu

  • If you don't have the torch model, you will not be able to do training-aware quantization; instead, you can only run post-training quantization through the ct.optimize.coreml API.
  • Weight-only quantization will shrink the model by storing weights as int8, but at runtime the compute precision is still fp16.

jakesabathia2 avatar May 30 '24 20:05 jakesabathia2