
How to quantize a pruned model

Open LLNLanLeN opened this issue 4 years ago • 9 comments

Hi,

So far I've been testing AIMET's model compression features (mainly channel pruning only). After performing channel pruning (auto mode), I then perform post-training quantization (CLE + BC) followed by QAT. What I've observed so far is that the compressed model loses most of its accuracy after post-training quantization, and it doesn't improve after fine-tuning with QAT.

I'm wondering what the best strategy is for quantizing a pruned model, or more generally, what the correct procedure is for combining model compression and quantization to get the best accuracy. (Is it possible to prune an already-quantized model using the AIMET API?)

Note: the reason I didn't perform SVD compression followed by channel pruning is inference time on hardware. SVD reduces overall MACs but doesn't seem to help with inference time, while channel pruning reduces both MACs and inference time, hence I opted to use channel pruning only.
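
For reference, the pipeline described above looks roughly like the sketch below (AIMET for PyTorch; exact module paths and argument names vary between AIMET releases, and `model`, `train_loader`, `evaluate`, and `calibrate` are placeholders for my own model and callbacks, so treat this as a sketch rather than a drop-in script):

```python
from decimal import Decimal
import torch

from aimet_common.defs import CompressionScheme, CostMetric, GreedySelectionParameters, QuantScheme
from aimet_torch.defs import ChannelPruningParameters
from aimet_torch.compress import ModelCompressor
from aimet_torch.cross_layer_equalization import equalize_model
from aimet_torch.bias_correction import correct_bias
from aimet_torch.quantsim import QuantizationSimModel, QuantParams

# 1) Channel pruning in auto mode (greedy per-layer ratio selection)
greedy = GreedySelectionParameters(target_comp_ratio=Decimal('0.5'),
                                   num_comp_ratio_candidates=10)
cp_params = ChannelPruningParameters(data_loader=train_loader,
                                     num_reconstruction_samples=500,
                                     allow_custom_downsample_ops=True,
                                     mode=ChannelPruningParameters.Mode.auto,
                                     params=ChannelPruningParameters.AutoModeParams(greedy))
compressed_model, stats = ModelCompressor.compress_model(
    model, eval_callback=evaluate, eval_iterations=100,
    input_shape=(1, 3, 224, 224),
    compress_scheme=CompressionScheme.channel_pruning,
    cost_metric=CostMetric.mac, parameters=cp_params)

# 2) Post-training quantization on the pruned model: CLE + bias correction
equalize_model(compressed_model, input_shapes=(1, 3, 224, 224))
correct_bias(compressed_model,
             QuantParams(weight_bw=8, act_bw=8, round_mode='nearest',
                         quant_scheme=QuantScheme.post_training_tf_enhanced),
             num_quant_samples=1000, data_loader=train_loader,
             num_bias_correct_samples=512)

# 3) Simulate quantization, calibrate, then fine-tune (QAT)
sim = QuantizationSimModel(compressed_model, dummy_input=torch.randn(1, 3, 224, 224),
                           default_param_bw=8, default_output_bw=8)
sim.compute_encodings(forward_pass_callback=calibrate, forward_pass_callback_args=None)
# ... then train sim.model with the usual training loop for QAT
```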

LLNLanLeN commented Nov 04 '21 14:11

Hi @LLNLanLeN, thank you for sharing your observations and the query. A couple of follow-up questions: 1) Was fine-tuning performed after channel pruning? What FP32 accuracy was achieved after channel pruning, and was it satisfactory? 2) It is possible that CLE is not applicable to this model. Could you check whether it has ReLU6, and what the FP32 accuracy is after replacing ReLU6 with ReLU (API example)? That will tell whether CLE can be used in this case. Also, you could instead try applying AdaRound and then performing QAT; a rough sketch is below. Could you please share your observations? Thanks.
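
For concreteness, the ReLU6 check and the AdaRound-then-QAT path could look roughly like this (argument names may differ per AIMET release; `model`, `calib_loader`, and `calibrate` are placeholders):

```python
import torch
from aimet_torch import utils
from aimet_common.defs import QuantScheme
from aimet_torch.adaround.adaround_weight import Adaround, AdaroundParameters
from aimet_torch.quantsim import QuantizationSimModel

# Replace ReLU6 with ReLU so CLE's scaling assumptions hold, then re-check FP32 accuracy
utils.replace_modules_of_type1_with_type2(model, torch.nn.ReLU6, torch.nn.ReLU)

# Apply AdaRound to the weights and freeze its encodings inside QuantSim
dummy_input = torch.randn(1, 3, 224, 224)
ada_params = AdaroundParameters(data_loader=calib_loader, num_batches=16)
ada_model = Adaround.apply_adaround(model, dummy_input, ada_params,
                                    path='./', filename_prefix='adaround',
                                    default_param_bw=8,
                                    default_quant_scheme=QuantScheme.post_training_tf_enhanced)

sim = QuantizationSimModel(ada_model, dummy_input=dummy_input,
                           default_param_bw=8, default_output_bw=8)
sim.set_and_freeze_param_encodings(encoding_path='./adaround.encodings')
sim.compute_encodings(forward_pass_callback=calibrate, forward_pass_callback_args=None)
# QAT: fine-tune sim.model with the regular training loop afterwards
```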

quic-ssiddego commented Nov 08 '21 22:11

You should fine-tune the model first after pruning. When pruning removes filters, the corresponding feature maps are removed as well, so the model is effectively broken; you need to fine-tune it for 10-15 epochs, which also updates the statistics you need for quantization. With tensor decomposition (SVD here) you increase the depth of the model, which is why your inference time increases. The way to solve this is to look deeper into the hardware you are going to deploy the model on: you can often improve the inference time by understanding how the compiler handles the IR generated from the computational graph, and it is usually easy to solve at that level.
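
In plain PyTorch terms, the suggestion is simply to fine-tune the pruned model for a few epochs before doing any quantization. A minimal sketch (the optimizer, schedule, `compressed_model`, and `train_loader` are placeholders to adapt to your task):

```python
import torch

optimizer = torch.optim.SGD(compressed_model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

compressed_model.train()
for epoch in range(15):                      # roughly the 10-15 epochs suggested above
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(compressed_model(images), labels)
        loss.backward()
        optimizer.step()

# Only after accuracy has recovered, run calibration / quantization on the fine-tuned model.
```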

Silk760 commented Dec 15 '21 18:12

@Silk760 @quic-ssiddego I forgot to reply back to this thread. All of these recommendations are valid points, and by the time I posted this issue I had already tried all of these methods (AdaRound, fine-tuning after pruning and then quantizing, etc.). What I found is that it's not really how the overall model was compressed that leads to bad results; it's how certain layers are compressed.

After a lot of experiments, finally with some success, what I found is that when layers are compressed too aggressively (for example, from 1xxx channels down to 1xx channels), none of the recommendations help recover the accuracy. Hence it's important for me to pick the compression ratios so that the compressed layers aren't too small; that is what made it likely to recover the accuracy after pruning + quantization. A rough sketch of that per-layer control is below.
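
A hypothetical sketch of what that looks like with AIMET's manual channel-pruning mode, keeping wide layers from being pruned too hard (the specific layers, ratios, and callbacks below are made-up placeholders, and module paths may differ by AIMET version):

```python
from decimal import Decimal
from aimet_common.defs import CompressionScheme, CostMetric
from aimet_torch.defs import ChannelPruningParameters, ModuleCompRatioPair
from aimet_torch.compress import ModelCompressor

# Assign milder ratios to layers that would otherwise shrink too much
# (e.g. from ~1xxx channels down to ~1xx channels under auto mode).
pairs = [ModuleCompRatioPair(model.layer3[0].conv2, Decimal('0.75')),
         ModuleCompRatioPair(model.layer4[0].conv2, Decimal('0.9'))]
manual = ChannelPruningParameters.ManualModeParams(pairs)
params = ChannelPruningParameters(data_loader=train_loader,
                                  num_reconstruction_samples=500,
                                  allow_custom_downsample_ops=True,
                                  mode=ChannelPruningParameters.Mode.manual,
                                  params=manual)

compressed_model, stats = ModelCompressor.compress_model(
    model, eval_callback=evaluate, eval_iterations=100,
    input_shape=(1, 3, 224, 224),
    compress_scheme=CompressionScheme.channel_pruning,
    cost_metric=CostMetric.mac, parameters=params)
```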

LLNLanLeN commented Jan 24 '22 16:01

I have a paper about this, called ultimate compression, which explains how to do that: the idea is to study the sensitivity of each layer to pruning or quantization. I did it for a very aggressively quantized model (1-bit), but it can work with other compression methods as well. The idea is to study how sensitive the layers are to the method you want to apply, and do it layer by layer.

Silk760 commented Jan 24 '22 17:01

@Silk760 If you can share that paper with me, that would be great. I'm very new to model compression, and any resource will be helpful. Thanks 😃

LLNLanLeN commented Jan 24 '22 17:01

It's under review; I will share it with you when it is published. But if you want any help, post the code and I can help you with it. The idea is just to go layer by layer and study how the model is affected; there really is no general way of doing pruning. Pruning with L1, for example, might work well for classification models but will not be efficient for object detection models or other applications. This field is actually full of ideas and is connected with robustness, interpretability, and the theory of deep neural networks. One idea to make the work easier for you is to use optimization algorithms, either multi-objective or single-objective, to construct models that fit your goals. Keep sharing, and I will be able to help you.
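
Not from the paper above, just an illustration of the layer-by-layer idea: a self-contained sensitivity sweep that fake-quantizes one layer's weights at a time and records the accuracy drop (`evaluate` is a placeholder returning validation accuracy; the same loop structure works for pruning one layer at a time instead):

```python
import copy
import torch

def fake_quantize_weight(weight, num_bits=8):
    """Naive symmetric per-tensor fake quantization of a weight tensor."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = weight.abs().max() / qmax
    if scale == 0:
        return weight
    return torch.round(weight / scale).clamp(-qmax - 1, qmax) * scale

def quantization_sensitivity(model, evaluate, num_bits=8):
    """Quantize one Conv/Linear layer at a time and measure the accuracy drop."""
    baseline = evaluate(model)
    sensitivity = {}
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            candidate = copy.deepcopy(model)
            target = dict(candidate.named_modules())[name]
            with torch.no_grad():
                target.weight.copy_(fake_quantize_weight(target.weight, num_bits))
            sensitivity[name] = baseline - evaluate(candidate)
    # Most sensitive layers first: candidates for higher precision or milder pruning
    return sorted(sensitivity.items(), key=lambda kv: kv[1], reverse=True)
```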

Silk760 commented Jan 24 '22 17:01

Look at this: https://arxiv.org/pdf/1803.03635.pdf (The Lottery Ticket Hypothesis)

Silk760 commented Jan 24 '22 17:01

@Silk760 Good luck with your under-review paper, and I appreciate the feedback so far.

LLNLanLeN commented Jan 24 '22 17:01

Hi @Silk760, thank you for sharing the details from your study. With regard to: "Hence it's important for me to pick the compression ratios so that the compressed layers aren't too small... I have a paper about this, called ultimate compression, which explains how to do that: the idea is to study the sensitivity of each layer to pruning or quantization... The idea is to study how sensitive the layers are to the method you want to apply, and do it layer by layer."

I have some questions:

1. When would this paper be available to look at? =)
2. Would it be possible for you to contribute this to AIMET? This could be an alternate algorithm (to the existing greedy one, for example) for selecting layers that we could plug into the existing pruning support in AIMET. Please share your thoughts! Thank you.

quic-ssiddego commented Jan 25 '22 03:01

Closing this issue due to inactivity. Please re-open it or create a new issue if you need further help.

quic-mangal commented Apr 04 '23 16:04