ZeroQuant not compressing and making BERT slower
**Describe the bug**
I was expecting a compressed & faster BERT model after running the BERT ZeroQuant example in DeepSpeedExamples. However, the clean model isn't any smaller (still 417.7 MB) or faster (in fact, it's slower) than the original.
**To Reproduce**
- Go to Google Colab and switch to a GPU runtime
- Run the following:
  ```shell
  pip install deepspeed==0.7.0
  git clone https://github.com/microsoft/DeepSpeedExamples
  cd DeepSpeedExamples/model_compression/bert
  ```
- In the `zero_quant.sh` file, change `master_port` (e.g. to 9995), and set the task to `sst2` and `eval_batch_size` to 32 (otherwise you'll get CUDA out of memory)
- Run:
  ```shell
  bash bash_script/ZeroQuant/zero_quant.sh
  ```
**Expected behavior**
I expected the final clean model to be a compressed version of the original one, and therefore smaller and faster, but it isn't.
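For context, the size and speed comparison above can be checked with helpers along these lines. This is a minimal sketch: the checkpoint path `clean/pytorch_model.bin` is an assumption for illustration, not necessarily the exact path the example script writes.

```python
import os
import time


def file_size_mb(path):
    """Return the on-disk size of a checkpoint file in MB."""
    return os.path.getsize(path) / (1024 * 1024)


def time_inference(fn, warmup=2, iters=10):
    """Average wall-clock latency of fn() over `iters` runs, after warmup."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


# Hypothetical checkpoint path; adjust to wherever your run saves the model.
ckpt = "clean/pytorch_model.bin"
if os.path.exists(ckpt):
    print(f"checkpoint size: {file_size_mb(ckpt):.1f} MB")
```

Comparing `file_size_mb` on the original and "clean" checkpoints, and `time_inference` on a forward pass with each model, is how the "not smaller, not faster" observation above was reached.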
**ds_report output**

**System info (please complete the following information):**
- OS: Ubuntu 18.04.6 LTS
- GPU: 1× Tesla T4
- Python: tried with both 3.7.13 and 3.9
Hey @K2triinK,

I am wrapping up this PR, which addresses part of your questions, such as the model-size reduction. Regarding the kernels, we are working on a plan to release them soon so that you can give them a try.

Thanks,
Reza