ZeroQuant not compressing and making BERT slower
**Describe the bug**
I was expecting a compressed & faster BERT model after running the BERT ZeroQuant example in DeepSpeedExamples. However, the clean model isn't any smaller (still 417.7 MB) or faster (in fact, it's slower) than the original.
**To Reproduce**
- Go to Google Colab and switch to a GPU runtime
- Run the following:
  ```shell
  pip install deepspeed==0.7.0
  git clone https://github.com/microsoft/DeepSpeedExamples
  cd DeepSpeedExamples/model_compression/bert
  ```
- In the `zero_quant.sh` file, change `master_port` (e.g. to 9995), and set the task to `sst2` and `eval_batch_size` to 32 (otherwise you'll get CUDA out of memory)
- Run:
  ```shell
  bash bash_script/ZeroQuant/zero_quant.sh
  ```
**Expected behavior**
I expected the final clean model to be a compressed version of the original one, and therefore smaller and faster, but it isn't.
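For context, the size and speed comparison above can be checked with helpers along these lines. This is a minimal sketch: the checkpoint path `clean/pytorch_model.bin` is an assumption for illustration, not necessarily the exact path the example script writes.

```python
import os
import time


def file_size_mb(path):
    """Return the on-disk size of a checkpoint file in MB."""
    return os.path.getsize(path) / (1024 * 1024)


def time_inference(fn, warmup=2, iters=10):
    """Average wall-clock latency of fn() over `iters` runs, after warmup."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters


# Hypothetical checkpoint path; adjust to wherever your run saves the model.
ckpt = "clean/pytorch_model.bin"
if os.path.exists(ckpt):
    print(f"checkpoint size: {file_size_mb(ckpt):.1f} MB")
```

Comparing `file_size_mb` on the original and "clean" checkpoints, and `time_inference` on a forward pass with each model, is how the "not smaller, not faster" observation above was reached.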
**ds_report output**

**System info (please complete the following information):**
- OS: Ubuntu 18.04.6 LTS
- GPU: 1× Tesla T4
- Python: tried with both 3.7.13 and 3.9
Hey @K2triinK,

I am wrapping up this PR, which addresses part of your questions, such as the model-size reduction. Regarding the kernels, we are working on a plan to release them soon so that you can give them a try.

Thanks,
Reza