Research: 4-bit quantization
Hi.
The paper describes 8-bit quantization combined with pruning, which is fantastic.
My question: has any research been done on 4-bit quantization? Since GPU memory is notoriously expensive, 4-bit quantization would allow running much bigger models (e.g. 70B models that require low latency and are therefore run on the GPU).
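As a rough back-of-the-envelope sketch of the memory argument (weights only, ignoring activations, the KV cache, and quantization metadata such as scales/zero-points):

```python
# Rough weight-memory estimate for a 70B-parameter model at different precisions.
params = 70e9

for bits in (16, 8, 4):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>2}-bit weights: ~{gib:.0f} GiB")

# 16-bit weights: ~130 GiB
#  8-bit weights: ~65 GiB
#  4-bit weights: ~33 GiB
```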
I'd be happy to contribute if someone could provide implementation guidance.
Hi @truenorth8. Thanks for your interest in contributing to deepsparse. There has actually been plenty of research into quantizing LLMs to 4 bits. Some of the most notable examples are GPTQ and its subsequent extension SparseGPT. These algorithms are available in the nightly version of SparseML, our library for sparsifying LLMs. Here's a link to the main entry point: https://github.com/neuralmagic/sparseml/blob/main/src/sparseml/transformers/sparsification/obcq/obcq.py.
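For anyone curious what "4-bit weight quantization" amounts to numerically, here is a minimal round-to-nearest sketch in plain NumPy. This is an illustration only, not the SparseML implementation: GPTQ/SparseGPT go further by compensating the rounding error column by column using second-order (Hessian) information, and the actual entry point is the obcq.py script linked above.

```python
import numpy as np

def quantize_4bit_symmetric(w: np.ndarray):
    """Naive per-row symmetric round-to-nearest 4-bit quantization (illustration only)."""
    qmax = 7  # use the symmetric signed-int4 range [-7, 7]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Toy weight matrix: 4 output rows, 64 input columns.
w = np.random.randn(4, 64).astype(np.float32)
q, scale = quantize_4bit_symmetric(w)
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```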
Just to clarify, DeepSparse is our inference engine that supports fast execution of sparse and quantized models. At the moment DeepSparse supports CPUs only. However, you can use SparseML to produce compressed models and deploy them on any platform.
@anmarques will 4-bit quantization also come to YOLO models?
Hi @Fritskee, I don't see a strong motivation for <8-bit YOLO models, since going to lower precision mainly reduces weight memory usage, which is not the bottleneck for YOLO. When optimizing those architectures for performance, you want to reduce the size of the large activations (images) or reduce the compute needed to perform the convolutions. This is why for YOLO we apply 8-bit quantization and sparsity to reduce compute; see some models here: https://sparsezoo.neuralmagic.com/?modelSet=computer_vision&tasks=detection&architectures=yolov8
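To make that point concrete with hypothetical layer shapes (not taken from any specific YOLO model), the activation tensor of an early conv layer is typically far larger than its weight tensor, so halving weight precision barely moves total memory or compute:

```python
# Hypothetical early-layer shapes, chosen only to illustrate the ratio.
c_in, c_out, k = 64, 128, 3   # 3x3 conv, 64 -> 128 channels
h, w = 320, 320               # feature-map resolution
batch = 1

weights = c_in * c_out * k * k        # ~74K values
activations = batch * c_out * h * w   # ~13.1M values

print(f"weight values:     {weights:,}")
print(f"activation values: {activations:,}")
print(f"activations are ~{activations / weights:.0f}x larger than weights")
# Going from 8-bit to 4-bit weights saves only ~36 KB here, while the
# activation memory and the convolution compute are left untouched.
```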
Thanks for the explanation!
Hello @Fritskee, as there are no further comments here, I am going to go ahead and close out this issue. Feel free to re-open if you would like to continue the conversation. Regards, Jeannie / Neural Magic