Research: 4-bit quantization
Hi.
The paper describes 8-bit quantization combined with pruning, which is fantastic.
My question: has any research been done on 4-bit quantization? Since GPU memory is notoriously expensive, 4-bit quantization would allow running much bigger models (e.g. 70B models that require low latency and are therefore run on the GPU).
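As a rough back-of-the-envelope sketch of the memory argument (weights only, ignoring activations, the KV cache, and quantization metadata such as scales/zero-points):

```python
# Rough weight-memory estimate for a 70B-parameter model at different precisions.
params = 70e9

for bits in (16, 8, 4):
    gib = params * bits / 8 / 2**30
    print(f"{bits:>2}-bit weights: ~{gib:.0f} GiB")

# 16-bit weights: ~130 GiB
#  8-bit weights: ~65 GiB
#  4-bit weights: ~33 GiB
```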
I'd be happy to contribute if someone could provide implementation guidance.
Hi @truenorth8. Thanks for your interest in contributing to deepsparse. There has actually been plenty of research into quantizing LLMs to 4 bits. Some of the most notable examples are GPTQ and its subsequent extension SparseGPT. These algorithms are available in the nightly version of SparseML, our library for sparsifying LLMs. Here's a link to the main entry point: https://github.com/neuralmagic/sparseml/blob/main/src/sparseml/transformers/sparsification/obcq/obcq.py.
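For anyone curious what "4-bit weight quantization" amounts to numerically, here is a minimal round-to-nearest sketch in plain NumPy. This is an illustration only, not the SparseML implementation: GPTQ/SparseGPT go further by compensating the rounding error column by column using second-order (Hessian) information, and the actual entry point is the obcq.py script linked above.

```python
import numpy as np

def quantize_4bit_symmetric(w: np.ndarray):
    """Naive per-row symmetric round-to-nearest 4-bit quantization (illustration only)."""
    qmax = 7  # use the symmetric signed-int4 range [-7, 7]
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

# Toy weight matrix: 4 output rows, 64 input columns.
w = np.random.randn(4, 64).astype(np.float32)
q, scale = quantize_4bit_symmetric(w)
w_hat = dequantize(q, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```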
Just to clarify, DeepSparse is our inference engine that supports fast execution of sparse and quantized models. At the moment DeepSparse supports CPUs only. However, you can use SparseML to produce compressed models and deploy them on any platform.
@anmarques will 4-bit quantization also come to YOLO models?
Hi @Fritskee, I don't see a strong motivation for <8-bit YOLO models, since going to lower precision mainly reduces weight memory usage, which is not the bottleneck for YOLO. When optimizing those architectures for performance, you want to reduce the size of the large activations (images) or reduce the compute needed to perform the convolutions. This is why for YOLO we apply 8-bit quantization and sparsity to reduce compute; see some models here: https://sparsezoo.neuralmagic.com/?modelSet=computer_vision&tasks=detection&architectures=yolov8
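To make that point concrete with hypothetical layer shapes (not taken from any specific YOLO model), the activation tensor of an early conv layer is typically far larger than its weight tensor, so halving weight precision barely moves total memory or compute:

```python
# Hypothetical early-layer shapes, chosen only to illustrate the ratio.
c_in, c_out, k = 64, 128, 3   # 3x3 conv, 64 -> 128 channels
h, w = 320, 320               # feature-map resolution
batch = 1

weights = c_in * c_out * k * k        # ~74K values
activations = batch * c_out * h * w   # ~13.1M values

print(f"weight values:     {weights:,}")
print(f"activation values: {activations:,}")
print(f"activations are ~{activations / weights:.0f}x larger than weights")
# Going from 8-bit to 4-bit weights saves only ~36 KB here, while the
# activation memory and the convolution compute are left untouched.
```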
Thanks for the explanation!
Hello @Fritskee, as there are no further comments here, I am going to go ahead and close out this issue. Feel free to re-open if you would like to continue the conversation. Regards, Jeannie / Neural Magic