Add Support for quantized (4-bit) Models
Please add support for quantized (4-bit) models so we can run models the way llama.cpp and alpaca.cpp do, which only require about 4 GB of GPU memory.
I don't know about 4-bit, but we should def support 8-bit. Want to do it?
Unfortunately I'm not a Python coder. Not sure if this will help: https://huggingface.co/blog/4bit-transformers-bitsandbytes
@geohot I can try adding LLM.int8() quantization into tinygrad, referring to https://arxiv.org/pdf/2208.07339.pdf and https://github.com/TimDettmers/bitsandbytes
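For reference, here's a minimal sketch of the vector-wise absmax int8 quantization at the core of LLM.int8() (leaving out the paper's mixed-precision outlier decomposition). It's plain numpy just to illustrate the math; the function names are hypothetical, not tinygrad or bitsandbytes API:

```python
import numpy as np

def quantize_absmax_int8(w: np.ndarray):
    # Per-row absmax scale: map [-absmax, absmax] onto the int8 range [-127, 127].
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Recover an approximation of the original fp32 weights.
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4, 8).astype(np.float32)
    q, s = quantize_absmax_int8(w)
    w_hat = dequantize_int8(q, s)
    print("max abs error:", np.abs(w - w_hat).max())
```

Weights would be stored as int8 plus one fp scale per row, roughly quartering memory vs fp32; the matmul can either dequantize on the fly or be done in int8 with the scales applied afterwards.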
There's a bounty for uint8 LLaMA + an eval bench for LLMs