Add Support for quantized (4-bit) Models
Please add support for quantized (4-bit) models so we can run models the way llama.cpp and alpaca.cpp do, which only require about 4 GB of GPU memory.
I don't know about 4-bit, but we should def support 8-bit. Want to do it?
Unfortunately I'm not a Python coder. Not sure if this will help: https://huggingface.co/blog/4bit-transformers-bitsandbytes
@geohot I can try adding LLM.int8() quantization into tinygrad, referring to https://arxiv.org/pdf/2208.07339.pdf and https://github.com/TimDettmers/bitsandbytes
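For reference, here's a minimal sketch of the vector-wise absmax int8 quantization at the core of LLM.int8() (leaving out the paper's mixed-precision outlier decomposition). It's plain numpy just to illustrate the math; the function names are hypothetical, not tinygrad or bitsandbytes API:

```python
import numpy as np

def quantize_absmax_int8(w: np.ndarray):
    # Per-row absmax scale: map [-absmax, absmax] onto the int8 range [-127, 127].
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    q = np.round(w / scale).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    # Recover an approximation of the original fp32 weights.
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    w = np.random.randn(4, 8).astype(np.float32)
    q, s = quantize_absmax_int8(w)
    w_hat = dequantize_int8(q, s)
    print("max abs error:", np.abs(w - w_hat).max())
```

Weights would be stored as int8 plus one fp scale per row, roughly quartering memory vs fp32; the matmul can either dequantize on the fly or be done in int8 with the scales applied afterwards.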
There's a bounty for uint8 LLaMA + an eval bench for LLMs