
Use exllamav2's smart 4-bit KV cache for memory benchmark

Interpause opened this issue 1 year ago · 1 comment

See: https://github.com/turboderp/exllamav2/blob/master/doc/qcache_eval.md

exllamav2 has a 4-bit KV cache that, per turboderp's testing, achieves perplexity similar to the unquantized cache. In practice, I find that exllamav2 uses less VRAM than llama.cpp for a given context size as a result. I noticed the exllamav2 benchmark code uses the unquantized cache. Would it be possible to use the 4-bit KV cache instead for the memory usage benchmark? Thanks.

For reference, here's the class to use instead: https://github.com/turboderp/exllamav2/blob/009424a6d42d39efceeecd5562450180bd34a7fb/exllamav2/cache.py#L309
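To illustrate why the 4-bit cache matters for a memory benchmark, here is a rough back-of-the-envelope estimate of KV cache size as a function of bytes per element. This is a generic sketch, not exllamav2's actual allocation logic: the model dimensions below are assumptions (typical of a Llama-2-7B-style model), and real 4-bit caches carry some extra overhead for quantization scales, which this ignores.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: float) -> float:
    """Estimate KV cache size: 2 tensors (K and V) per layer,
    each of shape (n_kv_heads, seq_len, head_dim)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed Llama-2-7B-like dimensions: 32 layers, 32 KV heads, head_dim 128.
seq_len = 4096
fp16 = kv_cache_bytes(32, 32, 128, seq_len, 2.0)   # float16 = 2 bytes/elem
q4 = kv_cache_bytes(32, 32, 128, seq_len, 0.5)     # 4-bit = 0.5 bytes/elem

print(f"fp16 KV cache: {fp16 / 1024**3:.2f} GiB")  # 2.00 GiB
print(f"4-bit KV cache: {q4 / 1024**3:.2f} GiB")   # 0.50 GiB
```

Under these assumptions, the 4-bit cache cuts per-context KV memory to roughly a quarter of float16, which is why the choice of cache class materially changes the memory benchmark results.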

Interpause avatar May 14 '24 07:05 Interpause

Based on this comment, we could also add an ExLlamaV2 float16 variant to the benchmarks. That needs to be looked into as well.

Anindyadeep avatar May 15 '24 06:05 Anindyadeep